From 869d3304cc3e5d2c3569d8ee8cd7738d980ba75f Mon Sep 17 00:00:00 2001 From: naush14 <33265876+naush14@users.noreply.github.com> Date: Thu, 22 Aug 2019 12:54:23 -0400 Subject: [PATCH 001/854] Update handling-data.tex Added some content, fixed typos. The chapter reads well and is easy to understand. I do not believe any pertinent info is missing but I will read it again and share my comments (if any) via email. Thanks --- chapters/handling-data.tex | 36 +++++++++++++++++++----------------- 1 file changed, 19 insertions(+), 17 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index d69e501b6..8a807ee20 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -8,20 +8,20 @@ asked by development researchers grow, so too does the (rightful) scrutiny under which methods and results are placed. This scrutiny involves two major components: data handling and analytical quality. -Performing at a high standard in both means that research participants +Performing at a high standard in both areas means that research participants are appropriately protected, and that consumers of research can have confidence in its conclusions. What we call ethical standards in this chapter is a set of practices for data privacy and research transparency that address these two components. -Their adoption is an objective measure of to judge a research product's performance in both. -Without these transparent measures of credibility, reputation is the primary signal for the quality of evidence, and two failures may occur: +Their adoption is an objective measure to judge a research product's performance in both. +Without these transparent measures of credibility, reputation becomes the primary signal for the quality of evidence, and hence, two failures may occur: low-quality studies from reputable sources may be used as evidence when in fact they don't warrant it, and high-quality studies from sources without an international reputation may be ignored. Both these outcomes reduce the quality of evidence overall. Even more importantly, they usually mean that credibility in development research accumulates at international institutions and top global universities instead of the people and places directly involved in and affected by it. -Simple transparency standards mean that it is easier to judge research quality, and making high-quality research identifiable also increases its impact. +Simple transparency standards mean that it is easier to judge research quality. Additionally, making high-quality research identifiable also increases its impact. This section provides some basic guidelines and resources for collecting, handling, and using field data ethically and responsibly to publish research findings. \end{fullwidth} @@ -49,16 +49,16 @@ \section{Protecting confidence in development research} so it is hard for others to verify that it was collected, handled, and analyzed appropriately. Maintaining confidence in research via the components of credibility, transparency, and replicability is the most important way that researchers can avoid serious error, -and therefore these are not by-products but core components of research output. +and therefore these principles are not by-products but core components of research output. 
\subsection{Research replicability} Replicable research, first and foremost, means that the actual analytical processes you used are executable by others.\cite{dafoe2014science} (We use ``replicable'' and ``reproducible'' somewhat interchangeably, -referring only to the code processes themselves in a specific study; +referring only to the code processes in a specific study; in other contexts they may have more specific meanings.\sidenote{\url{http://datacolada.org/76}}) -All your code files involving data construction and analysis +All your code files involving data cleaning, construction and analysis should be public -- nobody should have to guess what exactly comprises a given index, or what controls are included in your main regression, or whether or not you clustered standard errors correctly. @@ -76,17 +76,17 @@ \subsection{Research replicability} Secondly, reproducible research\sidenote{\url{https://dimewiki.worldbank.org/wiki/Reproducible_Research}} enables other researchers to re-utilize your code and processes -to do their own work more easily in the future. +to do their own work more easily and effectively in the future. This may mean applying your techniques to their data or implementing a similar structure in a different context. As a pure public good, this is nearly costless. The useful tools and standards you create will have high value to others. If you are personally or professionally motivated by citations, producing these kinds of resources will almost certainly lead to that as well. -Therefore, your code should be written neatly and published openly. +Therefore, your code should be written neatly with clear instructions and published openly. It should be easy to read and understand in terms of structure, style, and syntax. Finally, the corresponding dataset should be openly accessible -unless for legal or ethical reasons it cannot.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Publishing_Data}} +unless for legal or ethical reasons it cannot be.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Publishing_Data}} \subsection{Research transparency} @@ -94,13 +94,13 @@ \subsection{Research transparency} This means that readers be able to judge for themselves if the research was done well, and if the decision-making process was sound. If the research is well-structured, and all relevant documentation\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data_Documentation}} is shared, -this is as easy as possible for the reader to do. +this makes it as easy as possible for the reader to implement. This is also an incentive for researchers to make better decisions, -be skeptical about their assumptions, +be skeptical and thorough about their assumptions, and, as we hope to convince you, make the process easier for themselves, -because it requires methodical organization that is labor-saving over the complete course of a project. +because it requires methodical organization that is labor-saving and efficient over the complete course of a project. -\textbf{Registered Reports}\sidenote{\url{https://blogs.worldbank.org/impactevaluations/registered-reports-piloting-pre-results-review-process-journal-development-economics}} can help with this process where they are available. +\textbf{Registered Reports}\sidenote{\url{https://blogs.worldbank.org/impactevaluations/registered-reports-piloting-pre-results-review-process-journal-development-economics}} can help with this process where available. 
By setting up a large portion of the research design in advance,\sidenote{\url{https://www.bitss.org/2019/04/18/better-pre-analysis-plans-through-design-declaration-and-diagnosis/}} a great deal of work has already been completed, and at least some research questions are pre-committed for publication regardless of the outcome. @@ -110,11 +110,11 @@ \subsection{Research transparency} Documenting a project in detail greatly increases transparency. This means explicitly noting decisions as they are made, and explaining the process behind them. -Documentation on data processing and additional hypothesis tested will be expected in the supplemental materials to any publication. +Documentation on data processing and additional hypotheses tested will be expected in the supplemental materials to any publication. Careful documentation will also save the research team a lot of time during a project, as it prevents you to have the same discussion twice (or more!), since you have a record of why something was done in a particular way. -There is a number of available tools +There are a number of available tools that will contribute to producing documentation, \index{project documentation} but project documentation should always be an active and ongoing process, @@ -151,7 +151,8 @@ \subsection{Research credibility} all experimental and observational studies should be \textbf{pre-registered}\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Pre-Registration}} using the \textbf{AEA} database,\sidenote{\url{https://www.socialscienceregistry.org/}} -the \textbf{3ie} database,\sidenote{\url{http://ridie.3ieimpact.org/}}, +the \textbf{3ie} database,\sidenote{\url{http://ridie.3ieimpact.org/}}, +the \textbf{eGAP} database,\sidenote{\url{http://egap.org/content/registration/}}, or the \textbf{OSF} registry\sidenote{\url{https://osf.io/registries}} as appropriate.\sidenote{\url{http://datacolada.org/12}} \index{pre-registration} @@ -182,6 +183,7 @@ \section{Ensuring privacy and security in research} \index{geodata} such as email addresses, phone numbers, and financial information. \index{de-identification} +It is important to keep in mind these data privacy principles not only for the individual respondent but also the PII data of their household members or other caregivers who are covered under the survey. In some contexts this list may be more extensive -- for example, if you are working in a small environment, someone's age and gender may be sufficient to identify them From 528b504242db60e78cdf69e773373c5d0c5d7c3e Mon Sep 17 00:00:00 2001 From: RadhikaKaul <55540497+RadhikaKaul@users.noreply.github.com> Date: Tue, 24 Sep 2019 09:47:45 -0400 Subject: [PATCH 002/854] Implementing Caio's suggestion in issue #88 --- chapters/research-design.tex | 4 ---- 1 file changed, 4 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 8737131ec..caeb9d277 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -76,10 +76,6 @@ \section{Counterfactuals and treatment effects} We aren't even going to get into regression models here. Almost all experimental designs can be accurately described as a series of between-group comparisons.\sidenote{\url{http://nickchk.com/econ305.html}} -It means thinking carefully about how to transform and scale your data, -using fixed effects to extract ``within-group'' comparisons as needed, -and choosing estimators appropriate to your design. -As the saying goes, all models are wrong, but some are useful. 
The models you will construct and estimate are intended to do two things: to express the intention of your research design, and to help you group the potentially endless concepts of field reality From 91afc7d43a8ee48897ad77a636866ecd996f0bdf Mon Sep 17 00:00:00 2001 From: RadhikaKaul <55540497+RadhikaKaul@users.noreply.github.com> Date: Tue, 24 Sep 2019 10:37:02 -0400 Subject: [PATCH 003/854] Updating the DEC abbreviation --- chapters/abbreviations.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/abbreviations.tex b/chapters/abbreviations.tex index 49b48d6e3..b857dec72 100644 --- a/chapters/abbreviations.tex +++ b/chapters/abbreviations.tex @@ -6,7 +6,7 @@ \noindent\textbf{CI} -- Confidence Interval -\noindent\textbf{DEC} -- Development Economics +\noindent\textbf{DEC} -- Development Economics Group at the World Bank \noindent\textbf{DD or DiD} -- Differences-in-Differences From eedc3659af0ab440383b96c97349f00ec34bf73e Mon Sep 17 00:00:00 2001 From: RadhikaKaul <55540497+RadhikaKaul@users.noreply.github.com> Date: Tue, 24 Sep 2019 10:46:27 -0400 Subject: [PATCH 004/854] Changing a part of a sentence. --- chapters/introduction.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index 62a1c7341..591b037d5 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -54,7 +54,7 @@ \section{Doing credible research at scale} One important lesson we have learned from doing field work over this time is that the most overlooked parts of primary data work are reproducibility and collaboration. -You will be working with people +You may be working with people who have very different skillsets and mindsets than you, from and in a variety of cultures and contexts, and you will have to adopt workflows that everyone can agree upon, and that save time and hassle on every project. From e34a9c6b36f8aa9e7d07f15feef670c290eeb84b Mon Sep 17 00:00:00 2001 From: RadhikaKaul <55540497+RadhikaKaul@users.noreply.github.com> Date: Tue, 24 Sep 2019 10:55:36 -0400 Subject: [PATCH 005/854] Error in the line --- chapters/introduction.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index 62a1c7341..82041b88f 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -95,7 +95,7 @@ \section{Writing reproducible code in a collaborative environment} and therefore the tools that are used to do it are set in advance. Good code is easier to read and replicate, making it easier to spot mistakes. The resulting data contains substantially less noise -that is due to sampling, randomization, and cleaning errors. And all the data work can de easily reviewed before it's published and replicated afterwards. +that is due to sampling, randomization, and cleaning errors. And all the data work can be easily reviewed before it's published and replicated afterwards. Code is good when it is both correct and is a useful tool to whoever reads it. Most research assistants that join our unit have only been trained in how to code correctly. 
From 2f19f3cc7a997d65d242f4442b29046309e73aa4 Mon Sep 17 00:00:00 2001 From: RadhikaKaul <55540497+RadhikaKaul@users.noreply.github.com> Date: Tue, 24 Sep 2019 11:52:45 -0400 Subject: [PATCH 006/854] Typo in sentence --- chapters/handling-data.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index d69e501b6..39ddd98c7 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -37,7 +37,7 @@ \section{Protecting confidence in development research} Replicability is one key component of transparency. Transparency is necessary for consumers of research to be able to tell the quality of the research. -Without it, all evidence credibility comes from reputation, +Without it, all evidence of credibility comes from reputation, and it's unclear what that reputation is based on, since it's not transparent. Development researchers should take these concerns seriously. From 11206ac570a9316ecab37528595d3dc7e9afcaca Mon Sep 17 00:00:00 2001 From: RadhikaKaul <55540497+RadhikaKaul@users.noreply.github.com> Date: Tue, 24 Sep 2019 12:36:33 -0400 Subject: [PATCH 007/854] Deleted extra words in a sentence --- chapters/handling-data.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 39ddd98c7..dd894478f 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -190,7 +190,7 @@ \section{Ensuring privacy and security in research} to decide which pieces of information fall into this category.\sidenote{ \url{https://sdcpractice.readthedocs.io/en/latest/}} -Most of the field research done in development involves human subjects -- real people.\sidenote{ +Most of the field research done in development involves human subjects.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Human_Subjects_Approval}} \index{human subjects} As a researcher, you are asking people to trust you with personal information about themselves: From 50a6606d7149854fa7f1089708553fe4aaaef616 Mon Sep 17 00:00:00 2001 From: RadhikaKaul <55540497+RadhikaKaul@users.noreply.github.com> Date: Fri, 27 Sep 2019 10:32:53 -0400 Subject: [PATCH 008/854] Rephrasing the line --- chapters/introduction.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index 62a1c7341..32968a6df 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -40,8 +40,8 @@ \section{Doing credible research at scale} In the time that we have been working in the development field, the proportion of projects that rely on \textbf{primary data} has soared.\cite{angrist2017economic} -Today, the scope and scale of those projects continue to expand rapidly, -meaning that more and more people are working on the same data over longer and longer timeframes. +Today, the scope and scale of those projects continue to expand rapidly. +That is, more and more people are working on the same data over longer timeframes. 
This is because, while administrative datasets and \textbf{big data} have important uses, primary data\sidenote{\textbf{Primary data:} data collected from first-hand sources.} From e95bc97bf19615bd3b5584a84281328665a777fc Mon Sep 17 00:00:00 2001 From: RadhikaKaul <55540497+RadhikaKaul@users.noreply.github.com> Date: Fri, 27 Sep 2019 10:46:30 -0400 Subject: [PATCH 009/854] Rephrasing the line --- chapters/introduction.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index 32968a6df..335745087 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -64,8 +64,8 @@ \section{Doing credible research at scale} we have realized that we barely have the time to give everyone the attention they deserve. This book itself is therefore intended to be a vehicle to document our experiences and share it with with future DIME team members. -The \textbf{DIME Wiki} is one of our flagship resources for project teams, -as a free online collection of our resources and best practices.\sidenote{\url{http://dimewiki.worldbank.org/}} +The \textbf{DIME Wiki} is one of our flagship resource repository, designed for teams engaged in impact evaluation projects. +It is available as a free online collection of our resources and best practices.\sidenote{\url{http://dimewiki.worldbank.org/}} This book therefore complements the detailed-but-unstructured DIME Wiki with a guided tour of the major tasks that make up primary data collection.\sidenote{Like this: \url{https://dimewiki.worldbank.org/wiki/Primary_Data_Collection}} We will not give a lot of highly specific details in this text, From 81e364d777c6d623d0a88763a80188833eac0724 Mon Sep 17 00:00:00 2001 From: RadhikaKaul <55540497+RadhikaKaul@users.noreply.github.com> Date: Fri, 27 Sep 2019 10:51:28 -0400 Subject: [PATCH 010/854] Deleting a comma --- chapters/introduction.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index 335745087..6c5a6604d 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -93,7 +93,7 @@ \section{Writing reproducible code in a collaborative environment} Process standardization means that there is little ambiguity about how something ought to be done, and therefore the tools that are used to do it are set in advance. -Good code is easier to read and replicate, making it easier to spot mistakes. +Good code is easier to read and replicate making it easier to spot mistakes. The resulting data contains substantially less noise that is due to sampling, randomization, and cleaning errors. And all the data work can de easily reviewed before it's published and replicated afterwards. From 8ab8d06b11db725dcb35469e3c0fab39cedfa790 Mon Sep 17 00:00:00 2001 From: RadhikaKaul <55540497+RadhikaKaul@users.noreply.github.com> Date: Fri, 27 Sep 2019 10:53:53 -0400 Subject: [PATCH 011/854] Typo --- chapters/introduction.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index 6c5a6604d..db30f934f 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -95,7 +95,7 @@ \section{Writing reproducible code in a collaborative environment} and therefore the tools that are used to do it are set in advance. Good code is easier to read and replicate making it easier to spot mistakes. The resulting data contains substantially less noise -that is due to sampling, randomization, and cleaning errors. 
And all the data work can de easily reviewed before it's published and replicated afterwards. +that is due to sampling, randomization, and cleaning errors. And all the data work can be easily reviewed before it's published and replicated afterwards. Code is good when it is both correct and is a useful tool to whoever reads it. Most research assistants that join our unit have only been trained in how to code correctly. From 35f9092fd88ae59321e2e53bafe11d1e34a52a33 Mon Sep 17 00:00:00 2001 From: RadhikaKaul <55540497+RadhikaKaul@users.noreply.github.com> Date: Fri, 27 Sep 2019 11:52:29 -0400 Subject: [PATCH 012/854] Typos and syntax --- chapters/handling-data.tex | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index dd894478f..cc9ae6296 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -6,7 +6,7 @@ and these can have wide-reaching consequences on the lives of millions. As the range and importance of the policy-relevant questions asked by development researchers grow, -so too does the (rightful) scrutiny under which methods and results are placed. +so does the (rightful) scrutiny under which methods and results are placed. This scrutiny involves two major components: data handling and analytical quality. Performing at a high standard in both means that research participants are appropriately protected, @@ -45,7 +45,7 @@ \section{Protecting confidence in development research} and may utilize unique data or small samples. This approach opens the door to working with the development community to answer both specific programmatic questions and general research inquiries. -However, the data researchers utilize have never been reviewed by anyone else, +However, if what the data researchers utilize has never been reviewed by anyone else, so it is hard for others to verify that it was collected, handled, and analyzed appropriately. Maintaining confidence in research via the components of credibility, transparency, and replicability is the most important way that researchers can avoid serious error, @@ -69,7 +69,7 @@ \subsection{Research replicability} based on the valuable work you have already done. Services like GitHub that expose your code \textit{history} are also valuable resources. They can show things like modifications -made in response to referee comments; for another, they can show +made in response to referee comments; for someone else, they can show the research paths and questions you may have tried to answer (but excluded from publication) as a resource to others who have similar questions of their own data. @@ -86,7 +86,7 @@ \subsection{Research replicability} Therefore, your code should be written neatly and published openly. It should be easy to read and understand in terms of structure, style, and syntax. Finally, the corresponding dataset should be openly accessible -unless for legal or ethical reasons it cannot.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Publishing_Data}} +unless for legal or ethical reasons it is proprietary.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Publishing_Data}} \subsection{Research transparency} @@ -134,7 +134,7 @@ \subsection{Research transparency} but is less effective for file storage. Each project has its specificities, and the exact shape of this process can be molded to the team's needs, -but it should be agreed on prior to project launch. +but it should be agreed upon prior to project launch. 
This way, you can start building a project's documentation as soon as you start making decisions. \subsection{Research credibility} From b2fb0abdcdbee891cb2c842d466c242f3e029caf Mon Sep 17 00:00:00 2001 From: RadhikaKaul <55540497+RadhikaKaul@users.noreply.github.com> Date: Fri, 27 Sep 2019 12:46:20 -0400 Subject: [PATCH 013/854] Update planning-data-work.tex --- chapters/planning-data-work.tex | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 2ed38ae8a..f1c408afe 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -36,11 +36,9 @@ \section{Preparing your digital workspace} but thinking about simple things from a workflow perspective can help you make marginal improvements every day you work. -Teams often develop they workflows as they go, +Teams often develop their workflows as they go, solving new challenges when they appear. -This will always be necessary, -and new challenges will keep coming. -However, there are a number of tasks that will always have to be completed on any project. +However, there are a number of tasks that will always have to be completed during any project. These include organizing folders, collaborating on code, controlling different versions of a file, @@ -58,7 +56,7 @@ \subsection{Setting up your computer} that it is in good working order, and that you have a \textbf{password-protected} login. All machines that will handle personally-identifiable information should be encrypted; -this should be available built-in to most modern operating systems (BitLocker on PCs or FileVault on Macs). +this should be built-in to most modern operating systems (BitLocker on PCs or FileVault on Macs). Then, make sure your computer is backed up. Follow the \textbf{3-2-1 rule}: (3) copies of everything; @@ -108,7 +106,7 @@ \subsection{Folder management} For the purpose of this book, we're mainly interested in the folder that will store the project's data work. Agree with your team on a specific folder structure, and -set it up at the beginning of the research project, +set it up at the beginning of the research project to prevent folder re-organization that may slow down your workflow and, more importantly, prevent your code files from running. DIME Analytics created and maintains @@ -258,7 +256,7 @@ \subsection{Version control} \subsection{Output management} -One more thing to be discussed with your team is the best way to manage outputs. +Another task that needs to be discussed with your team is the best way to manage outputs. A great number of them will be created during the course of a project, from raw outputs such as tables and graphs to final products such as presentations, papers and reports. 
When the first outputs are being created, agree on where to store them, From 268cadb104cb2ded153ad117b7494f9012b6dd6f Mon Sep 17 00:00:00 2001 From: RadhikaKaul <55540497+RadhikaKaul@users.noreply.github.com> Date: Fri, 27 Sep 2019 13:31:11 -0400 Subject: [PATCH 014/854] Typos --- chapters/research-design.tex | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 8737131ec..81b2f803e 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -63,7 +63,7 @@ \section{Counterfactuals and treatment effects} do not attempt to create a predictive or comprehensive model of how the outcome of interest is generated -- typically we do not care about measures of fit or predictive accuracy -like R-squareds or root mean square errors. +like R-squared or root mean square errors. Instead, the econometric models desribed here aim to correctly describe the experimental design being used, so that the correct estimate of the difference @@ -109,7 +109,7 @@ \subsection{Cross-sectional RCTs} \textbf{Cross-sectional RCTs} are the simplest possible study design: a program is implemented, surveys are conducted, and data is analyzed. -The randomization process, as in all RCTs, +The randomization process draws the treatment and control groups from the same underlying population. This implies the groups' outcome means would be identical in expectation before intervention, and would have been identical at measurement -- @@ -184,7 +184,7 @@ \subsection{Regression discontinuity} In an RD design, there is a \textbf{running variable} which gives eligible people access to some program, and a strict cutoff determines who is included.\cite{lee2010regression} -This is ussally justified by budget limitations. +This is usually justified by budget limitations. The running variable should not be the outcome of interest, and while it can be time, that may require additional modeling assumptions. Those who qualify are given the intervention and those who don't are not; From ed6cebcdf6d755e3dffdf3a48ed30feae614efae Mon Sep 17 00:00:00 2001 From: RadhikaKaul <55540497+RadhikaKaul@users.noreply.github.com> Date: Fri, 27 Sep 2019 13:50:37 -0400 Subject: [PATCH 015/854] Rephrasing of sentence(s) Can be made shorter than what I have written. Also, should we say it in a separate sentence that the lack of these elements will make the code less accessible or something along the following lines? --- chapters/introduction.tex | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index db30f934f..e3c1fa78e 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -97,7 +97,10 @@ \section{Writing reproducible code in a collaborative environment} The resulting data contains substantially less noise that is due to sampling, randomization, and cleaning errors. And all the data work can be easily reviewed before it's published and replicated afterwards. -Code is good when it is both correct and is a useful tool to whoever reads it. +A good do-file consists of code that has two elements: +- that is correct (doesn't produce any errors along the way) +- that is useful and comprehensible to someone who hasn't seen it before (such that the person who +wrote this code isn't lost if they see this code three weeks after they've written it) Most research assistants that join our unit have only been trained in how to code correctly. 
While correct results are extremely important, we usually tell our new research assistants that \textit{when your code runs on your computer and you get the correct results then you are only half-done writing \underline{good} code.} @@ -125,8 +128,8 @@ \section{Writing reproducible code in a collaborative environment} it should not require arcane reverse-engineering to figure out what a code chunk is trying to do. \textbf{Style}, finally, is the way that the non-functional elements of your code convey its purpose. -Elements like spacing, indentation, and naming can make your code much more (or much less) -accessible to someone who is reading it for the first time and needs to understand it quickly and correctly. +Elements like spacing, indentation, and naming (or lack thereof) can make your code much more +(or much less) accessible to someone who is reading it for the first time and needs to understand it quickly and correctly. For some implementation portions where precise code is particularly important, we will provide minimal code examples either in the book or on the DIME Wiki. From 563cb752fb8de15c116c8775244900306ee4e608 Mon Sep 17 00:00:00 2001 From: RadhikaKaul <55540497+RadhikaKaul@users.noreply.github.com> Date: Fri, 27 Sep 2019 14:57:10 -0400 Subject: [PATCH 016/854] Restructuring the paragraph --- chapters/planning-data-work.tex | 23 ++++++++++------------- 1 file changed, 10 insertions(+), 13 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index f1c408afe..c21e886d1 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -350,16 +350,13 @@ \subsection{Preparing for collaboration and replication} When it comes to collaboration software,\sidenote{\url{https://dimewiki.worldbank.org/wiki/Collaboration_Tools}} the two most common softwares in use are Dropbox and GitHub.\sidenote{ \url{https://michaelstepner.com/blog/git-vs-dropbox/}} -GitHub issues are a great tool for task management, -and Dropbox Paper also provides a good interface with notifications. -Neither of these tools require much technical knowledge; -they merely require an agreement and workflow design -so that the people assigning the tasks -are sure to set them up in the system. -GitHub is useful because tasks can clearly be tied to file versions; -therefore it is useful for managing code-related tasks. -It also creates incentives for writing down why changes were made as they are saved, -creating naturally documented code. -Dropbox Paper is useful because tasks can be easily linked to other documents saved in Dropbox; -therefore it is useful for managing non-code-related tasks. -Our team uses both. +GitHub has the following features that are amazing for efficient workflow: +- The issues tab is a great tool for task management. +- It creates incentives for writing down why changes were made as they are saved, creating naturally documented code. +- It is useful also because tasks can clearly be tied to file versions. Thus, it serves as a great tool for +managing code-related tasks. + +On the other hand, Dropbox Paper provides a good interface with notifications. It is useful because tasks can be easily linked to other documents saved in Dropbox. Thus, it is a great tool for managing non-code-related tasks. + +Neither of these tools require much technical knowledge; they merely require an agreement and workflow design +so that the people assigning the tasks are sure to set them up in the system. Our team uses both. 
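The two elements of a good do-file listed in the introduction edits above, code that runs correctly and code that a reader who has never seen it can follow, are easier to picture with a small sketch. The lines below are purely illustrative: every file, variable, and label name is hypothetical rather than drawn from the book's own examples, and they only show how spacing, comments, and descriptive names carry intent.

    * Construct a household-level poverty indicator
    use "household_survey.dta", clear

    * Flag households below the poverty line, keeping missing values missing
    generate poor = (consumption_pc < poverty_line) if !missing(consumption_pc)
    label variable poor "Household below poverty line (1 = yes)"

    * Report the poverty rate by treatment arm
    tabstat poor, by(treatment) statistics(mean)

Written without the comments, the label, or meaningful names, the same lines would return identical results, which is exactly why correctness alone is only half of good code.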
From 08fdcfb9bb1c38e9d040e5651d069a4169fb4e84 Mon Sep 17 00:00:00 2001 From: RadhikaKaul <55540497+RadhikaKaul@users.noreply.github.com> Date: Fri, 27 Sep 2019 15:10:07 -0400 Subject: [PATCH 017/854] Bold Font for the concept I wanted to know if Encryption at rest is just one word and wanted to understand better how is it different from encryption in cloud storage. --- chapters/data-collection.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index d9c09d318..183764777 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -133,7 +133,7 @@ \section{Collecting data securely} (in tablet-assisted data collation) or your browser (in web data collection) until it reaches the server. -Encryption at rest is the only way to ensure +\textbf{Encryption at rest} is the only way to ensure that PII data remains private when it is stored on someone else's server on the open internet. Encryption makes data files completely unusable From 3770b6cf0f2085556a06454b50fd4cc0bee4ee28 Mon Sep 17 00:00:00 2001 From: RadhikaKaul <55540497+RadhikaKaul@users.noreply.github.com> Date: Fri, 27 Sep 2019 15:21:43 -0400 Subject: [PATCH 018/854] Bold font for two types of quality checks --- chapters/data-collection.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 183764777..355b54702 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -215,7 +215,7 @@ \section{Overseeing fieldwork and quality assurance} These are typically done in two main forms: high-frequency checks (HFCs) and back-checks. -High-frequency checks are carried out on the data side.\sidenote{\url{https://github.com/PovertyAction/high-frequency-checks/wiki}} +\textbf{High-frequency checks} are carried out on the data side.\sidenote{\url{https://github.com/PovertyAction/high-frequency-checks/wiki}} First, observations need to be checked for duplicate entries: \texttt{ieduplicates}\sidenote{\url{https://dimewiki.worldbank.org/wiki/ieduplicates}} provides a workflow for collaborating on the resolution of @@ -253,7 +253,7 @@ \section{Overseeing fieldwork and quality assurance} Some issues will need immediate follow-up, and it will be harder to solve them once the enumeration team leaves the area. -Back-checks\sidenote{\url{https://dimewiki.worldbank.org/wiki/Back_Checks}} +\textbf{Back-checks}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Back_Checks}} involve more extensive collaboration with the field team, and are best thought of as direct data audits. In back-checks, a random subset of the field sample is chosen From 23fdf479f6a7d514bba7a5142d1d05aff8ddfcd0 Mon Sep 17 00:00:00 2001 From: RadhikaKaul <55540497+RadhikaKaul@users.noreply.github.com> Date: Fri, 27 Sep 2019 15:34:18 -0400 Subject: [PATCH 019/854] Restructuring a sentence It still may seem a little off but would appreciate your thoughts on this sentence structuring. 
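The high-frequency checks described in the data-collection chapter above typically begin with a duplicates check on incoming submissions. A rough sketch of that first step using only built-in Stata commands follows; the dataset and variable names are hypothetical, and the ieduplicates workflow cited in the text is the fuller alternative for resolving duplicates collaboratively.

    * Duplicates check on incoming survey submissions
    use "raw_submissions.dta", clear

    * Tag any respondent ID that appears more than once
    duplicates tag respondent_id, generate(dup_flag)
    list respondent_id submissiondate if dup_flag > 0

    * Once duplicates are resolved, this should run without error
    isid respondent_id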
--- chapters/data-collection.tex | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 355b54702..94c2852ee 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -298,12 +298,12 @@ \section{Overseeing fieldwork and quality assurance} This should contain all the observations that were completed; it should merge perfectly with the received dataset; and it should report reasons for any observations not collected. -Reporting of \textbf{missingness} and \textbf{attrition} is critical -to the interpretation of any survey dataset, -so it is important to structure this reporting -such that broad rationales can be grouped into categories -but also that the field team can provide detailed, open-ended responses -for any observations they were unable to complete. +Identification and reporting of \textbf{missing data} and \textbf{attrition} is critical +to the interpretation of any survey dataset. +Thus, it is important to structure this reporting mechanism in a way to +not only group broad rationales into specific categories +but also collect all the detailed, open-ended responses to questions the field team can provide +for any observations that they were unable to complete. This reporting should be validated and saved alongside the final raw data, and treated the same way. This information should be stored as a dataset in its own right From 7b9cdafd29896fdfe9e45fc35e6789d112796844 Mon Sep 17 00:00:00 2001 From: RadhikaKaul <55540497+RadhikaKaul@users.noreply.github.com> Date: Fri, 27 Sep 2019 18:10:49 -0400 Subject: [PATCH 020/854] Restructuring the paragraph explaining overleaf --- chapters/publication.tex | 23 +++++++++++------------ 1 file changed, 11 insertions(+), 12 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index e105f020c..12d94d89a 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -55,30 +55,29 @@ \section{Collaborating on academic writing} for someone new to \LaTeX\ to be able to ``just write'' is often the web-based Overleaf suite.\sidenote{\url{https://www.overleaf.com}} Overleaf offers a \textbf{rich text editor} -that behaves pretty similarly to familiar tools like Word. -TeXstudio\sidenote{\url{https://www.texstudio.org}} and atom-latex\sidenote{\url{https://atom.io/packages/atom-latex}} -are two popular desktop-based tools for writing \LaTeX; -they allow more advanced integration with Git, -among other advantages, but the entire team needs to be comfortable -with \LaTeX\ before adopting one of these tools. +that behaves pretty similarly to familiar tools like Word. With minimal workflow adjustments, you can to show coauthors how to write and edit in Overleaf, so long as you make sure you are always available to troubleshoot -\LaTeX\ crashes and errors. -The most common issue will be special characters, namely +\LaTeX\ crashes and errors. It also offers a convenient selection of templates +so it is easy to start up a project +and replicate a lot of the underlying setup code. +One of the most common issues you will face while using Overleaf will be special characters, namely \texttt{\&}, \texttt{\%}, and \texttt{\_}, which need to be \textbf{escaped} (instructed to interpret literally) by writing a backslash (\texttt{\textbackslash}) before them, such as \texttt{40\textbackslash\%} for the percent sign to function. 
-Overleaf offers a convenient selection of templates -so it is easy to start up a project -and replicate a lot of the underlying setup code. -The main issue with using Overleaf is that you need to upload input files +Another issue is that you need to upload input files (such as figures and tables) manually. This can create conflicts when these inputs are still being updated -- namely, the main document not having the latest results. One solution is to move to Overleaf only once there will not be substantive changes to results. +Other popular desktop-based tools for writing \LaTeX are TeXstudio\sidenote{\url{https://www.texstudio.org}} and atom-latex\sidenote{\url{https://atom.io/packages/atom-latex}}. +They allow more advanced integration with Git, +among other advantages, but the entire team needs to be comfortable +with \LaTeX\ before adopting one of these tools. + One of the important tools available in \LaTeX\ is the BibTeX bibliography manager.\sidenote{\url{http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.365.3194&rep=rep1&type=pdf}} This tool stores unformatted references in an accompanying \texttt{.bib} file, From 51629be90a6e4cb9b182203af290e30206770584 Mon Sep 17 00:00:00 2001 From: RadhikaKaul <55540497+RadhikaKaul@users.noreply.github.com> Date: Fri, 27 Sep 2019 18:16:03 -0400 Subject: [PATCH 021/854] Both font for new topic --- chapters/publication.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 12d94d89a..a9496440c 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -78,7 +78,7 @@ \section{Collaborating on academic writing} among other advantages, but the entire team needs to be comfortable with \LaTeX\ before adopting one of these tools. -One of the important tools available in \LaTeX\ is the BibTeX bibliography manager.\sidenote{\url{http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.365.3194&rep=rep1&type=pdf}} +One of the important tools available in \LaTeX\ is the \textbf{BibTeX bibliography manager}.\sidenote{\url{http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.365.3194&rep=rep1&type=pdf}} This tool stores unformatted references in an accompanying \texttt{.bib} file, and \LaTeX\ then inserts them in text From 93827143a26414b3aecbab8738860bcc07f915ad Mon Sep 17 00:00:00 2001 From: Maria Date: Mon, 30 Sep 2019 17:05:01 -0400 Subject: [PATCH 022/854] Fix typos and small issues with wording --- chapters/introduction.tex | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index 62a1c7341..3dd33c2a3 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -30,7 +30,7 @@ \section{Doing credible research at scale} The team responsible for this book is known as \textbf{DIME Analytics}.\sidenote{ \url{http://www.worldbank.org/en/research/dime/data-and-analytics}} -The DIME Analytics team works within the \textbf{Development Impact Evaluation unit (DIME)}\sidenote{ +The DIME Analytics team works within the \textbf{Development Impact Evaluation (DIME)} group \sidenote{ \url{http://www.worldbank.org/en/research/dime}} at the World Bank's \textbf{Development Economics group (DEC)}.\sidenote{ \url{http://www.worldbank.org/en/research/}} @@ -94,10 +94,9 @@ \section{Writing reproducible code in a collaborative environment} little ambiguity about how something ought to be done, and therefore the tools that are used to do it are set in advance. 
Good code is easier to read and replicate, making it easier to spot mistakes. -The resulting data contains substantially less noise -that is due to sampling, randomization, and cleaning errors. And all the data work can de easily reviewed before it's published and replicated afterwards. +The resulting data contains substantially fewer sampling, randomization, and cleaning errors. And all the data work can de easily reviewed before it's published and replicated afterwards. -Code is good when it is both correct and is a useful tool to whoever reads it. +Code is good when it is both correct and easily understood by whoever reads it. Most research assistants that join our unit have only been trained in how to code correctly. While correct results are extremely important, we usually tell our new research assistants that \textit{when your code runs on your computer and you get the correct results then you are only half-done writing \underline{good} code.} @@ -114,7 +113,7 @@ \section{Writing reproducible code in a collaborative environment} To accomplish that, you should think of code in terms of three major elements: \textbf{structure}, \textbf{syntax}, and \textbf{style}. We always tell people to ``code as if a stranger would read it'', -since tomorrow, that stranger will be you. +from tomorrow, that stranger will be you. The \textbf{structure} is the environment your code lives in: good structure means that it is easy to find individual pieces of code that correspond to tasks. Good structure also means that functional blocks are sufficiently independent from each other From 897ef9d280045199233724b25dac036daf574b1c Mon Sep 17 00:00:00 2001 From: Maria Date: Mon, 30 Sep 2019 17:38:46 -0400 Subject: [PATCH 023/854] Fixed typos and small issues with word order --- chapters/handling-data.tex | 21 +++++++++------------ 1 file changed, 9 insertions(+), 12 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index d69e501b6..51eb1fb9f 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -6,24 +6,21 @@ and these can have wide-reaching consequences on the lives of millions. As the range and importance of the policy-relevant questions asked by development researchers grow, -so too does the (rightful) scrutiny under which methods and results are placed. -This scrutiny involves two major components: data handling and analytical quality. +so too does the (rightful) scrutiny of methods and results. +This scrutiny has two major components: data handling and analytical quality. Performing at a high standard in both means that research participants are appropriately protected, and that consumers of research can have confidence in its conclusions. -What we call ethical standards in this chapter is a set of practices for data privacy and research transparency that address these two components. +What we call ethical standards in this chapter is a set of practices for research transparency and data privacy that address these two components. Their adoption is an objective measure of to judge a research product's performance in both. Without these transparent measures of credibility, reputation is the primary signal for the quality of evidence, and two failures may occur: low-quality studies from reputable sources may be used as evidence when in fact they don't warrant it, and high-quality studies from sources without an international reputation may be ignored. Both these outcomes reduce the quality of evidence overall. 
-Even more importantly, they usually mean that credibility in development research -accumulates at international institutions and top global universities -instead of the people and places directly involved in and affected by it. +Even more importantly, they usually mean that credibility in development research accumulates at international institutions and top global universities instead of the people and places directly involved in and affected by it. Simple transparency standards mean that it is easier to judge research quality, and making high-quality research identifiable also increases its impact. -This section provides some basic guidelines and resources -for collecting, handling, and using field data ethically and responsibly to publish research findings. +This section provides some basic guidelines and resources for collecting, handling, and using field data ethically and responsibly to publish research findings. \end{fullwidth} %------------------------------------------------ @@ -33,7 +30,7 @@ \section{Protecting confidence in development research} The empirical revolution in development research \index{transparency}\index{credibility}\index{reproducibility} has led to increased public scrutiny of the reliability of research.\cite{rogers_2017} -Three major components make up this scrutiny: \textbf{credibility},\cite{ioannidis2017power} \textbf{transparency},\cite{christensen2018transparency} and \textbf{replicability}.\cite{duvendack2017meant} +Three major components make up this scrutiny: \textbf{replicability}.\cite{duvendack2017meant}, \textbf{transparency},\cite{christensen2018transparency} and \textbf{credibility},\cite{ioannidis2017power}. Replicability is one key component of transparency. Transparency is necessary for consumers of research to be able to tell the quality of the research. @@ -45,11 +42,11 @@ \section{Protecting confidence in development research} and may utilize unique data or small samples. This approach opens the door to working with the development community to answer both specific programmatic questions and general research inquiries. -However, the data researchers utilize have never been reviewed by anyone else, +However, the data that researchers utilize have never been reviewed by anyone else, so it is hard for others to verify that it was collected, handled, and analyzed appropriately. Maintaining confidence in research via the components of credibility, transparency, and replicability -is the most important way that researchers can avoid serious error, -and therefore these are not by-products but core components of research output. +is the most important way that researchers can avoid serious errors. +Therefore these are not by-products, but core components of research output. \subsection{Research replicability} From b40bdacb7bd2c7e9ff4e7a5219f69007d3c12a99 Mon Sep 17 00:00:00 2001 From: RadhikaKaul <55540497+RadhikaKaul@users.noreply.github.com> Date: Tue, 1 Oct 2019 10:04:41 -0400 Subject: [PATCH 024/854] Edited a sentence --- appendix/stata-guide.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index 1e288fb97..48645affa 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -54,7 +54,7 @@ \subsection{Understanding Stata code} is to constantly read helpfiles. So if there is a command that you do not understand in any of our code examples, for example \texttt{isid}, then write \texttt{help isid}, and the helpfile for the command \texttt{isid} will open. 
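In practice, reading a helpfile before using a command amounts to lines as simple as the following, run from the command window or at the top of a do-file. The dataset and ID variable here are hypothetical and only show the pattern.

    * Read the helpfile before using an unfamiliar command
    help isid

    * Then apply it: stop with an error if hhid does not uniquely identify rows
    use "household_survey.dta", clear
    isid hhid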
-We cannot emphasize too much how important we think it is that you get into the habit of reading helpfiles. +We cannot emphasize enough how important we think it is that you get into the habit of reading helpfiles. Sometimes, you will encounter code employing user-written commands, and you will not be able to read those helpfiles until you have installed the commands. From e06338397b1ce6430232e531740188ea56a5a556 Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 1 Oct 2019 13:10:03 -0400 Subject: [PATCH 025/854] Ch 2: Removed sentence Reputation is usually transparent in academia, and based on an assessment of publications' quality. Plus, the problem with reputation has already been very well discussed in the introduction --- chapters/handling-data.tex | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index dd894478f..5d8c2783c 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -37,8 +37,7 @@ \section{Protecting confidence in development research} Replicability is one key component of transparency. Transparency is necessary for consumers of research to be able to tell the quality of the research. -Without it, all evidence of credibility comes from reputation, -and it's unclear what that reputation is based on, since it's not transparent. +Without it, all evidence of credibility comes from reputation. Development researchers should take these concerns seriously. Many development research projects are purpose-built to cover specific questions, From 22435df83f3823b3a1c70779dfbcaaee32e0c5a6 Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 1 Oct 2019 13:16:55 -0400 Subject: [PATCH 026/854] Ch2: small addition --- chapters/handling-data.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 5d8c2783c..6750399c2 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -85,7 +85,7 @@ \subsection{Research replicability} Therefore, your code should be written neatly and published openly. It should be easy to read and understand in terms of structure, style, and syntax. Finally, the corresponding dataset should be openly accessible -unless for legal or ethical reasons it cannot.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Publishing_Data}} +unless for legal or ethical reasons it cannot (we will discuss more about this soon).\sidenote{\url{https://dimewiki.worldbank.org/wiki/Publishing_Data}} \subsection{Research transparency} From 991b0445ac99c959f4c60eb4e594c76612105775 Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 1 Oct 2019 13:17:37 -0400 Subject: [PATCH 027/854] Ch2 : small change in wording --- chapters/handling-data.tex | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 6750399c2..951a94f76 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -79,9 +79,9 @@ \subsection{Research replicability} This may mean applying your techniques to their data or implementing a similar structure in a different context. As a pure public good, this is nearly costless. -The useful tools and standards you create will have high value to others. -If you are personally or professionally motivated by citations, -producing these kinds of resources will almost certainly lead to that as well. 
+The useful tools and standards you create will have high value to others +(if you are personally or professionally motivated by citations, +producing these kinds of resources can lead to that as well). Therefore, your code should be written neatly and published openly. It should be easy to read and understand in terms of structure, style, and syntax. Finally, the corresponding dataset should be openly accessible From 4708bba3b41690c9e6ca55504406aac3930ad8a8 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Kristoffer=20Bj=C3=A4rkefur?= Date: Tue, 1 Oct 2019 14:35:53 -0400 Subject: [PATCH 028/854] Change title of ch 2 #150 --- manuscript.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/manuscript.tex b/manuscript.tex index 81302129f..39b2c04d5 100644 --- a/manuscript.tex +++ b/manuscript.tex @@ -31,7 +31,7 @@ \chapter{Introduction: Data for development impact} % The asterisk leaves out th % CHAPTER 1 %---------------------------------------------------------------------------------------- -\chapter{Handling data ethically} +\chapter{Handling data transparently and ethically} \label{ch:1} \input{chapters/handling-data.tex} From d42d09eabbb53b0cb9e987e225abbd97967ba001 Mon Sep 17 00:00:00 2001 From: kchahande <44029434+kchahande@users.noreply.github.com> Date: Tue, 1 Oct 2019 15:06:17 -0400 Subject: [PATCH 029/854] Corrected a typo --- appendix/stata-guide.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index 1e288fb97..b878ace8a 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -58,7 +58,7 @@ \subsection{Understanding Stata code} Sometimes, you will encounter code employing user-written commands, and you will not be able to read those helpfiles until you have installed the commands. -Two examples of these in our code are \texttt{reandtreat} or \texttt{ieboilstart}. +Two examples of these in our code are \texttt{randtreat} or \texttt{ieboilstart}. The most common place to distribute user-written commands for Stata is the Boston College Statistical Software Components (SSC) archive. In our code examples, we only use either Stata's built-in commands or commands available from the SSC archive. 
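Installing a user-written command from the SSC archive takes one line per package. The sketch below assumes an internet connection, and assumes that the ie- commands such as ieboilstart are installed through their collective SSC package, ietoolkit, while randtreat is distributed under its own name.

    * Install user-written commands from the SSC archive
    ssc install randtreat      // user-written command mentioned above
    ssc install ietoolkit      // package that provides ieboilstart and other ie- commands

    * Confirm that the helpfiles are now available
    help randtreat
    help ieboilstart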
So, if your installation of Stata does not recognize a command in our code, for example From 2b7e0f90cd8cb05750bf32901ea938840379a79f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Kristoffer=20Bj=C3=A4rkefur?= Date: Tue, 1 Oct 2019 15:22:55 -0400 Subject: [PATCH 030/854] [index] fix index #78 --- chapters/preamble.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/preamble.tex b/chapters/preamble.tex index 3f094644c..d6a39c86d 100644 --- a/chapters/preamble.tex +++ b/chapters/preamble.tex @@ -63,7 +63,7 @@ \newcommand{\blankpage}{\newpage\hbox{}\thispagestyle{empty}\newpage} % Command to insert a blank page -\usepackage{makeidx} % Used to generate the index +\usepackage{imakeidx} % Used to generate the index \makeindex % Generate the index which is printed at the end of the document %So we can use option FloatBarrier, which is similar to [H] but is an From 2d3276869206f35917c166398ed17aef8aee25df Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Kristoffer=20Bj=C3=A4rkefur?= Date: Tue, 1 Oct 2019 16:59:17 -0400 Subject: [PATCH 031/854] Add chapter numbers --- manuscript.tex | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/manuscript.tex b/manuscript.tex index 81302129f..94ea30295 100644 --- a/manuscript.tex +++ b/manuscript.tex @@ -31,7 +31,7 @@ \chapter{Introduction: Data for development impact} % The asterisk leaves out th % CHAPTER 1 %---------------------------------------------------------------------------------------- -\chapter{Handling data ethically} +\chapter{Chapter 1: Handling data ethically} \label{ch:1} \input{chapters/handling-data.tex} @@ -40,7 +40,7 @@ \chapter{Handling data ethically} % CHAPTER 2 %---------------------------------------------------------------------------------------- -\chapter{Planning data work before going to field} +\chapter{Chapter 2: Planning data work before going to field} \label{ch:2} \input{chapters/planning-data-work.tex} @@ -49,7 +49,7 @@ \chapter{Planning data work before going to field} % CHAPTER 3 %---------------------------------------------------------------------------------------- -\chapter{Designing research for causal inference} +\chapter{Chapter 3: Designing research for causal inference} \label{ch:3} \input{chapters/research-design.tex} @@ -58,7 +58,7 @@ \chapter{Designing research for causal inference} % CHAPTER 4 %---------------------------------------------------------------------------------------- -\chapter{Sampling, randomization, and power} +\chapter{Chapter 4: Sampling, randomization, and power} \label{ch:4} \input{chapters/sampling-randomization-power.tex} @@ -67,7 +67,7 @@ \chapter{Sampling, randomization, and power} % CHAPTER 5 %---------------------------------------------------------------------------------------- -\chapter{Collecting primary data} +\chapter{Chapter 5: Collecting primary data} \label{ch:5} \input{chapters/data-collection.tex} @@ -76,7 +76,7 @@ \chapter{Collecting primary data} % CHAPTER 6 %---------------------------------------------------------------------------------------- -\chapter{Analyzing survey data} +\chapter{Chapter 6: Analyzing survey data} \label{ch:6} \input{chapters/data-analysis.tex} @@ -85,7 +85,7 @@ \chapter{Analyzing survey data} % CHAPTER 7 %---------------------------------------------------------------------------------------- -\chapter{Publishing collaborative research} +\chapter{Chapter 7: Publishing collaborative research} \label{ch:7} \input{chapters/publication.tex} From a590a6a156200a48b366bdae8a07f03c978fdf1f Mon Sep 17 
00:00:00 2001 From: =?UTF-8?q?Kristoffer=20Bj=C3=A4rkefur?= Date: Tue, 1 Oct 2019 17:20:35 -0400 Subject: [PATCH 032/854] Bold section titles --- chapters/preamble.tex | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) diff --git a/chapters/preamble.tex b/chapters/preamble.tex index d6a39c86d..b3fe3defc 100644 --- a/chapters/preamble.tex +++ b/chapters/preamble.tex @@ -84,6 +84,26 @@ \FloatBarrier } +%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +% Customizing section/subsection titles +% https://tex.stackexchange.com/questions/96090/formatting-subsections-and-chapters-in-tufte-book + +% section format +\titleformat{\section}% +{\normalfont\Large\bfseries}% format applied to label+text +{}% label +{}% horizontal separation between label and title body +{}% before the title body +[]% after the title body + +% subsection format +\titleformat{\subsection}% +{\normalfont\large}% format applied to label+text +{}% label +{}% horizontal separation between label and title body +{}% before the title body +[]% after the title body + %---------------------------------------------------------------------------------------- % BOOK META-INFORMATION %---------------------------------------------------------------------------------------- From e33ac9ad8c7a2bc8be3f0d95e88aa362f44d2783 Mon Sep 17 00:00:00 2001 From: Luiza Date: Mon, 7 Oct 2019 21:41:46 -0400 Subject: [PATCH 033/854] Ch2 : small changes to GitHub description --- chapters/handling-data.tex | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 951a94f76..a239f6b74 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -125,11 +125,12 @@ \subsection{Research transparency} There are various software solutions for building documentation over time. The \textbf{Open Science Framework}\sidenote{\url{https://osf.io/}} provides one such solution, with integrated file storage, version histories, and collaborative wiki pages. -\textbf{GitHub}\sidenote{\url{https://github.com}} provides a transparent task management -platform,\sidenote{\url{https://dimewiki.worldbank.org/wiki/Getting_started_with_GitHub}} +\textbf{GitHub}\sidenote{\url{https://github.com}} provides a transparent documentation system,\sidenote{\url{https://dimewiki.worldbank.org/wiki/Getting_started_with_GitHub}} \index{task management}\index{GitHub} in addition to version histories and wiki pages. -It offers multiple different ways to document a project that can be adapted to different team and project dynamics, +It offers multiple different ways to justify changes and additions, +track and register discussions, and manage tasks. +It's a flexible tool that can be adapted to different team and project dynamics, but is less effective for file storage. Each project has its specificities, and the exact shape of this process can be molded to the team's needs, From 7962899927f3b51ca1c37490b01cc0b799f47af2 Mon Sep 17 00:00:00 2001 From: Luiza Date: Mon, 7 Oct 2019 22:44:18 -0400 Subject: [PATCH 034/854] Ch2: changes to intro --- chapters/planning-data-work.tex | 20 ++++++++++++-------- 1 file changed, 12 insertions(+), 8 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 2ed38ae8a..9ccdf2fb9 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -2,21 +2,25 @@ \begin{fullwidth} Preparation for data work begins long before you collect any data. -In order to be prepared for fieldwork, you need to know what you are getting into. 
-This means knowing which data sets you need, -how those data sets will stay organized and linked, -and what identifying information you will collect -for the different types and levels of data you'll observe. +In order to be prepared to work on the data you receive, +you need to know what you are getting into. +This means knowing which data sets and output you need at the end of the process, +how they will stay organized and linked, +what different types and levels of data you'll handle, +and how big and sensitive it will be. Identifying these details creates a \textbf{data map} for your project, giving you and your team a sense of how information resources should be organized. It's okay to update this map once the project is underway -- the point is that everyone knows what the plan is. Then, you must identify and prepare your tools and workflow. -All the tools we discuss here are designed to prepare you for collaboration and replication, -so that you can confidently manage tools and tasks on your computer. +Changing software and protocols half-way through a project can be costly and time-consuming, +so it's important to think ahead about decisions that may seem of little consequence +(think: creating a new folder and moving files into it). +This chapter will discuss some of often overlooked tools and processes that +will help prepare you for collaboration and replication. We will try to provide free, open-source, and platform-agnostic tools wherever possible, -and provide more detailed instructions for those with which we are familiar. +and point to more detailed instructions when relevant. However, most have a learning and adaptation process, meaning you will become most comfortable with each tool only by using it in real-world work. From 20f49e4e7ba20124040d981f83d7f29376fe743a Mon Sep 17 00:00:00 2001 From: Luiza Date: Mon, 7 Oct 2019 22:50:07 -0400 Subject: [PATCH 035/854] Ch2 : minor wording adjustments --- chapters/planning-data-work.tex | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 9ccdf2fb9..ada386824 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -40,7 +40,7 @@ \section{Preparing your digital workspace} but thinking about simple things from a workflow perspective can help you make marginal improvements every day you work. -Teams often develop they workflows as they go, +Teams often develop workflows as they go, solving new challenges when they appear. This will always be necessary, and new challenges will keep coming. @@ -120,7 +120,7 @@ \subsection{Folder management} as a part of our \texttt{ietoolkit} suite. This command sets up a standardized folder structure for what we call the \texttt{/DataWork/} folder.\sidenote{\url{https://dimewiki.worldbank.org/wiki/DataWork_Folder}} It includes folders for all the steps of a typical DIME project. -However, since each project will always have project-specific needs, +However, since each project will always have its own specific needs, we tried to make it as easy as possible to adapt when that is the case. The main advantage of having a universally standardized folder structure is that changing from one project to another requires less @@ -140,13 +140,13 @@ \subsection{Folder management} This is reflected in the master script in such a way that the only change necessary to run the entire code from a new computer is to change the path to the project folder. 
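To make this concrete, the sketch below illustrates the idea (it is not the book's actual \texttt{stata-master-dofile.do} template, and the user and folder names are hypothetical):

\begin{verbatim}
* Sketch of a master do-file header: the project root is set once per
* user, and all other paths are built from it, so running the project
* on a new computer only requires editing this block.
if c(username) == "researcher1" {
    global projectfolder "C:/Users/researcher1/Dropbox/ProjectName"
}
if c(username) == "researcher2" {
    global projectfolder "/Users/researcher2/Dropbox/ProjectName"
}

global dataWork "${projectfolder}/DataWork"
global rawData  "${dataWork}/Raw"
global dofiles  "${dataWork}/Dofiles"
global outputs  "${dataWork}/Output"

* Run the project from start to finish using only these globals
do "${dofiles}/cleaning.do"
do "${dofiles}/analysis.do"
\end{verbatim}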
-The code below shows how folder structure is reflected in a master do-file. +The code in \texttt{stata-master-dofile.do} how folder structure is reflected in a master do-file. \subsection{Code management} Once you start a project's data work, -the number of scripts, datasets and outputs that you have to manage will grow very quickly. +the number of scripts, datasets, and outputs that you have to manage will grow very quickly. This can get out of hand just as quickly, so it's important to organize your data work and follow best practices from the beginning. Adjustments will always be needed along the way, @@ -228,7 +228,7 @@ \subsection{Version control} A \textbf{version control system} is the way you manage the changes to any computer file. This is important, for example, for your team to be able to find the version of a presentation that you delivered to your donor, -but also to understand why the significance of your estimates has changed. +but also to understand why the significance level of your estimates has changed. Everyone who has ever encountered a file named something like \texttt{final\_report\_v5\_LJK\_KLE\_jun15.docx} can appreciate how useful such a system can be. Most file sharing solutions offer some level of version control. @@ -244,8 +244,8 @@ \subsection{Version control} to be created and stored separately in GitHub. Nearly all code and outputs (except datasets) are better managed this way. Code is written in its native language, -and increasingly, written outputs such as reports, -presentations and documentations can be written using different \textbf{literate programming} +and it's becoming more and more common for written outputs such as reports, +presentations and documentations to be written using different \textbf{literate programming} tools such as {\LaTeX} and dynamic documents. You should therefore feel comfortable having both a project folder and a code folder. Their structures can be managed in parallel by using \texttt{iefolder} twice. @@ -311,7 +311,7 @@ \subsection{Output management} and Ben Jann's \texttt{webdoc}\sidenote{\url{http://repec.sowi.unibe.ch/stata/webdoc/index.html}} and \texttt{texdoc}.\sidenote{\url{http://repec.sowi.unibe.ch/stata/texdoc/index.html}} -Whichever options you decide to use, +Whichever options you chose, agree with your team on what tools will be used for what outputs, and where they will be stored before you start creating them. Take into account ease of use for different team members, but @@ -323,7 +323,7 @@ \subsection{Output management} you will need to make changes to your outputs quite frequently. And anyone who has tried to recreate a graph after a few months probably knows that it can be hard to remember where you saved the code that created it. -Here, naming conventions and code organization play a key role in not re-writing script again and again. +Here, naming conventions and code organization play a key role in not re-writing scripts again and again. Use intuitive and descriptive names when you save your code. It's often desirable to have the names of your outputs and scripts linked, so, for example, \texttt{merge.do} creates \texttt{merged.dta}. @@ -362,10 +362,10 @@ \subsection{Preparing for collaboration and replication} they merely require an agreement and workflow design so that the people assigning the tasks are sure to set them up in the system. 
-GitHub is useful because tasks can clearly be tied to file versions; +GitHub allows tasks to be clearly tied to file versions; therefore it is useful for managing code-related tasks. It also creates incentives for writing down why changes were made as they are saved, creating naturally documented code. -Dropbox Paper is useful because tasks can be easily linked to other documents saved in Dropbox; +Dropbox Paper allows for tasks to be easily linked to other documents saved in Dropbox; therefore it is useful for managing non-code-related tasks. Our team uses both. From 3abd903670cc868ad81292a1a6d334d8ce131598 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 8 Oct 2019 11:17:17 -0400 Subject: [PATCH 036/854] Apply suggestions from code review --- chapters/handling-data.tex | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 2b160a9d7..a88fe635d 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -76,13 +76,13 @@ \subsection{Research replicability} This may mean applying your techniques to their data or implementing a similar structure in a different context. As a pure public good, this is nearly costless. -The useful tools and standards you create will have high value to others -(if you are personally or professionally motivated by citations, -producing these kinds of resources can lead to that as well). +The useful tools and standards you create will have high value to others. +If you are personally or professionally motivated by citations, +producing these kinds of resources can lead to that as well. Therefore, your code should be written neatly and published openly. It should be easy to read and understand in terms of structure, style, and syntax. Finally, the corresponding dataset should be openly accessible -unless for legal or ethical reasons it cannot (we will discuss more about this soon).\sidenote{\url{https://dimewiki.worldbank.org/wiki/Publishing_Data}} +unless for legal or ethical reasons it cannot.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Publishing_Data}} \subsection{Research transparency} @@ -125,7 +125,7 @@ \subsection{Research transparency} \textbf{GitHub}\sidenote{\url{https://github.com}} provides a transparent documentation system,\sidenote{\url{https://dimewiki.worldbank.org/wiki/Getting_started_with_GitHub}} \index{task management}\index{GitHub} in addition to version histories and wiki pages. -It offers multiple different ways to justify changes and additions, +It offers multiple different ways to record the decision process leading to changes and additions, track and register discussions, and manage tasks. It's a flexible tool that can be adapted to different team and project dynamics, but is less effective for file storage. From 32849d00227b631b2b2ac43ad17afb0ea70b259b Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 8 Oct 2019 11:25:48 -0400 Subject: [PATCH 037/854] Update chapters/data-collection.tex --- chapters/data-collection.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 94c2852ee..4233d56e1 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -300,7 +300,7 @@ \section{Overseeing fieldwork and quality assurance} and it should report reasons for any observations not collected. Identification and reporting of \textbf{missing data} and \textbf{attrition} is critical to the interpretation of any survey dataset. 
-Thus, it is important to structure this reporting mechanism in a way to +It is important to structure this reporting in a way that not only group broad rationales into specific categories but also collect all the detailed, open-ended responses to questions the field team can provide for any observations that they were unable to complete. From b96475ebc883321ca04bc1fd0cbbfe5d418e5a40 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 8 Oct 2019 11:26:25 -0400 Subject: [PATCH 038/854] Apply suggestions from code review --- chapters/data-collection.tex | 4 ++-- chapters/handling-data.tex | 2 +- chapters/introduction.tex | 4 ++-- chapters/planning-data-work.tex | 4 ++-- chapters/research-design.tex | 2 +- 5 files changed, 8 insertions(+), 8 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 4233d56e1..a933aa187 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -301,8 +301,8 @@ \section{Overseeing fieldwork and quality assurance} Identification and reporting of \textbf{missing data} and \textbf{attrition} is critical to the interpretation of any survey dataset. It is important to structure this reporting in a way that -not only group broad rationales into specific categories -but also collect all the detailed, open-ended responses to questions the field team can provide +not only group broads rationales into specific categories +but also collects all the detailed, open-ended responses to questions the field team can provide for any observations that they were unable to complete. This reporting should be validated and saved alongside the final raw data, and treated the same way. diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index cc7966320..fb0fa4056 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -83,7 +83,7 @@ \subsection{Research replicability} Therefore, your code should be written neatly and published openly. It should be easy to read and understand in terms of structure, style, and syntax. Finally, the corresponding dataset should be openly accessible -unless for legal or ethical reasons it is proprietary.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Publishing_Data}} +unless for legal or ethical reasons it cannot.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Publishing_Data}} \subsection{Research transparency} diff --git a/chapters/introduction.tex b/chapters/introduction.tex index d3aad4a78..780781f9e 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -41,7 +41,7 @@ \section{Doing credible research at scale} In the time that we have been working in the development field, the proportion of projects that rely on \textbf{primary data} has soared.\cite{angrist2017economic} Today, the scope and scale of those projects continue to expand rapidly. -That is, more and more people are working on the same data over longer timeframes. +More and more people are working on the same data over longer timeframes. This is because, while administrative datasets and \textbf{big data} have important uses, primary data\sidenote{\textbf{Primary data:} data collected from first-hand sources.} @@ -64,7 +64,7 @@ \section{Doing credible research at scale} we have realized that we barely have the time to give everyone the attention they deserve. This book itself is therefore intended to be a vehicle to document our experiences and share it with with future DIME team members. 
-The \textbf{DIME Wiki} is one of our flagship resource repository, designed for teams engaged in impact evaluation projects. +The \textbf{DIME Wiki} is one of our flagship resources designed for teams engaged in impact evaluation projects. It is available as a free online collection of our resources and best practices.\sidenote{\url{http://dimewiki.worldbank.org/}} This book therefore complements the detailed-but-unstructured DIME Wiki with a guided tour of the major tasks that make up primary data collection.\sidenote{Like this: \url{https://dimewiki.worldbank.org/wiki/Primary_Data_Collection}} diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index c21e886d1..866553897 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -350,8 +350,8 @@ \subsection{Preparing for collaboration and replication} When it comes to collaboration software,\sidenote{\url{https://dimewiki.worldbank.org/wiki/Collaboration_Tools}} the two most common softwares in use are Dropbox and GitHub.\sidenote{ \url{https://michaelstepner.com/blog/git-vs-dropbox/}} -GitHub has the following features that are amazing for efficient workflow: -- The issues tab is a great tool for task management. +GitHub has the following features that are useful for efficient workflows: +- The Issues tab is a great tool for task management. - It creates incentives for writing down why changes were made as they are saved, creating naturally documented code. - It is useful also because tasks can clearly be tied to file versions. Thus, it serves as a great tool for managing code-related tasks. diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 81b2f803e..2c87891bb 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -63,7 +63,7 @@ \section{Counterfactuals and treatment effects} do not attempt to create a predictive or comprehensive model of how the outcome of interest is generated -- typically we do not care about measures of fit or predictive accuracy -like R-squared or root mean square errors. +like R-squared values or root mean square errors. Instead, the econometric models desribed here aim to correctly describe the experimental design being used, so that the correct estimate of the difference From 7d4be83ee55f16c6c43cba5905672470131ec38d Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Kristoffer=20Bj=C3=A4rkefur?= Date: Tue, 8 Oct 2019 11:32:58 -0400 Subject: [PATCH 039/854] Remove any lfs settings --- .gitattributes | 2 -- 1 file changed, 2 deletions(-) delete mode 100644 .gitattributes diff --git a/.gitattributes b/.gitattributes deleted file mode 100644 index 6d3944176..000000000 --- a/.gitattributes +++ /dev/null @@ -1,2 +0,0 @@ - -*.png filter=lfs diff=lfs merge=lfs -text From 1e86ce126c02bba2a67c97aefa3dcf2131675610 Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 8 Oct 2019 15:01:34 -0400 Subject: [PATCH 040/854] Revert "Ch2 : minor wording adjustments" This reverts commit 20f49e4e7ba20124040d981f83d7f29376fe743a. --- chapters/planning-data-work.tex | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index ada386824..9ccdf2fb9 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -40,7 +40,7 @@ \section{Preparing your digital workspace} but thinking about simple things from a workflow perspective can help you make marginal improvements every day you work. 
-Teams often develop workflows as they go, +Teams often develop they workflows as they go, solving new challenges when they appear. This will always be necessary, and new challenges will keep coming. @@ -120,7 +120,7 @@ \subsection{Folder management} as a part of our \texttt{ietoolkit} suite. This command sets up a standardized folder structure for what we call the \texttt{/DataWork/} folder.\sidenote{\url{https://dimewiki.worldbank.org/wiki/DataWork_Folder}} It includes folders for all the steps of a typical DIME project. -However, since each project will always have its own specific needs, +However, since each project will always have project-specific needs, we tried to make it as easy as possible to adapt when that is the case. The main advantage of having a universally standardized folder structure is that changing from one project to another requires less @@ -140,13 +140,13 @@ \subsection{Folder management} This is reflected in the master script in such a way that the only change necessary to run the entire code from a new computer is to change the path to the project folder. -The code in \texttt{stata-master-dofile.do} how folder structure is reflected in a master do-file. +The code below shows how folder structure is reflected in a master do-file. \subsection{Code management} Once you start a project's data work, -the number of scripts, datasets, and outputs that you have to manage will grow very quickly. +the number of scripts, datasets and outputs that you have to manage will grow very quickly. This can get out of hand just as quickly, so it's important to organize your data work and follow best practices from the beginning. Adjustments will always be needed along the way, @@ -228,7 +228,7 @@ \subsection{Version control} A \textbf{version control system} is the way you manage the changes to any computer file. This is important, for example, for your team to be able to find the version of a presentation that you delivered to your donor, -but also to understand why the significance level of your estimates has changed. +but also to understand why the significance of your estimates has changed. Everyone who has ever encountered a file named something like \texttt{final\_report\_v5\_LJK\_KLE\_jun15.docx} can appreciate how useful such a system can be. Most file sharing solutions offer some level of version control. @@ -244,8 +244,8 @@ \subsection{Version control} to be created and stored separately in GitHub. Nearly all code and outputs (except datasets) are better managed this way. Code is written in its native language, -and it's becoming more and more common for written outputs such as reports, -presentations and documentations to be written using different \textbf{literate programming} +and increasingly, written outputs such as reports, +presentations and documentations can be written using different \textbf{literate programming} tools such as {\LaTeX} and dynamic documents. You should therefore feel comfortable having both a project folder and a code folder. Their structures can be managed in parallel by using \texttt{iefolder} twice. @@ -311,7 +311,7 @@ \subsection{Output management} and Ben Jann's \texttt{webdoc}\sidenote{\url{http://repec.sowi.unibe.ch/stata/webdoc/index.html}} and \texttt{texdoc}.\sidenote{\url{http://repec.sowi.unibe.ch/stata/texdoc/index.html}} -Whichever options you chose, +Whichever options you decide to use, agree with your team on what tools will be used for what outputs, and where they will be stored before you start creating them. 
Take into account ease of use for different team members, but @@ -323,7 +323,7 @@ \subsection{Output management} you will need to make changes to your outputs quite frequently. And anyone who has tried to recreate a graph after a few months probably knows that it can be hard to remember where you saved the code that created it. -Here, naming conventions and code organization play a key role in not re-writing scripts again and again. +Here, naming conventions and code organization play a key role in not re-writing script again and again. Use intuitive and descriptive names when you save your code. It's often desirable to have the names of your outputs and scripts linked, so, for example, \texttt{merge.do} creates \texttt{merged.dta}. @@ -362,10 +362,10 @@ \subsection{Preparing for collaboration and replication} they merely require an agreement and workflow design so that the people assigning the tasks are sure to set them up in the system. -GitHub allows tasks to be clearly tied to file versions; +GitHub is useful because tasks can clearly be tied to file versions; therefore it is useful for managing code-related tasks. It also creates incentives for writing down why changes were made as they are saved, creating naturally documented code. -Dropbox Paper allows for tasks to be easily linked to other documents saved in Dropbox; +Dropbox Paper is useful because tasks can be easily linked to other documents saved in Dropbox; therefore it is useful for managing non-code-related tasks. Our team uses both. From 4d38fee8217aee0d5dc231654d85949527f35246 Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 8 Oct 2019 15:06:22 -0400 Subject: [PATCH 041/854] Ch 2: minor wording arrangements --- chapters/planning-data-work.tex | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 9ccdf2fb9..1bd122be8 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -120,7 +120,7 @@ \subsection{Folder management} as a part of our \texttt{ietoolkit} suite. This command sets up a standardized folder structure for what we call the \texttt{/DataWork/} folder.\sidenote{\url{https://dimewiki.worldbank.org/wiki/DataWork_Folder}} It includes folders for all the steps of a typical DIME project. -However, since each project will always have project-specific needs, +However, since each project will always have its own needs, we tried to make it as easy as possible to adapt when that is the case. The main advantage of having a universally standardized folder structure is that changing from one project to another requires less @@ -140,13 +140,13 @@ \subsection{Folder management} This is reflected in the master script in such a way that the only change necessary to run the entire code from a new computer is to change the path to the project folder. -The code below shows how folder structure is reflected in a master do-file. +The code \texttt{stata-master-dofile.do} how folder structure is reflected in a master do-file. \subsection{Code management} Once you start a project's data work, -the number of scripts, datasets and outputs that you have to manage will grow very quickly. +the number of scripts, datasets, and outputs that you have to manage will grow very quickly. This can get out of hand just as quickly, so it's important to organize your data work and follow best practices from the beginning. 
Adjustments will always be needed along the way, @@ -228,7 +228,7 @@ \subsection{Version control} A \textbf{version control system} is the way you manage the changes to any computer file. This is important, for example, for your team to be able to find the version of a presentation that you delivered to your donor, -but also to understand why the significance of your estimates has changed. +but also to understand why the significance level of your estimates has changed. Everyone who has ever encountered a file named something like \texttt{final\_report\_v5\_LJK\_KLE\_jun15.docx} can appreciate how useful such a system can be. Most file sharing solutions offer some level of version control. @@ -244,8 +244,8 @@ \subsection{Version control} to be created and stored separately in GitHub. Nearly all code and outputs (except datasets) are better managed this way. Code is written in its native language, -and increasingly, written outputs such as reports, -presentations and documentations can be written using different \textbf{literate programming} +and it's becoming more and more common for written outputs such as reports, +presentations and documentations to be written using different \textbf{literate programming} tools such as {\LaTeX} and dynamic documents. You should therefore feel comfortable having both a project folder and a code folder. Their structures can be managed in parallel by using \texttt{iefolder} twice. @@ -311,7 +311,7 @@ \subsection{Output management} and Ben Jann's \texttt{webdoc}\sidenote{\url{http://repec.sowi.unibe.ch/stata/webdoc/index.html}} and \texttt{texdoc}.\sidenote{\url{http://repec.sowi.unibe.ch/stata/texdoc/index.html}} -Whichever options you decide to use, +Whichever options you choose, agree with your team on what tools will be used for what outputs, and where they will be stored before you start creating them. Take into account ease of use for different team members, but @@ -323,7 +323,7 @@ \subsection{Output management} you will need to make changes to your outputs quite frequently. And anyone who has tried to recreate a graph after a few months probably knows that it can be hard to remember where you saved the code that created it. -Here, naming conventions and code organization play a key role in not re-writing script again and again. +Here, naming conventions and code organization play a key role in not re-writing scripts again and again. Use intuitive and descriptive names when you save your code. It's often desirable to have the names of your outputs and scripts linked, so, for example, \texttt{merge.do} creates \texttt{merged.dta}. From 94aaae5e5d2c636c41eab243fc0e2968c3a76938 Mon Sep 17 00:00:00 2001 From: bbdaniels Date: Tue, 8 Oct 2019 17:49:04 -0400 Subject: [PATCH 042/854] Update introduction --- chapters/handling-data.tex | 36 +++++++++++++++++++++++------------- 1 file changed, 23 insertions(+), 13 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index a11958b77..7e34a20b1 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -8,19 +8,29 @@ asked by development researchers grow, so does the (rightful) scrutiny under which methods and results are placed. This scrutiny involves two major components: data handling and analytical quality. -Performing at a high standard in both means that research participants -are appropriately protected, -and that consumers of research can have confidence in its conclusions. 
- -What we call ethical standards in this chapter is a set of practices for research transparency and data privacy that address these two components. -Their adoption is an objective measure of to judge a research product's performance in both. -Without these transparent measures of credibility, reputation is the primary signal for the quality of evidence, and two failures may occur: +Performing at a high standard in both means that +consumers of research can have confidence in its conclusions, +and that research participants are appropriately protected. +What we call ethical standards in this chapter is a set of practices +for research quality and data privacy that address these two principles. + +Neither quality nor privacy is an ``all-or-nothing'' objective. +We expect that teams will do as much as they can to make their work +conform to modern practices of credibility, transparency, and replicability. +Similarly, we expect that teams will ensure the privacy of participants in research +by intelligently assessing and proactively averting risks they might face. +We also expect teams will report what they have and have not done +in order to provide objective measures of a research product's performance in both. +Without these transparent measures of quality, reputation is the primary signal for the quality of evidence, and two failures may occur: low-quality studies from reputable sources may be used as evidence when in fact they don't warrant it, and high-quality studies from sources without an international reputation may be ignored. Both these outcomes reduce the quality of evidence overall. -Even more importantly, they usually mean that credibility in development research accumulates at international institutions and top global universities instead of the people and places directly involved in and affected by it. -Simple transparency standards mean that it is easier to judge research quality, and making high-quality research identifiable also increases its impact. -This section provides some basic guidelines and resources for collecting, handling, and using field data ethically and responsibly to publish research findings. +Even more importantly, they usually mean that credibility in development research accumulates at international institutions +and top global universities instead of the people and places directly involved in and affected by it. +Simple transparency standards mean that it is easier to judge research quality, +and making high-quality research identifiable also increases its impact. +This section provides some basic guidelines and resources + for collecting, handling, and using field data ethically and responsibly to publish research findings. \end{fullwidth} %------------------------------------------------ @@ -126,8 +136,8 @@ \subsection{Research transparency} \textbf{GitHub}\sidenote{\url{https://github.com}} provides a transparent documentation system,\sidenote{\url{https://dimewiki.worldbank.org/wiki/Getting_started_with_GitHub}} \index{task management}\index{GitHub} in addition to version histories and wiki pages. -It offers multiple different ways to record the decision process leading to changes and additions, -track and register discussions, and manage tasks. +It offers multiple different ways to record the decision process leading to changes and additions, +track and register discussions, and manage tasks. It's a flexible tool that can be adapted to different team and project dynamics, but is less effective for file storage. 
Each project has its specificities, @@ -149,7 +159,7 @@ \subsection{Research credibility} all experimental and observational studies should be \textbf{pre-registered}\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Pre-Registration}} using the \textbf{AEA} database,\sidenote{\url{https://www.socialscienceregistry.org/}} -the \textbf{3ie} database,\sidenote{\url{http://ridie.3ieimpact.org/}}, +the \textbf{3ie} database,\sidenote{\url{http://ridie.3ieimpact.org/}}, the \textbf{eGAP} database,\sidenote{\url{http://egap.org/content/registration/}}, or the \textbf{OSF} registry\sidenote{\url{https://osf.io/registries}} as appropriate.\sidenote{\url{http://datacolada.org/12}} \index{pre-registration} From ef54f665466ba41d15f8f9b9c8fa5d43f5469783 Mon Sep 17 00:00:00 2001 From: bbdaniels Date: Tue, 8 Oct 2019 17:50:55 -0400 Subject: [PATCH 043/854] Soften GitHub language (#162) --- chapters/handling-data.tex | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 7e34a20b1..251600bdd 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -30,7 +30,7 @@ Simple transparency standards mean that it is easier to judge research quality, and making high-quality research identifiable also increases its impact. This section provides some basic guidelines and resources - for collecting, handling, and using field data ethically and responsibly to publish research findings. +for using field data ethically and responsibly to publish research findings. \end{fullwidth} %------------------------------------------------ @@ -74,10 +74,9 @@ \subsection{Research replicability} if any or all of these things were to be done slightly differently.\cite{simmons2011false,wicherts2016degrees} Letting people play around with your data and code is a great way to have new questions asked and answered based on the valuable work you have already done. -Services like GitHub that expose your code \textit{history} -are also valuable resources. They can show things like modifications -made in response to referee comments; for someone else, they can show -the research paths and questions you may have tried to answer +Services like GitHub that expose your code development process are valuable resources here. +They can show things like modifications made in response to referee comments. +They can also document the research paths and questions you may have tried to answer (but excluded from publication) as a resource to others who have similar questions of their own data. From 367bdba9e43493c79bcad8d2ef284b870b45eec5 Mon Sep 17 00:00:00 2001 From: bbdaniels Date: Tue, 8 Oct 2019 17:52:40 -0400 Subject: [PATCH 044/854] Code can be identifying (#161) --- chapters/handling-data.tex | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 251600bdd..11cac766c 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -66,7 +66,8 @@ \subsection{Research replicability} referring only to the code processes in a specific study; in other contexts they may have more specific meanings.\sidenote{\url{http://datacolada.org/76}}) All your code files involving data cleaning, construction and analysis -should be public -- nobody should have to guess what exactly comprises a given index, +should be public (unless they contain identifying information). 
+Nobody should have to guess what exactly comprises a given index, or what controls are included in your main regression, or whether or not you clustered standard errors correctly. That is, as a purely technical matter, nobody should have to ``just trust you'', From 16dcfea51491e3c56d26687d83a091e92db3ddbc Mon Sep 17 00:00:00 2001 From: bbdaniels Date: Tue, 8 Oct 2019 17:54:24 -0400 Subject: [PATCH 045/854] Enable secure data collection (#160) --- chapters/handling-data.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 11cac766c..97f57b6d4 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -219,7 +219,7 @@ \section{Ensuring privacy and security in research} \index{encryption} during data collection, storage, and transfer. \index{data transfer}\index{data storage} -Most modern data collection software makes the first part straightforward.\sidenote{\url{https://dimewiki.worldbank.org/wiki/SurveyCTO_Form_Settings}} +Most modern data collection software has features that, if enabled, make the first part straightforward.\sidenote{\url{https://dimewiki.worldbank.org/wiki/SurveyCTO_Form_Settings}} However, secure storage and transfer are your responsibility.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data_Security}} There are plenty of options available to keep your data safe, at different prices, from enterprise-grade solutions to combined free options. From bf8ee98fb22eb74876cd618886ac27f82bdb7986 Mon Sep 17 00:00:00 2001 From: bbdaniels Date: Tue, 8 Oct 2019 18:04:13 -0400 Subject: [PATCH 046/854] Writing. Remove "replicability". (#145) --- chapters/handling-data.tex | 35 ++++++++++++++++------------------- 1 file changed, 16 insertions(+), 19 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 97f57b6d4..bd4a4053c 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -16,7 +16,7 @@ Neither quality nor privacy is an ``all-or-nothing'' objective. We expect that teams will do as much as they can to make their work -conform to modern practices of credibility, transparency, and replicability. +conform to modern practices of credibility, transparency, and reproducibility. Similarly, we expect that teams will ensure the privacy of participants in research by intelligently assessing and proactively averting risks they might face. We also expect teams will report what they have and have not done @@ -40,31 +40,28 @@ \section{Protecting confidence in development research} The empirical revolution in development research \index{transparency}\index{credibility}\index{reproducibility} has led to increased public scrutiny of the reliability of research.\cite{rogers_2017} -Three major components make up this scrutiny: \textbf{replicability}.\cite{duvendack2017meant}, \textbf{transparency},\cite{christensen2018transparency} and \textbf{credibility},\cite{ioannidis2017power}. -Replicability is one key component of transparency. -Transparency is necessary for consumers of research -to be able to tell the quality of the research. +Three major components make up this scrutiny: \textbf{reproducibility}.\cite{duvendack2017meant}, \textbf{transparency},\cite{christensen2018transparency} and \textbf{credibility},\cite{ioannidis2017power}. +Reproducibility is one key component of transparency. +Transparency is necessary for consumers of research products +to be able to determine the quality of the research process and the value of the evidence. 
Without it, all evidence of credibility comes from reputation, and it's unclear what that reputation is based on, since it's not transparent. -Development researchers should take these concerns seriously. -Many development research projects are purpose-built to cover specific questions, -and may utilize unique data or small samples. -This approach opens the door to working with the development community -to answer both specific programmatic questions and general research inquiries. -However, the data that researchers utilize have never been reviewed by anyone else, +Development researchers should take these concerns particularly seriously. +Many development research projects are purpose-built to address specific questions, +and often use unique data or small samples. +This approach opens the door to working closely with the broader development community +to answer specific programmatic questions and general research inquiries. +However, almost by definition, primary data that researchers use have never been reviewed by anyone else, so it is hard for others to verify that it was collected, handled, and analyzed appropriately. -Maintaining confidence in research via the components of credibility, transparency, and replicability +Maintaining confidence in research via the components of credibility, transparency, and reproducibility is the most important way that researchers can avoid serious error, and therefore these principles are not by-products but core components of research output. -\subsection{Research replicability} +\subsection{Research reproduciblity} -Replicable research, first and foremost, -means that the actual analytical processes you used are executable by others.\cite{dafoe2014science} -(We use ``replicable'' and ``reproducible'' somewhat interchangeably, -referring only to the code processes in a specific study; -in other contexts they may have more specific meanings.\sidenote{\url{http://datacolada.org/76}}) +Reproducible research means that the actual analytical processes you used are executable by others.\cite{dafoe2014science} +(We use ``reproducibility'' to refer to the code processes in a specific study.\sidenote{\url{http://datacolada.org/76}}) All your code files involving data cleaning, construction and analysis should be public (unless they contain identifying information). Nobody should have to guess what exactly comprises a given index, @@ -82,7 +79,7 @@ \subsection{Research replicability} as a resource to others who have similar questions of their own data. Secondly, reproducible research\sidenote{\url{https://dimewiki.worldbank.org/wiki/Reproducible_Research}} -enables other researchers to re-utilize your code and processes +enables other researchers to re-use your code and processes to do their own work more easily and effectively in the future. This may mean applying your techniques to their data or implementing a similar structure in a different context. From 95e4a4c536f90719e4780342e7404d46182462ca Mon Sep 17 00:00:00 2001 From: bbdaniels Date: Tue, 8 Oct 2019 18:10:33 -0400 Subject: [PATCH 047/854] Pre-registration and PAP (#154) --- chapters/handling-data.tex | 19 ++++++++++++------- 1 file changed, 12 insertions(+), 7 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index bd4a4053c..a36958c1d 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -21,7 +21,7 @@ by intelligently assessing and proactively averting risks they might face. 
We also expect teams will report what they have and have not done in order to provide objective measures of a research product's performance in both. -Without these transparent measures of quality, reputation is the primary signal for the quality of evidence, and two failures may occur: +Otherwise, reputation is the primary signal for the quality of evidence, and two failures may occur: low-quality studies from reputable sources may be used as evidence when in fact they don't warrant it, and high-quality studies from sources without an international reputation may be ignored. Both these outcomes reduce the quality of evidence overall. @@ -52,13 +52,14 @@ \section{Protecting confidence in development research} and often use unique data or small samples. This approach opens the door to working closely with the broader development community to answer specific programmatic questions and general research inquiries. -However, almost by definition, primary data that researchers use have never been reviewed by anyone else, +However, almost by definition, +primary data that researchers use for such studies has never been reviewed by anyone else, so it is hard for others to verify that it was collected, handled, and analyzed appropriately. Maintaining confidence in research via the components of credibility, transparency, and reproducibility -is the most important way that researchers can avoid serious error, -and therefore these principles are not by-products but core components of research output. +is the most important way that researchers using primary data can avoid serious error, +and therefore these are not by-products but core components of research output. -\subsection{Research reproduciblity} +\subsection{Research reproducibility} Reproducible research means that the actual analytical processes you used are executable by others.\cite{dafoe2014science} (We use ``reproducibility'' to refer to the code processes in a specific study.\sidenote{\url{http://datacolada.org/76}}) @@ -94,7 +95,8 @@ \subsection{Research reproduciblity} \subsection{Research transparency} -Transparent research will expose not only the code, but all the processes involved in establishing credibility to the public.\sidenote{\url{http://www.princeton.edu/~mjs3/open_and_reproducible_opr_2017.pdf}} +Transparent research will expose not only the code, +but all the processes involved in establishing credibility to the public.\sidenote{\url{http://www.princeton.edu/~mjs3/open_and_reproducible_opr_2017.pdf}} This means that readers be able to judge for themselves if the research was done well, and if the decision-making process was sound. If the research is well-structured, and all relevant documentation\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data_Documentation}} is shared, @@ -149,12 +151,15 @@ \subsection{Research credibility} Were the key research outcomes pre-specified or chosen ex-post? How sensitive are the results to changes in specifications or definitions? 
Tools such as \textbf{pre-analysis plans}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Pre-Analysis_Plan}} -are important to assuage these concerns for experimental evaluations, +can be used to assuage these concerns for experimental evaluations \index{pre-analysis plan} +by fully specifying some set of analysis intended to be conducted, but they may feel like ``golden handcuffs'' for other types of research.\cite{olken2015promises} Regardless of whether or not a formal pre-analysis plan is utilized, all experimental and observational studies should be \textbf{pre-registered}\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Pre-Registration}} +simply to create a record of the fact that the study was undertaken. +This is increasingly required by publishers and can be done very quickly using the \textbf{AEA} database,\sidenote{\url{https://www.socialscienceregistry.org/}} the \textbf{3ie} database,\sidenote{\url{http://ridie.3ieimpact.org/}}, the \textbf{eGAP} database,\sidenote{\url{http://egap.org/content/registration/}}, From 59d7a5ae91c25e50af0a7a3eaba228eb036338d6 Mon Sep 17 00:00:00 2001 From: bbdaniels Date: Tue, 8 Oct 2019 18:13:39 -0400 Subject: [PATCH 048/854] PAP/RR (#153) --- chapters/handling-data.tex | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index a36958c1d..cb8870a62 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -106,13 +106,19 @@ \subsection{Research transparency} and, as we hope to convince you, make the process easier for themselves, because it requires methodical organization that is labor-saving and efficient over the complete course of a project. -\textbf{Registered Reports}\sidenote{\url{https://blogs.worldbank.org/impactevaluations/registered-reports-piloting-pre-results-review-process-journal-development-economics}} can help with this process where available. +\textbf{Registered Reports}\sidenote{\url{https://blogs.worldbank.org/impactevaluations/registered-reports-piloting-pre-results-review-process-journal-development-economics}} +can help with this process where they are available. By setting up a large portion of the research design in advance,\sidenote{\url{https://www.bitss.org/2019/04/18/better-pre-analysis-plans-through-design-declaration-and-diagnosis/}} -a great deal of work has already been completed, +a great deal of analytical planning has already been completed, and at least some research questions are pre-committed for publication regardless of the outcome. This is meant to combat the ``file-drawer problem'',\cite{simonsohn2014p} and ensure that researchers are transparent in the additional sense that all the results obtained from registered studies are actually published. +In no way should this be viewed as binding the hands of the researcher: +anything outside the original plan is just as interesting and valuable +as it would have been if the the plan was never published; +but having pre-committed to any particular inquiry makes its results +immune to a wide range of criticisms of specification searching or multiple testing. Documenting a project in detail greatly increases transparency. This means explicitly noting decisions as they are made, and explaining the process behind them. 
From 55a5df98adb70be08d769891f8123ac58c7e8ea7 Mon Sep 17 00:00:00 2001 From: bbdaniels Date: Fri, 11 Oct 2019 13:14:59 -0400 Subject: [PATCH 049/854] Incorporate blog materials (#144) --- bibliography.bib | 7 +++++++ chapters/handling-data.tex | 30 ++++++++++++++++++++++++++++++ 2 files changed, 37 insertions(+) diff --git a/bibliography.bib b/bibliography.bib index 3a2272318..44d5cc2c6 100644 --- a/bibliography.bib +++ b/bibliography.bib @@ -1,3 +1,10 @@ +@techreport{galiani2017incentives, + title={Incentives for replication in economics}, + author={Galiani, Sebastian and Gertler, Paul and Romero, Mauricio}, + year={2017}, + institution={National Bureau of Economic Research} +} + @article{blischak2016quick, title={A quick introduction to version control with {Git} and {GitHub}}, author={Blischak, John D and Davenport, Emily R and Wilson, Greg}, diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index cb8870a62..20e67ae88 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -93,6 +93,23 @@ \subsection{Research reproducibility} Finally, the corresponding dataset should be openly accessible unless for legal or ethical reasons it cannot be.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Publishing_Data}} +Reproducibility and transparency are not binary concepts: +there’s a spectrum, starting with simple materials release. +But even getting that first stage right is a challenge. +An analysis of 203 empirical papers published in top economics journals in 2016 +showed that less than 1 in 7 provided all the data and code +needed to assess computational reproducibility.\cite{galiani2017incentives} +A scan of the 90,000 datasets on the Harvard Dataverse +found that only 10% have the necessary files and documentation +for computational reproducibility +(and a check of 3,000 of those that met requirements +found that 85% did not replicate). +Longer-term goals to meet reproducibility and transparency standards +include making tools for research transparency part and parcel +of the quest for efficiency gains in the research production function. +People seem to systematically underestimate the benefits +and overestimate the costs to adopting modern research practices. + \subsection{Research transparency} Transparent research will expose not only the code, @@ -248,3 +265,16 @@ \section{Ensuring privacy and security in research} (most modern operating systems provide such a tool). This means that even if you lose your computer with identifying data in it, anyone who gets hold of it still cannot access the information. + +Complete data publication, unlike reproducibility checks, +brings along with it a set of serious privacy concerns, +particularly when sensitive data is used in key analyses. +There are a number of tools developed to help researchers de-identify data +(\texttt{PII_detection}\sidenote{\url{https://github.com/PovertyAction/PII_detection}} from IPA, +\texttt{PII_scan}\sidenote{\url{https://github.com/J-PAL/PII-Scan}} from JPAL, +and \texttt{sdcMicro}\sidenote{\url{https://sdcpractice.readthedocs.io/en/latest/sdcMicro.html}} from the World Bank). +But is it ever possible to fully protect privacy in an era of big data? +One option is to add noise to data, as the US Census has proposed, +as it makes the trade-off between data accuracy and privacy explicit. +But there are no established norms for such “differential privacy” approaches: +most approaches fundamentally rely on judging “how harmful” disclosure would be. 
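As a concrete, if crude, illustration of the kind of scan these de-identification tools automate, one can search variable names and labels for likely identifiers (the keyword list below is purely illustrative):

\begin{verbatim}
* Crude first pass at flagging potential PII: search variable names and
* labels for common identifier keywords. Dedicated tools such as
* PII_detection, PII-Scan, or sdcMicro apply far more thorough rules.
foreach keyword in name phone address email gps latitude longitude {
    lookfor `keyword'
}
\end{verbatim}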
From 4d2cb3fb434ba5c27be0a7da3cfc8aa06296c20e Mon Sep 17 00:00:00 2001 From: bbdaniels Date: Fri, 11 Oct 2019 13:19:11 -0400 Subject: [PATCH 050/854] Corrections --- chapters/handling-data.tex | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 20e67ae88..adf093ee2 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -103,7 +103,7 @@ \subsection{Research reproducibility} found that only 10% have the necessary files and documentation for computational reproducibility (and a check of 3,000 of those that met requirements -found that 85% did not replicate). +found that 85\% did not replicate). Longer-term goals to meet reproducibility and transparency standards include making tools for research transparency part and parcel of the quest for efficiency gains in the research production function. @@ -270,8 +270,8 @@ \section{Ensuring privacy and security in research} brings along with it a set of serious privacy concerns, particularly when sensitive data is used in key analyses. There are a number of tools developed to help researchers de-identify data -(\texttt{PII_detection}\sidenote{\url{https://github.com/PovertyAction/PII_detection}} from IPA, -\texttt{PII_scan}\sidenote{\url{https://github.com/J-PAL/PII-Scan}} from JPAL, +(\texttt{PII\_detection}\sidenote{\url{https://github.com/PovertyAction/PII\_detection}} from IPA, +\texttt{PII-scan}\sidenote{\url{https://github.com/J-PAL/PII-Scan}} from JPAL, and \texttt{sdcMicro}\sidenote{\url{https://sdcpractice.readthedocs.io/en/latest/sdcMicro.html}} from the World Bank). But is it ever possible to fully protect privacy in an era of big data? One option is to add noise to data, as the US Census has proposed, From 9e19be97f8f44275c5517b522008044dd0d5944b Mon Sep 17 00:00:00 2001 From: bbdaniels Date: Fri, 11 Oct 2019 13:42:38 -0400 Subject: [PATCH 051/854] Incorporate #132 --- chapters/handling-data.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index adf093ee2..3b7a0b536 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -113,7 +113,7 @@ \subsection{Research reproducibility} \subsection{Research transparency} Transparent research will expose not only the code, -but all the processes involved in establishing credibility to the public.\sidenote{\url{http://www.princeton.edu/~mjs3/open_and_reproducible_opr_2017.pdf}} +but all the other research processes involved in developing the analytical approach.\sidenote{\url{http://www.princeton.edu/~mjs3/open_and_reproducible_opr_2017.pdf}} This means that readers be able to judge for themselves if the research was done well, and if the decision-making process was sound. If the research is well-structured, and all relevant documentation\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data_Documentation}} is shared, @@ -125,7 +125,7 @@ \subsection{Research transparency} \textbf{Registered Reports}\sidenote{\url{https://blogs.worldbank.org/impactevaluations/registered-reports-piloting-pre-results-review-process-journal-development-economics}} can help with this process where they are available. 
-By setting up a large portion of the research design in advance,\sidenote{\url{https://www.bitss.org/2019/04/18/better-pre-analysis-plans-through-design-declaration-and-diagnosis/}} +By pre-specifying a large portion of the research design,\sidenote{\url{https://www.bitss.org/2019/04/18/better-pre-analysis-plans-through-design-declaration-and-diagnosis/}} a great deal of analytical planning has already been completed, and at least some research questions are pre-committed for publication regardless of the outcome. This is meant to combat the ``file-drawer problem'',\cite{simonsohn2014p} From 19dbce5b2bf18eb9968a473ffb50c6e8f169fb48 Mon Sep 17 00:00:00 2001 From: bbdaniels Date: Fri, 11 Oct 2019 13:51:09 -0400 Subject: [PATCH 052/854] Minor comments #118, #117, #110, #109, #108, #107, #106 --- chapters/handling-data.tex | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 3b7a0b536..abf051081 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -116,9 +116,9 @@ \subsection{Research transparency} but all the other research processes involved in developing the analytical approach.\sidenote{\url{http://www.princeton.edu/~mjs3/open_and_reproducible_opr_2017.pdf}} This means that readers be able to judge for themselves if the research was done well, and if the decision-making process was sound. -If the research is well-structured, and all relevant documentation\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data_Documentation}} is shared, -this makes it as easy as possible for the reader to implement. -This is also an incentive for researchers to make better decisions, +If the research is well-structured, and all of the relevant documentation\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data_Documentation}} is shared, +this makes it as easy as possible for the reader to implement the same analysis. +Expecting process transparency is also an incentive for researchers to make better decisions, be skeptical and thorough about their assumptions, and, as we hope to convince you, make the process easier for themselves, because it requires methodical organization that is labor-saving and efficient over the complete course of a project. From ebaee99cba9c10132589286750dc52c803e6b777 Mon Sep 17 00:00:00 2001 From: bbdaniels Date: Fri, 11 Oct 2019 13:51:32 -0400 Subject: [PATCH 053/854] Etc --- chapters/handling-data.tex | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index abf051081..e8de76b2e 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -74,8 +74,10 @@ \subsection{Research reproducibility} Letting people play around with your data and code is a great way to have new questions asked and answered based on the valuable work you have already done. Services like GitHub that expose your code development process are valuable resources here. -They can show things like modifications made in response to referee comments. -They can also document the research paths and questions you may have tried to answer +Such services can show things like modifications made in response to referee comments, +by having tagged version histories at each major revision. 
+These services can also use issue trackers and abandoned work branches +to document the research paths and questions you may have tried to answer (but excluded from publication) as a resource to others who have similar questions of their own data. @@ -150,7 +152,7 @@ \subsection{Research transparency} not a one-time requirement or retrospective task. New decisions are always being made as the plan begins contact with reality, and there is nothing wrong with sensible adaptation so long as it is recorded and disclosed. -(Email is \texttt{not} a note-taking service.) +(Email is \texttt{not} a note-taking service, because communications are rarely well-ordered and easy to delete.) There are various software solutions for building documentation over time. The \textbf{Open Science Framework}\sidenote{\url{https://osf.io/}} provides one such solution, @@ -218,8 +220,9 @@ \section{Ensuring privacy and security in research} \index{de-identification} It is important to keep in mind these data privacy principles not only for the individual respondent but also the PII data of their household members or other caregivers who are covered under the survey. In some contexts this list may be more extensive -- -for example, if you are working in a small environment, -someone's age and gender may be sufficient to identify them +for example, if you are working in an environment that is either small, specific, +or has extensive linkable data sources available to others, +information like someone's age and gender may be sufficient to identify them even though these would not be considered PII in a larger context. Therefore you will have to use careful judgment in each case to decide which pieces of information fall into this category.\sidenote{ @@ -256,6 +259,8 @@ \section{Ensuring privacy and security in research} In general, though, you shouldn't need to handle PII data very often. Once data is securely collected and stored, the first thing you will generally do is \textbf{de-identify} it.\sidenote{\url{https://dimewiki.worldbank.org/wiki/De-identification}} \index{de-identification} +(We will provide more detail on this in the chapter on data collection.) +This will create a working copy that can safely be shared among collaborators. De-identified data should avoid, for example, you being sent back to every household to alert them that someone dropped all their personal information on a public bus and we don't know who has it. This simply means creating a copy of the data that contains no personally-identifiable information. 
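One minimal way to implement that working-copy step in Stata is sketched below; the variable and folder names are hypothetical, and it assumes the project has already assigned an anonymous ID to each respondent.

\begin{verbatim}
* Illustrative sketch only: variable and folder names are hypothetical,
* and it assumes a project-assigned anonymous ID (respondent_id) exists.
use "raw-survey.dta", clear

* Keep the link between identifiers and the anonymous ID in a separate
* file that never leaves encrypted storage.
preserve
    keep respondent_id name phone_number gps_latitude gps_longitude
    save "encrypted/id-key.dta", replace
restore

* The shared working copy retains only the anonymous ID and research data.
drop name phone_number gps_latitude gps_longitude
label data "De-identified working copy"
save "deidentified-survey.dta", replace
\end{verbatim}

Splitting the identifiers into a key file kept in encrypted storage means the de-identified copy can circulate within the research team without exposing any PII.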
From 61d9ebda3d36c264935fe2599dbf7c74b930d298 Mon Sep 17 00:00:00 2001 From: bbdaniels Date: Fri, 11 Oct 2019 14:00:17 -0400 Subject: [PATCH 054/854] Additional credibility discussion --- bibliography.bib | 7 +++++++ chapters/handling-data.tex | 37 ++++++++++++++++++++++++++++++------- 2 files changed, 37 insertions(+), 7 deletions(-) diff --git a/bibliography.bib b/bibliography.bib index 44d5cc2c6..823bc39a7 100644 --- a/bibliography.bib +++ b/bibliography.bib @@ -1,3 +1,10 @@ +@article{simonsohn2015specification, + title={Specification curve: Descriptive and inferential statistics on all reasonable specifications}, + author={Simonsohn, Uri and Simmons, Joseph P and Nelson, Leif D}, + journal={Available at SSRN 2694998}, + year={2015} +} + @techreport{galiani2017incentives, title={Incentives for replication in economics}, author={Galiani, Sebastian and Gertler, Paul and Romero, Mauricio}, diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index e8de76b2e..e435c2765 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -70,7 +70,7 @@ \subsection{Research reproducibility} or whether or not you clustered standard errors correctly. That is, as a purely technical matter, nobody should have to ``just trust you'', nor should they have to bother you to find out what happens -if any or all of these things were to be done slightly differently.\cite{simmons2011false,wicherts2016degrees} +if any or all of these things were to be done slightly differently.\cite{simmons2011false,simonsohn2015specification,wicherts2016degrees} Letting people play around with your data and code is a great way to have new questions asked and answered based on the valuable work you have already done. Services like GitHub that expose your code development process are valuable resources here. @@ -125,6 +125,7 @@ \subsection{Research transparency} and, as we hope to convince you, make the process easier for themselves, because it requires methodical organization that is labor-saving and efficient over the complete course of a project. +Tools like pre-registration, pre-analysis plans, and \textbf{Registered Reports}\sidenote{\url{https://blogs.worldbank.org/impactevaluations/registered-reports-piloting-pre-results-review-process-journal-development-economics}} can help with this process where they are available. By pre-specifying a large portion of the research design,\sidenote{\url{https://www.bitss.org/2019/04/18/better-pre-analysis-plans-through-design-declaration-and-diagnosis/}} @@ -140,6 +141,9 @@ \subsection{Research transparency} immune to a wide range of criticisms of specification searching or multiple testing. Documenting a project in detail greatly increases transparency. +Many disciplines have a tradition of keeping a ``lab notebook'', +and adapting and expanding this process for the development +of lab-style working groups in development is a critical step. This means explicitly noting decisions as they are made, and explaining the process behind them. Documentation on data processing and additional hypotheses tested will be expected in the supplemental materials to any publication. Careful documentation will also save the research team a lot of time during a project, @@ -152,7 +156,7 @@ \subsection{Research transparency} not a one-time requirement or retrospective task. New decisions are always being made as the plan begins contact with reality, and there is nothing wrong with sensible adaptation so long as it is recorded and disclosed. 
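Connecting this to the specification-robustness point cited above, a simple way to document how results change when analytical choices are made slightly differently is to loop over a pre-defined set of specifications and store every estimate. The sketch below is illustrative only; the outcome, treatment, and control variables are hypothetical placeholders.

\begin{verbatim}
* Illustrative sketch only: outcome, treatment, and controls are
* hypothetical placeholders for a project's actual variables.
use "analysis-data.dta", clear

local spec1 ""
local spec2 "age"
local spec3 "age education"
local spec4 "age education district"

tempname results
postfile `results' str60 controls beta se using "spec-results.dta", replace

forvalues i = 1/4 {
    quietly regress outcome treatment `spec`i'', vce(cluster village_id)
    post `results' ("`spec`i''") (_b[treatment]) (_se[treatment])
}
postclose `results'

* Review how the estimate moves across the recorded specifications
use "spec-results.dta", clear
list, clean
\end{verbatim}

Keeping a stored record of every specification that was run, rather than only the one reported, is one practical way to make the documentation habit described above part of the analysis itself.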
-(Email is \texttt{not} a note-taking service, because communications are rarely well-ordered and easy to delete.) +(Email is \textit{not} a note-taking service, because communications are rarely well-ordered and easy to delete.) There are various software solutions for building documentation over time. The \textbf{Open Science Framework}\sidenote{\url{https://osf.io/}} provides one such solution, @@ -160,11 +164,11 @@ \subsection{Research transparency} \textbf{GitHub}\sidenote{\url{https://github.com}} provides a transparent documentation system,\sidenote{\url{https://dimewiki.worldbank.org/wiki/Getting_started_with_GitHub}} \index{task management}\index{GitHub} in addition to version histories and wiki pages. -It offers multiple different ways to record the decision process leading to changes and additions, +Such services offers multiple different ways to record the decision process leading to changes and additions, track and register discussions, and manage tasks. -It's a flexible tool that can be adapted to different team and project dynamics, -but is less effective for file storage. -Each project has its specificities, +These are flexibles tool that can be adapted to different team and project dynamics, +but GitHub, unfortunately is less effective for file storage. +Each project has specific requirements for data, code, and documentation management, and the exact shape of this process can be molded to the team's needs, but it should be agreed upon prior to project launch. This way, you can start building a project's documentation as soon as you start making decisions. @@ -191,7 +195,18 @@ \subsection{Research credibility} or the \textbf{OSF} registry\sidenote{\url{https://osf.io/registries}} as appropriate.\sidenote{\url{http://datacolada.org/12}} \index{pre-registration} -With the rise of empirical research and increased public scrutiny of scientific evidence, however, +Garden varieties of research standards from journals, funders, and others feature both ex ante +(or ”regulation”) and ex post (or “verification”) policies. +Ex ante policies requires that the authors bear the burden +of ensuring they provide some set of materials before publication +and their quality meet some minimum standard. +Ex post policies require that authors make certain materials available to the public, +but their quality is not a direct condition for publication. +Still others have suggested “guidance” policies that would offer checklists +for which practices to adopt, such as reporting on whether and how +various practices were implemented. + +With this ongoing rise of empirical research and increased public scrutiny of scientific evidence, this is no longer enough to guarantee that findings will hold their credibility. Even if your methods are highly precise, your evidence is just as good as your data, @@ -200,6 +215,14 @@ \subsection{Research credibility} It allows other researchers, and research consumers, to verify the steps to a conclusion by themselves, and decide whether their standards for accepting a finding as evidence are met. +Therefore we encourage you to work, gradually, towards improving +the documentation and release of your research materials, +and finding the tools and workflows that best match your project and team. +Every investment you make in documentation and transparency up front +protects your project down the line, particularly as these standards continue to tighten. 
+Since projects span over many years, +the records you will need to have available for publication are +only bound to increase by the time you do so. %------------------------------------------------ From a2172ba653ec55c08a1806d6575a2daf4b59ade2 Mon Sep 17 00:00:00 2001 From: bbdaniels Date: Fri, 11 Oct 2019 14:30:14 -0400 Subject: [PATCH 055/854] IRB section --- chapters/handling-data.tex | 92 ++++++++++++++++++++++++++++++++------ 1 file changed, 78 insertions(+), 14 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index e435c2765..b7f2fa15b 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -227,7 +227,8 @@ \subsection{Research credibility} %------------------------------------------------ -\section{Ensuring privacy and security in research} +\section{Ensuring privacy and security in research data} + Anytime you are collecting primary data in a development research project,\index{primary data} you are almost certainly handling data that include \textbf{personally-identifying information (PII)}.\sidenote{\textbf{Personally-identifying information:} @@ -251,6 +252,82 @@ \section{Ensuring privacy and security in research} to decide which pieces of information fall into this category.\sidenote{ \url{https://sdcpractice.readthedocs.io/en/latest/}} +In all cases where this type of information is involved, +you must make sure that you adhere to several core processes, +including approval, consent, security, and privacy. +If you are a US-based researcher, you will become familiar +with a set of governance standards known as ``The Common Rule''.\sidenote{\url{https://www.hhs.gov/ohrp/regulations-and-policy/regulations/common-rule/index.html}} +If you interact with European institutions or persons, +you will also become familiar with ``GDPR'',\sidenote{\url{http://blogs.lshtm.ac.uk/library/2018/01/15/gdpr-for-research-data/}} +a set of regulations governing data ownership and privacy standards. +In all settings, you should have a clear understanding of +who owns your data (it may not be you, even if you collect or possess it), +the rights of the people whose information is reflected there, +and the necessary level of caution and risk involved in +storing and transferring this information. +Due to the increasing scrutiny on many organizations +from recently advanced standards and rights, +these considerations are critically important. +Check with your organization if you have any legal questions; +in general, you are responsible to avoid taking any action that +knowingly or recklessly ignores these considerations. + +\subsection{Ethical approval and consent processes} + +For almost all data collection or research activities that involves PII data, +you will be required to complete some form of Institutional Review Board (IRB) process. +Most commonly this consists of a formal application for approval of a specific +protocol for consent, data collection, and data handling. +The IRB which has authority over your project is not always apparent, +particularly if your institution does not have its own. +It is customary to obtain an approval from the university IRB where one PI is affiliated, +and if work is being done in an international setting approval is often also required +from a local institution subject to local law. + +The primary consideration of IRBs is the protection of the people whose data is being collected. 
+Many jurisdictions (especially those responsible to EU law) view all personal data +as being intrinsically owned by the persons who they describe. +This means that those persons have the right to refuse to participate in data collection +before it happens, as it is happening, or after it has already happened. +It also means that they must explicitly and affirmatively consent +to the collection, storage, and use of their information for all purposes. +Therefore, the development of these consent processes is of primary importance. +Ensuring that research participants are aware that their information +will be stored and may be used for various research purposes is critical. +There are special additional protections in place for vulnerable populations, +such as minors, prisoners, and people with disabilities, +and these should be confirmed with relevant authorities if your research includes them. + +Make sure you have significant advance timing with your IRB submissions. +You may not begin data collection until approval is in place, +and IRBs may have infrequent meeting schedules +or require several rounds of review for an application to be completed. +If there are any deviations from an approved plan or expected adjustments, +report these as early as you can so that you can update or revise the protocol. +Particularly at universities, IRBs have the power to retroactively deny +the right to use data which was not collected in accordance with an approved plan. +This is extremely rare, but shows the seriousness of these considerations +since the institution itself may face governmental penalties if its IRB +is unable to enforce them. As always, as long as you work in good faith, +you should not have any issues complying with these expectations. + +\subsection{Transmitting and storing data securely} + +Raw data which contains PII \textit{must} be \textbf{encrypted}\sidenote{\url{https://dimewiki.worldbank.org/wiki/encryption}} +\index{encryption} +during data collection, storage, and transfer. +\index{data transfer}\index{data storage} +Most modern data collection software has features that, if enabled, make the first part straightforward.\sidenote{\url{https://dimewiki.worldbank.org/wiki/SurveyCTO_Form_Settings}} +However, secure storage and transfer are your responsibility.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data_Security}} +There are plenty of options available to keep your data safe, +at different prices, from enterprise-grade solutions to combined free options. +You will also want to setup a password manager that allows you to share encryption keys inside your team. +These will vary in level of security and ease of use, +and sticking to a standard practice will make your life easier, +so agreeing on a protocol from the start of a project is ideal. + +\subsection{Protecting personally-identifying information} + Most of the field research done in development involves human subjects.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Human_Subjects_Approval}} \index{human subjects} @@ -266,19 +343,6 @@ \section{Ensuring privacy and security in research} or the CITI Program.\sidenote{ \url{https://about.citiprogram.org/en/series/human-subjects-research-hsr/}} -Raw data which contains PII \textit{must} be \textbf{encrypted}\sidenote{\url{https://dimewiki.worldbank.org/wiki/encryption}} -\index{encryption} -during data collection, storage, and transfer. 
-\index{data transfer}\index{data storage} -Most modern data collection software has features that, if enabled, make the first part straightforward.\sidenote{\url{https://dimewiki.worldbank.org/wiki/SurveyCTO_Form_Settings}} -However, secure storage and transfer are your responsibility.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data_Security}} -There are plenty of options available to keep your data safe, -at different prices, from enterprise-grade solutions to combined free options. -You will also want to setup a password manager that allows you to share encryption keys inside your team. -These will vary in level of security and ease of use, -and sticking to a standard practice will make your life easier, -so agreeing on a protocol from the start of a project is ideal. - In general, though, you shouldn't need to handle PII data very often. Once data is securely collected and stored, the first thing you will generally do is \textbf{de-identify} it.\sidenote{\url{https://dimewiki.worldbank.org/wiki/De-identification}} \index{de-identification} From 748e60c3f40829c7b8e92ce8afc99ef0df0598a9 Mon Sep 17 00:00:00 2001 From: bbdaniels Date: Fri, 11 Oct 2019 14:43:36 -0400 Subject: [PATCH 056/854] Encryption section --- chapters/handling-data.tex | 48 +++++++++++++++++++++++++++++++++----- 1 file changed, 42 insertions(+), 6 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index b7f2fa15b..dea23828c 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -317,14 +317,50 @@ \subsection{Transmitting and storing data securely} \index{encryption} during data collection, storage, and transfer. \index{data transfer}\index{data storage} -Most modern data collection software has features that, if enabled, make the first part straightforward.\sidenote{\url{https://dimewiki.worldbank.org/wiki/SurveyCTO_Form_Settings}} -However, secure storage and transfer are your responsibility.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data_Security}} +This means that, even if the information were to be intercepted or made public, +the files that would be obtained would be useless to the recipient. +(In security parlance this person is often referred to as an ``intruder'' +but it is rare that data breaches are nefarious or even intentional.) +The easiest way to protect personal information is not to use it. +It is often very simple to conduct planning and analytical work +using a subset of the data that has anonymous identifying ID variables, +and has had personal characteristics removed from the dataset altogether. +We encourage this approach, because it is easy. +However, when PII is absolutely necessary for work, +such as geographical location, application of intervention programs, +or planning or submission of survey materials, +you must actively protect those materials in transmission and storage. + +First, all accounts need to be protected by strong, unique passwords. +There are many services that create and store these passwords for you, +and some even provide utilities for sharing passwords with teams +inside that secure environment. (There are very few other secure ways to do this.) +Most modern data collection software has additional features that, if enabled, make secure transmission straightforward.\sidenote{\url{https://dimewiki.worldbank.org/wiki/SurveyCTO_Form_Settings}} +Many also have features that ensure your data is encrypted when stored on their servers, +although this usually needs to be actively administered. 
+(Note that password-protection alone is not sufficient to count as encryption, +because if the password if obtained the information itself is usable.) +The biggest security gap is often in transmitting survey plans to field teams, +since they usually do not have a highly trained analyst on site. +To protect this information, some key steps are +(a) to ensure that all devices have hard drive encryption and password-protection; +(b) that no information is sent over e-mail (use a secure sync drive instead); +and (c) all field staff receive adequate training on the privacy standards applicable to their work. + +Secure storage and transfer are ultimately your responsibility.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data_Security}} There are plenty of options available to keep your data safe, -at different prices, from enterprise-grade solutions to combined free options. -You will also want to setup a password manager that allows you to share encryption keys inside your team. -These will vary in level of security and ease of use, -and sticking to a standard practice will make your life easier, +at different prices, from enterprise-grade solutions to free software. +It may be sufficient to hold identifying information in an encrypted service, +or you may need to encrypt information at the file level using a special tool. +Extremely sensitive information may be required to be held in a ``cold'' machine +which does not have Internet access -- this is most often the case with +government records such as granular tax information. +Each of these tools and requirements will vary in level of security and ease of use, +and sticking to a standard practice will make your life much easier, so agreeing on a protocol from the start of a project is ideal. +Finally, having an end-of-life plan for data is essential: +you should always know how to transfer access and control to a new person if the team changes, +and what the expiry of the data and the planned deletion processes are. \subsection{Protecting personally-identifying information} From 1c4acda4a507c4878a3f885da5c41f2e2c493f23 Mon Sep 17 00:00:00 2001 From: bbdaniels Date: Tue, 15 Oct 2019 12:57:11 -0400 Subject: [PATCH 057/854] De-identification section --- chapters/handling-data.tex | 143 +++++++++++++++++++++++-------------- 1 file changed, 90 insertions(+), 53 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index dea23828c..92c835943 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -7,14 +7,18 @@ As the range and importance of the policy-relevant questions asked by development researchers grow, so does the (rightful) scrutiny under which methods and results are placed. -This scrutiny involves two major components: data handling and analytical quality. +Additionally, research also involves looking deeply into real people's +personal lives, financial conditions, and other sensitive subjects. +The rights and responsibilities involved in having such access +to personal information are a core responsibility of collecting personal data. +Ethical scrutiny involves two major components: data handling and research transparency. Performing at a high standard in both means that consumers of research can have confidence in its conclusions, and that research participants are appropriately protected. -What we call ethical standards in this chapter is a set of practices -for research quality and data privacy that address these two principles. 
+What we call ethical standards in this chapter are a set of practices +for research quality and data management that address these two principles. -Neither quality nor privacy is an ``all-or-nothing'' objective. +Neither transparency nor privacy is an ``all-or-nothing'' objective. We expect that teams will do as much as they can to make their work conform to modern practices of credibility, transparency, and reproducibility. Similarly, we expect that teams will ensure the privacy of participants in research @@ -62,9 +66,10 @@ \section{Protecting confidence in development research} \subsection{Research reproducibility} Reproducible research means that the actual analytical processes you used are executable by others.\cite{dafoe2014science} -(We use ``reproducibility'' to refer to the code processes in a specific study.\sidenote{\url{http://datacolada.org/76}}) +(We use ``reproducibility'' to refer to the code processes in a specific study.\sidenote{ + \url{http://datacolada.org/76}}) All your code files involving data cleaning, construction and analysis -should be public (unless they contain identifying information). +should be public, unless they contain identifying information. Nobody should have to guess what exactly comprises a given index, or what controls are included in your main regression, or whether or not you clustered standard errors correctly. @@ -81,7 +86,8 @@ \subsection{Research reproducibility} (but excluded from publication) as a resource to others who have similar questions of their own data. -Secondly, reproducible research\sidenote{\url{https://dimewiki.worldbank.org/wiki/Reproducible_Research}} +Secondly, reproducible research\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Reproducible_Research}} enables other researchers to re-use your code and processes to do their own work more easily and effectively in the future. This may mean applying your techniques to their data @@ -93,7 +99,8 @@ \subsection{Research reproducibility} Therefore, your code should be written neatly with clear instructions and published openly. It should be easy to read and understand in terms of structure, style, and syntax. Finally, the corresponding dataset should be openly accessible -unless for legal or ethical reasons it cannot be.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Publishing_Data}} +unless for legal or ethical reasons it cannot be.\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Publishing_Data}} Reproducibility and transparency are not binary concepts: there’s a spectrum, starting with simple materials release. @@ -115,10 +122,12 @@ \subsection{Research reproducibility} \subsection{Research transparency} Transparent research will expose not only the code, -but all the other research processes involved in developing the analytical approach.\sidenote{\url{http://www.princeton.edu/~mjs3/open_and_reproducible_opr_2017.pdf}} +but all the other research processes involved in developing the analytical approach.\sidenote{ + \url{http://www.princeton.edu/~mjs3/open_and_reproducible_opr_2017.pdf}} This means that readers be able to judge for themselves if the research was done well, and if the decision-making process was sound. 
-If the research is well-structured, and all of the relevant documentation\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data_Documentation}} is shared, +If the research is well-structured, and all of the relevant documentation\sidenote{ + url{https://dimewiki.worldbank.org/wiki/Data_Documentation}} is shared, this makes it as easy as possible for the reader to implement the same analysis. Expecting process transparency is also an incentive for researchers to make better decisions, be skeptical and thorough about their assumptions, @@ -126,9 +135,11 @@ \subsection{Research transparency} because it requires methodical organization that is labor-saving and efficient over the complete course of a project. Tools like pre-registration, pre-analysis plans, and -\textbf{Registered Reports}\sidenote{\url{https://blogs.worldbank.org/impactevaluations/registered-reports-piloting-pre-results-review-process-journal-development-economics}} +\textbf{Registered Reports}\sidenote{ + \url{https://blogs.worldbank.org/impactevaluations/registered-reports-piloting-pre-results-review-process-journal-development-economics}} can help with this process where they are available. -By pre-specifying a large portion of the research design,\sidenote{\url{https://www.bitss.org/2019/04/18/better-pre-analysis-plans-through-design-declaration-and-diagnosis/}} +By pre-specifying a large portion of the research design,\sidenote{ + \url{https://www.bitss.org/2019/04/18/better-pre-analysis-plans-through-design-declaration-and-diagnosis/}} a great deal of analytical planning has already been completed, and at least some research questions are pre-committed for publication regardless of the outcome. This is meant to combat the ``file-drawer problem'',\cite{simonsohn2014p} @@ -151,7 +162,7 @@ \subsection{Research transparency} since you have a record of why something was done in a particular way. There are a number of available tools that will contribute to producing documentation, -\index{project documentation} + \index{project documentation} but project documentation should always be an active and ongoing process, not a one-time requirement or retrospective task. New decisions are always being made as the plan begins contact with reality, @@ -161,8 +172,9 @@ \subsection{Research transparency} There are various software solutions for building documentation over time. The \textbf{Open Science Framework}\sidenote{\url{https://osf.io/}} provides one such solution, with integrated file storage, version histories, and collaborative wiki pages. -\textbf{GitHub}\sidenote{\url{https://github.com}} provides a transparent documentation system,\sidenote{\url{https://dimewiki.worldbank.org/wiki/Getting_started_with_GitHub}} -\index{task management}\index{GitHub} +\textbf{GitHub}\sidenote{\url{https://github.com}} provides a transparent documentation system,\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Getting_started_with_GitHub}} + \index{task management}\index{GitHub} in addition to version histories and wiki pages. Such services offers multiple different ways to record the decision process leading to changes and additions, track and register discussions, and manage tasks. @@ -179,9 +191,10 @@ \subsection{Research credibility} Is the research design sufficiently powered through its sampling and randomization? Were the key research outcomes pre-specified or chosen ex-post? How sensitive are the results to changes in specifications or definitions? 
-Tools such as \textbf{pre-analysis plans}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Pre-Analysis_Plan}} +Tools such as \textbf{pre-analysis plans}\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Pre-Analysis_Plan}} can be used to assuage these concerns for experimental evaluations -\index{pre-analysis plan} + \index{pre-analysis plan} by fully specifying some set of analysis intended to be conducted, but they may feel like ``golden handcuffs'' for other types of research.\cite{olken2015promises} Regardless of whether or not a formal pre-analysis plan is utilized, @@ -193,7 +206,7 @@ \subsection{Research credibility} the \textbf{3ie} database,\sidenote{\url{http://ridie.3ieimpact.org/}}, the \textbf{eGAP} database,\sidenote{\url{http://egap.org/content/registration/}}, or the \textbf{OSF} registry\sidenote{\url{https://osf.io/registries}} as appropriate.\sidenote{\url{http://datacolada.org/12}} -\index{pre-registration} + \index{pre-registration} Garden varieties of research standards from journals, funders, and others feature both ex ante (or ”regulation”) and ex post (or “verification”) policies. @@ -232,16 +245,16 @@ \section{Ensuring privacy and security in research data} Anytime you are collecting primary data in a development research project,\index{primary data} you are almost certainly handling data that include \textbf{personally-identifying information (PII)}.\sidenote{\textbf{Personally-identifying information:} - any piece or set of information that can be used to identify an individual research subject. - \url{https://dimewiki.worldbank.org/wiki/De-identification\#Personally\_Identifiable\_Information}} -\index{personally-identifying information} +any piece or set of information that can be used to identify an individual research subject. + \url{https://dimewiki.worldbank.org/wiki/De-identification\#Personally\_Identifiable\_Information}} + \index{personally-identifying information} PII data contains information that can, without any transformation, be used to identify individual people, households, villages, or firms that were included in \textbf{data collection}. -\index{data collection} + \index{data collection} This includes names, addresses, and geolocations, and extends to personal information -\index{geodata} + \index{geodata} such as email addresses, phone numbers, and financial information. -\index{de-identification} + \index{de-identification} It is important to keep in mind these data privacy principles not only for the individual respondent but also the PII data of their household members or other caregivers who are covered under the survey. In some contexts this list may be more extensive -- for example, if you are working in an environment that is either small, specific, @@ -256,9 +269,11 @@ \section{Ensuring privacy and security in research data} you must make sure that you adhere to several core processes, including approval, consent, security, and privacy. 
If you are a US-based researcher, you will become familiar -with a set of governance standards known as ``The Common Rule''.\sidenote{\url{https://www.hhs.gov/ohrp/regulations-and-policy/regulations/common-rule/index.html}} +with a set of governance standards known as ``The Common Rule''.\sidenote{ + \url{https://www.hhs.gov/ohrp/regulations-and-policy/regulations/common-rule/index.html}} If you interact with European institutions or persons, -you will also become familiar with ``GDPR'',\sidenote{\url{http://blogs.lshtm.ac.uk/library/2018/01/15/gdpr-for-research-data/}} +you will also become familiar with ``GDPR'',\sidenote{ + \url{http://blogs.lshtm.ac.uk/library/2018/01/15/gdpr-for-research-data/}} a set of regulations governing data ownership and privacy standards. In all settings, you should have a clear understanding of who owns your data (it may not be you, even if you collect or possess it), @@ -313,10 +328,11 @@ \subsection{Ethical approval and consent processes} \subsection{Transmitting and storing data securely} -Raw data which contains PII \textit{must} be \textbf{encrypted}\sidenote{\url{https://dimewiki.worldbank.org/wiki/encryption}} -\index{encryption} +Raw data which contains PII \textit{must} be \textbf{encrypted}\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/encryption}} + \index{encryption} during data collection, storage, and transfer. -\index{data transfer}\index{data storage} + \index{data transfer}\index{data storage} This means that, even if the information were to be intercepted or made public, the files that would be obtained would be useless to the recipient. (In security parlance this person is often referred to as an ``intruder'' @@ -347,11 +363,15 @@ \subsection{Transmitting and storing data securely} (b) that no information is sent over e-mail (use a secure sync drive instead); and (c) all field staff receive adequate training on the privacy standards applicable to their work. -Secure storage and transfer are ultimately your responsibility.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data_Security}} +Secure storage and transfer are ultimately your personal responsibility.\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Data_Security}} There are plenty of options available to keep your data safe, at different prices, from enterprise-grade solutions to free software. It may be sufficient to hold identifying information in an encrypted service, or you may need to encrypt information at the file level using a special tool. +(This is in contrast to using software or services with disk-level or service-level encryption.) +Data security is important not only for identifying, but also sensitive information, +especially when a worst-case scenario could potentially lead to re-identifying subjects. Extremely sensitive information may be required to be held in a ``cold'' machine which does not have Internet access -- this is most often the case with government records such as granular tax information. @@ -362,47 +382,64 @@ \subsection{Transmitting and storing data securely} you should always know how to transfer access and control to a new person if the team changes, and what the expiry of the data and the planned deletion processes are. 
-\subsection{Protecting personally-identifying information} +\subsection{De-identifying and anonymizing information} Most of the field research done in development involves human subjects.\sidenote{ -\url{https://dimewiki.worldbank.org/wiki/Human_Subjects_Approval}} -\index{human subjects} + \url{https://dimewiki.worldbank.org/wiki/Human_Subjects_Approval}} + \index{human subjects} As a researcher, you are asking people to trust you with personal information about themselves: where they live, how rich they are, whether they have committed or been victims of crimes, their names, their national identity numbers, and all sorts of other data. PII data carries strict expectations about data storage and handling, and it is the responsibility of the research team to satisfy these expectations.\sidenote{ -\url{https://dimewiki.worldbank.org/wiki/Research_Ethics}} + \url{https://dimewiki.worldbank.org/wiki/Research_Ethics}} Your donor or employer will most likely require you to hold a certification from a source such as Protecting Human Research Participants\sidenote{ -\url{https://humansubjects.nih.gov/sites/hs/phrp/PHRP_Archived_Course_Materials.pdf}} + \url{https://phrptraining.com}} or the CITI Program.\sidenote{ -\url{https://about.citiprogram.org/en/series/human-subjects-research-hsr/}} + \url{https://about.citiprogram.org/en/series/human-subjects-research-hsr/}} In general, though, you shouldn't need to handle PII data very often. -Once data is securely collected and stored, the first thing you will generally do is \textbf{de-identify} it.\sidenote{\url{https://dimewiki.worldbank.org/wiki/De-identification}} -\index{de-identification} +and you can take simple steps to minimize risk by minimizing the handling of PII. +First, only collect information that is strictly needed for the research. +Second, avoid the proliferation of copies of identified data. +There should only be one raw identified dataset copy +and it should be somewhere where only approved people can access it. +Finally, not everyone on the research team needs access to identified data. +Analysis that required PII data is rare +and can be avoided by properly linking identifiers to research information +such as treatment statuses and weights, then removing identifiers. + +Therefore, once data is securely collected and stored, the first thing you will generally do is \textbf{de-identify} it.\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/De-identification}} + \index{de-identification} (We will provide more detail on this in the chapter on data collection.) -This will create a working copy that can safely be shared among collaborators. +This will create a working de-identified copy that can safely be shared among collaborators. De-identified data should avoid, for example, you being sent back to every household to alert them that someone dropped all their personal information on a public bus and we don't know who has it. This simply means creating a copy of the data that contains no personally-identifiable information. This data should be an exact copy of the raw data, -except it would be okay for it to be publicly released.\cite{matthews2011data} -Ideally, all machines used to store and process PII data are not only password protected, but also encrypted at the hard drive level -(most modern operating systems provide such a tool). -This means that even if you lose your computer with identifying data in it, -anyone who gets hold of it still cannot access the information. 
- -Complete data publication, unlike reproducibility checks, -brings along with it a set of serious privacy concerns, -particularly when sensitive data is used in key analyses. +except it would be okay if it were for some reason publicly released.\cite{matthews2011data} + +Note, however, that you can never \textbf{anonymize} data. +There is always some statistical chance that an individual's identity +will be re-linked to the data collected about them +by using some other set of data that are collectively unique. There are a number of tools developed to help researchers de-identify data -(\texttt{PII\_detection}\sidenote{\url{https://github.com/PovertyAction/PII\_detection}} from IPA, +and which you should use as appropriate at that stage of data collection. +These include \texttt{PII\_detection}\sidenote{\url{https://github.com/PovertyAction/PII\_detection}} from IPA, \texttt{PII-scan}\sidenote{\url{https://github.com/J-PAL/PII-Scan}} from JPAL, -and \texttt{sdcMicro}\sidenote{\url{https://sdcpractice.readthedocs.io/en/latest/sdcMicro.html}} from the World Bank). -But is it ever possible to fully protect privacy in an era of big data? +and \texttt{sdcMicro}\sidenote{\url{https://sdcpractice.readthedocs.io/en/latest/sdcMicro.html}} from the World Bank. +The \texttt{sdcMicro} tool, in particular, has a feature +that allows you to assess the uniqueness of your data observations, +and simple measures of the identifiability of records from that. +Additional options to protect privacy in data that will become public exist, +and you should expect and intend to release your datasets at some point. One option is to add noise to data, as the US Census has proposed, as it makes the trade-off between data accuracy and privacy explicit. -But there are no established norms for such “differential privacy” approaches: -most approaches fundamentally rely on judging “how harmful” disclosure would be. +But there are no established norms for such ``differential privacy'' approaches: +most approaches fundamentally rely on judging ``how harmful'' information disclosure would be. +The fact remains that there is always a balance between information release +and privacy protection, and that you should engage with it actively and explicitly. +The best thing you can do is make a complete record of the steps that have been taken +so that the process can be reviewed, revised, and updated as necessary. From bfd77bdd2b3cf365868565a9ae795055babe707d Mon Sep 17 00:00:00 2001 From: bbdaniels Date: Tue, 15 Oct 2019 13:07:50 -0400 Subject: [PATCH 058/854] Cleaning --- chapters/handling-data.tex | 36 ++++++++++++++++++------------------ 1 file changed, 18 insertions(+), 18 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 92c835943..e693e939c 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -42,7 +42,7 @@ \section{Protecting confidence in development research} The empirical revolution in development research -\index{transparency}\index{credibility}\index{reproducibility} + \index{transparency}\index{credibility}\index{reproducibility} has led to increased public scrutiny of the reliability of research.\cite{rogers_2017} Three major components make up this scrutiny: \textbf{reproducibility}.\cite{duvendack2017meant}, \textbf{transparency},\cite{christensen2018transparency} and \textbf{credibility},\cite{ioannidis2017power}. Reproducibility is one key component of transparency. 
@@ -76,15 +76,16 @@ \subsection{Research reproducibility} That is, as a purely technical matter, nobody should have to ``just trust you'', nor should they have to bother you to find out what happens if any or all of these things were to be done slightly differently.\cite{simmons2011false,simonsohn2015specification,wicherts2016degrees} -Letting people play around with your data and code is a great way to have new questions asked and answered +Letting people play around with your data and code +is a great way to have new questions asked and answered based on the valuable work you have already done. -Services like GitHub that expose your code development process are valuable resources here. +Services like GitHub that log your research process are valuable resources here. + \index{GitHub} Such services can show things like modifications made in response to referee comments, by having tagged version histories at each major revision. These services can also use issue trackers and abandoned work branches to document the research paths and questions you may have tried to answer -(but excluded from publication) -as a resource to others who have similar questions of their own data. +as a resource to others who have similar questions. Secondly, reproducible research\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Reproducible_Research}} @@ -103,19 +104,16 @@ \subsection{Research reproducibility} \url{https://dimewiki.worldbank.org/wiki/Publishing_Data}} Reproducibility and transparency are not binary concepts: -there’s a spectrum, starting with simple materials release. +there's a spectrum, starting with simple materials publication. But even getting that first stage right is a challenge. An analysis of 203 empirical papers published in top economics journals in 2016 showed that less than 1 in 7 provided all the data and code needed to assess computational reproducibility.\cite{galiani2017incentives} A scan of the 90,000 datasets on the Harvard Dataverse -found that only 10% have the necessary files and documentation +found that only 10\% had the necessary files and documentation for computational reproducibility (and a check of 3,000 of those that met requirements found that 85\% did not replicate). -Longer-term goals to meet reproducibility and transparency standards -include making tools for research transparency part and parcel -of the quest for efficiency gains in the research production function. People seem to systematically underestimate the benefits and overestimate the costs to adopting modern research practices. @@ -127,8 +125,8 @@ \subsection{Research transparency} This means that readers be able to judge for themselves if the research was done well, and if the decision-making process was sound. If the research is well-structured, and all of the relevant documentation\sidenote{ - url{https://dimewiki.worldbank.org/wiki/Data_Documentation}} is shared, -this makes it as easy as possible for the reader to implement the same analysis. + \url{https://dimewiki.worldbank.org/wiki/Data_Documentation}} is shared, +this makes it as easy as possible for the reader to understand the analysis later. 
Expecting process transparency is also an incentive for researchers to make better decisions, be skeptical and thorough about their assumptions, and, as we hope to convince you, make the process easier for themselves, @@ -138,6 +136,7 @@ \subsection{Research transparency} \textbf{Registered Reports}\sidenote{ \url{https://blogs.worldbank.org/impactevaluations/registered-reports-piloting-pre-results-review-process-journal-development-economics}} can help with this process where they are available. + \index{pre-registration}\index{pre-analysis plans}\index{Registered Reports} By pre-specifying a large portion of the research design,\sidenote{ \url{https://www.bitss.org/2019/04/18/better-pre-analysis-plans-through-design-declaration-and-diagnosis/}} a great deal of analytical planning has already been completed, @@ -156,9 +155,10 @@ \subsection{Research transparency} and adapting and expanding this process for the development of lab-style working groups in development is a critical step. This means explicitly noting decisions as they are made, and explaining the process behind them. -Documentation on data processing and additional hypotheses tested will be expected in the supplemental materials to any publication. +Documentation on data processing and additional hypotheses tested +will be expected in the supplemental materials to any publication. Careful documentation will also save the research team a lot of time during a project, -as it prevents you to have the same discussion twice (or more!), +as it prevents you from having the same discussion twice (or more!), since you have a record of why something was done in a particular way. There are a number of available tools that will contribute to producing documentation, @@ -335,8 +335,8 @@ \subsection{Transmitting and storing data securely} \index{data transfer}\index{data storage} This means that, even if the information were to be intercepted or made public, the files that would be obtained would be useless to the recipient. -(In security parlance this person is often referred to as an ``intruder'' -but it is rare that data breaches are nefarious or even intentional.) +In security parlance this person is often referred to as an ``intruder'' +but it is rare that data breaches are nefarious or even intentional. The easiest way to protect personal information is not to use it. It is often very simple to conduct planning and analytical work using a subset of the data that has anonymous identifying ID variables, @@ -354,8 +354,8 @@ \subsection{Transmitting and storing data securely} Most modern data collection software has additional features that, if enabled, make secure transmission straightforward.\sidenote{\url{https://dimewiki.worldbank.org/wiki/SurveyCTO_Form_Settings}} Many also have features that ensure your data is encrypted when stored on their servers, although this usually needs to be actively administered. -(Note that password-protection alone is not sufficient to count as encryption, -because if the password if obtained the information itself is usable.) +Note that password-protection alone is not sufficient to count as encryption, +because if the underlying data is obtained through a leak the information itself is usable. The biggest security gap is often in transmitting survey plans to field teams, since they usually do not have a highly trained analyst on site. 
To protect this information, some key steps are From 2d3da76aa830b0847c79e6cece6941258c15ee79 Mon Sep 17 00:00:00 2001 From: bbdaniels Date: Tue, 15 Oct 2019 14:55:49 -0400 Subject: [PATCH 059/854] Section name --- chapters/handling-data.tex | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index e693e939c..0781d4e86 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -242,7 +242,6 @@ \subsection{Research credibility} \section{Ensuring privacy and security in research data} - Anytime you are collecting primary data in a development research project,\index{primary data} you are almost certainly handling data that include \textbf{personally-identifying information (PII)}.\sidenote{\textbf{Personally-identifying information:} any piece or set of information that can be used to identify an individual research subject. @@ -287,7 +286,7 @@ \section{Ensuring privacy and security in research data} in general, you are responsible to avoid taking any action that knowingly or recklessly ignores these considerations. -\subsection{Ethical approval and consent processes} +\subsection{Obtaining ethical approval and consent} For almost all data collection or research activities that involves PII data, you will be required to complete some form of Institutional Review Board (IRB) process. From 5ae8070372b20fab73a71919e84639aca3979884 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 15 Oct 2019 15:14:32 -0400 Subject: [PATCH 060/854] Apply suggestions from code review --- chapters/planning-data-work.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 1bd122be8..39e5bdb55 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -7,7 +7,7 @@ This means knowing which data sets and output you need at the end of the process, how they will stay organized and linked, what different types and levels of data you'll handle, -and how big and sensitive it will be. +and whether the data will require special handling due to volume or privacy considerations. Identifying these details creates a \textbf{data map} for your project, giving you and your team a sense of how information resources should be organized. It's okay to update this map once the project is underway -- @@ -140,7 +140,7 @@ \subsection{Folder management} This is reflected in the master script in such a way that the only change necessary to run the entire code from a new computer is to change the path to the project folder. -The code \texttt{stata-master-dofile.do} how folder structure is reflected in a master do-file. +The code in \texttt{stata-master-dofile.do} shows how folder structure is reflected in a master do-file. \subsection{Code management} From f98ddf60485b338a43a3d796c56f02ca7e25234f Mon Sep 17 00:00:00 2001 From: bbdaniels Date: Tue, 15 Oct 2019 15:19:04 -0400 Subject: [PATCH 061/854] Introduction and title (#173) --- chapters/planning-data-work.tex | 14 +++++++++----- manuscript.tex | 2 +- 2 files changed, 10 insertions(+), 6 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 033e79f7e..eb32cb23d 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -1,8 +1,10 @@ %------------------------------------------------ \begin{fullwidth} -Preparation for data work begins long before you collect any data. 
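As a companion to the \texttt{stata-master-dofile.do} reference above, a minimal sketch of how folder structure can be reflected in a master do-file is shown below; the folder names and do-file names are hypothetical.

\begin{verbatim}
* Illustrative master do-file sketch: all folder names are hypothetical.
* The only line to edit when moving to a new computer is the root path.
clear all
set more off

* Project root (note the forward slashes)
global projectfolder "C:/Users/yourname/Dropbox/project-name"

* Folder globals that mirror the agreed project folder structure
global dataraw   "${projectfolder}/data/raw"
global dataclean "${projectfolder}/data/clean"
global dofiles   "${projectfolder}/dofiles"
global outputs   "${projectfolder}/outputs"

* Run the project stages in order
do "${dofiles}/cleaning.do"
do "${dofiles}/construct.do"
do "${dofiles}/analysis.do"
\end{verbatim}

Because every file path is built from a single root global, changing that one line is enough to run the full project from another machine.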
-In order to be prepared to work on the data you receive, +Preparation for data work begins long before you collect any data, +and involves planning both the software tools you will use yourself +as well as the collaboration platforms and processes for your team. +In order to be prepared to work on the data you receive with a group, you need to know what you are getting into. This means knowing which data sets and output you need at the end of the process, how they will stay organized and linked, @@ -13,11 +15,13 @@ It's okay to update this map once the project is underway -- the point is that everyone knows what the plan is. -Then, you must identify and prepare your tools and workflow. +Then, you must identify and prepare your collaborative tools and workflow. Changing software and protocols half-way through a project can be costly and time-consuming, so it's important to think ahead about decisions that may seem of little consequence (think: creating a new folder and moving files into it). -This chapter will discuss some of often overlooked tools and processes that +Similarly, having a self-documenting discussion platform +makes working together on outputs much easier from the very first discussion. +This chapter will discuss some tools and processes that will help prepare you for collaboration and replication. We will try to provide free, open-source, and platform-agnostic tools wherever possible, and point to more detailed instructions when relevant. @@ -356,7 +360,7 @@ \subsection{Preparing for collaboration and replication} \url{https://michaelstepner.com/blog/git-vs-dropbox/}} GitHub has the following features that are useful for efficient workflows: - The Issues tab is a great tool for task management. -- It creates incentives for writing down why changes were made as they are saved, creating naturally documented code. +- It creates incentives for writing down why changes were made as they are saved, creating naturally documented code. - It is useful also because tasks can clearly be tied to file versions. Thus, it serves as a great tool for managing code-related tasks. diff --git a/manuscript.tex b/manuscript.tex index d2ec49a8b..8ac70e36f 100644 --- a/manuscript.tex +++ b/manuscript.tex @@ -41,7 +41,7 @@ \chapter{Chapter 1: Handling data ethically} % CHAPTER 2 %---------------------------------------------------------------------------------------- -\chapter{Chapter 2: Planning data work before going to field} +\chapter{Chapter 2: Collaborating on code and data} \label{ch:2} \input{chapters/planning-data-work.tex} From d2b5fbfe3700606283f7a293f3b04e6407b61467 Mon Sep 17 00:00:00 2001 From: bbdaniels Date: Tue, 15 Oct 2019 16:00:25 -0400 Subject: [PATCH 062/854] Setup --- chapters/planning-data-work.tex | 20 +++++++++++++++++--- 1 file changed, 17 insertions(+), 3 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index eb32cb23d..8a2492b0c 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -35,7 +35,7 @@ %------------------------------------------------ -\section{Preparing your digital workspace} +\section{Preparing a collaborative environment} Being comfortable using your computer and having the tools you need in reach is key. 
This section provides a brief introduction to key concepts and toolkits @@ -59,6 +59,22 @@ \section{Preparing your digital workspace} \subsection{Setting up your computer} +\subsection{Documenting decisions and tasks} + +\subsection{Choosing software} + +\section{Organizing code and data} + +\subsection{Version control} + +\subsection{Folder management} + +\subsection{Code management} + +\subsection{Output management} + +\subsection{Setting up your computer} + First things first: turn on your computer. Make sure you have fully updated the operating system, that it is in good working order, @@ -77,8 +93,6 @@ \subsection{Setting up your computer} Dropbox files count only as local copies and never backups, because others can alter it. -%There are a few things that can make your life much easier, although some have a little expense associated with them.\marginnote{This section is the only time we'll suggest you spend money, and it is totally okay if you or your organization cannot.} Mainly, make sure you have a \textit{good} computer. Get at least 16GB of RAM and a 500MB or 1TB hard drive. (Processor speeds matter less these days.) Get a monitor with high-definition resolution. \marginnote{Free alternatives to these tools include LibreOffice, Bitwarden, and Duplicati, although Dropbox is harder to replace effectively.} Your life will be easier with paid copies of software like Microsoft Office 365 and critical services like \textbf{Dropbox},\sidenote{\url{https://www.dropbox.com}} \textbf{Backblaze},\sidenote{\url{https://www.backblaze.com}} and \textbf{LastPass}.\sidenote{\url{https://www.lastpass.com}} Get a decent email client (like \textbf{Spark}\sidenote{\url{https://sparkmailapp.com}} or \textbf{Outlook}), \index{software} a calendar that you like, the communication and note-taking tools you need, a good music streaming service, and solid headphones with a microphone. None of these are essential to the work, but they will make you a lot more comfortable doing it, and being comfortable at your computer helps you stay happy and healthy. - % When using a computer for research, you should keep in mind a structure of work known as \textbf{scientific computing}.\cite{wilson2014best,wilson2017good} \index{scientific computing} Scientific computing is a set of practices developed to help you ensure that the computer is being used to improve your efficiency, so that you can focus on the real-world problems instead of technical ones.\sidenote{ \url{https://www.dropbox.com/s/wqefknwfb91kop8/Coding_For_Econs_20190221.pdf?raw=1}} This means getting to know your computer a little better than most people do, and thinking critically about tasks like file structures, code and \textbf{process reusability},\sidenote{ \url{http://blogs.worldbank.org/opendata/making-analytics-reusable}} and software choice. Most importantly, it means detecting early warning signs of \textbf{process bloat}. As a general rule, if the work required to maintain a process grows as fast (or faster) than the number of objects controlled by that process, you need to stop work immediately and rethink processes. You should work to design processes that are close to infinitely scalable by the number of objects being handled -- whether they be field samples, data files, surveys, or other real or digital objects. The first thing you need to figure to use your computer efficiently is where you are on your file system. \marginnote{You should \textit{always} use forward slashes (\texttt{/}) in file paths. 
Backslashes will break folder paths in many systems.} Find your \textbf{home folder}. On MacOS, this will be a folder with your username. From abcdbffeaceb9003b8468d8c78eaa496481851e7 Mon Sep 17 00:00:00 2001 From: bbdaniels Date: Tue, 15 Oct 2019 17:47:48 -0400 Subject: [PATCH 063/854] Setup structure (#185) --- chapters/planning-data-work.tex | 133 +++++++++++++++----------------- 1 file changed, 63 insertions(+), 70 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 8a2492b0c..95b12d2a2 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -32,10 +32,9 @@ so that you do not spend a lot of time later figuring out basic functions. \end{fullwidth} - %------------------------------------------------ -\section{Preparing a collaborative environment} +\section{Preparing a collaborative work environment} Being comfortable using your computer and having the tools you need in reach is key. This section provides a brief introduction to key concepts and toolkits @@ -59,41 +58,27 @@ \section{Preparing a collaborative environment} \subsection{Setting up your computer} -\subsection{Documenting decisions and tasks} - -\subsection{Choosing software} - -\section{Organizing code and data} - -\subsection{Version control} - -\subsection{Folder management} - -\subsection{Code management} - -\subsection{Output management} - -\subsection{Setting up your computer} - First things first: turn on your computer. Make sure you have fully updated the operating system, that it is in good working order, and that you have a \textbf{password-protected} login. -All machines that will handle personally-identifiable information should be encrypted; -this should be built-in to most modern operating systems (BitLocker on PCs or FileVault on Macs). -Then, make sure your computer is backed up. +All machines should have their hard disks encrypted; +this should be built-in to most modern operating systems +(the service is currently called BitLocker on Windows or FileVault on MacOS). +Encryption prevents your contents from ever being accessed without the password. +Then, make sure your computer is backed up to prevent information loss. Follow the \textbf{3-2-1 rule}: -(3) copies of everything; -(2) different physical media; -(1) offsite storage.\sidenote{\url{https://www.backblaze.com/blog/the-3-2-1-backup-strategy/}} - -One reasonable setup is having your primary disk, -a local hard drive managed with a tool like Time Machine, -and a remote copy managed by a tool like Backblaze. -Dropbox files count only as local copies and never backups, -because others can alter it. - -% When using a computer for research, you should keep in mind a structure of work known as \textbf{scientific computing}.\cite{wilson2014best,wilson2017good} \index{scientific computing} Scientific computing is a set of practices developed to help you ensure that the computer is being used to improve your efficiency, so that you can focus on the real-world problems instead of technical ones.\sidenote{ \url{https://www.dropbox.com/s/wqefknwfb91kop8/Coding_For_Econs_20190221.pdf?raw=1}} This means getting to know your computer a little better than most people do, and thinking critically about tasks like file structures, code and \textbf{process reusability},\sidenote{ \url{http://blogs.worldbank.org/opendata/making-analytics-reusable}} and software choice. Most importantly, it means detecting early warning signs of \textbf{process bloat}. 
As a general rule, if the work required to maintain a process grows as fast (or faster) than the number of objects controlled by that process, you need to stop work immediately and rethink processes. You should work to design processes that are close to infinitely scalable by the number of objects being handled -- whether they be field samples, data files, surveys, or other real or digital objects. The first thing you need to figure to use your computer efficiently is where you are on your file system. \marginnote{You should \textit{always} use forward slashes (\texttt{/}) in file paths. Backslashes will break folder paths in many systems.} +have 3 copies of everything, on at least +2 different hardware devices you have access to, +with 1 offsite storage.\sidenote{ + \url{https://www.backblaze.com/blog/the-3-2-1-backup-strategy/}} + +One reasonable setup is having your primary computer, +a local hard drive managed with a tool like Time Machine +(or a secondary computer), +and a remote copy managed by a cloud backup service. +Dropbox and other synced files count only as local copies and never as remote backups, +because other users can alter them. Find your \textbf{home folder}. On MacOS, this will be a folder with your username. On Windows, this will be something like ``This PC''. (It is never your desktop.) @@ -102,15 +87,58 @@ \subsection{Setting up your computer} On MacOS this will be something like \path{/users/username/dropbox/project/...}, and on Windows, \path{C:/users/username/github/project/...}. We will write file paths such as \path{/Dropbox/project-title/DataWork/EncryptedData/} -using forward slashes, and mostly use only A-Z, dash, and underscore. +using forward slashes, and mostly use only A-Z, dash (\texttt{-}), and underscore (\texttt{_}). You should \textit{always} use forward slashes (\texttt{/}) in file paths, just like an internet address, and no matter how your computer writes them, because the other type will cause your work to break many systems. You can use spaces in names of non-technical files, but not technical ones.\sidenote{ -\url{http://www2.stat.duke.edu/~rcs46/lectures_2015/01-markdown-git/slides/naming-slides/naming-slides.pdf}} + \url{http://www2.stat.duke.edu/~rcs46/lectures_2015/01-markdown-git/slides/naming-slides/naming-slides.pdf}} Making the structure of your files part of your workflow is really important, as is naming them correctly so you know what is where. +\subsection{Documenting decisions and tasks} + +\subsection{Choosing software} + +\section{Organizing code and data} + +\subsection{Version control} + +A \textbf{version control system} is the way you manage the changes to any computer file. +This is important, for example, for your team to be able to find the version of a presentation that you delivered to your donor, +but also to understand why the significance level of your estimates has changed. +Everyone who has ever encountered a file named something like \texttt{final\_report\_v5\_LJK\_KLE\_jun15.docx} +can appreciate how useful such a system can be. +Most file sharing solutions offer some level of version control. +These are usually enough to manage changes to binary files (such as Word and PowerPoint documents) without needing to appeal to these dreaded file names. +For code files, however, a more complex version control system is usually desirable. +We recommend using Git\sidenote{\textbf{Git:} a multi-user version control system for collaborating on and tracking changes to code as it is written.} for all plain text files. 
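(A brief aside, tying back to the file-path conventions described earlier in this subsection:
the sketch below shows how forward-slash paths and a single root global are typically written in Stata
so that the same code runs unchanged on Windows and MacOS.
The user name, folder names, and file name are hypothetical placeholders,
and the version-control discussion continues immediately below.)

\begin{verbatim}
* Hypothetical example -- user name, project folder, and file are placeholders.
* Forward slashes work in Stata file paths on Windows, MacOS, and Linux.
global homefolder    "C:/Users/username"
global projectfolder "${homefolder}/Dropbox/project-title"

* Other locations are built from the root global,
* so only the line defining that global changes across computers.
global datawork "${projectfolder}/DataWork"
use "${datawork}/baseline-survey.dta", clear
\end{verbatim}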
+Git tracks all the changes you make to your code, +and allows you to go back to previous versions without losing the information on changes made. +It also makes it possible to work on two parallel versions of the code, +so you don't risk breaking the code for other team members as you try something new, + +Increasingly, we recommend the entire data work folder +to be created and stored separately in GitHub. +Nearly all code and outputs (except datasets) are better managed this way. +Code is written in its native language, +and it's becoming more and more common for written outputs such as reports, +presentations and documentations to be written using different \textbf{literate programming} +tools such as {\LaTeX} and dynamic documents. +You should therefore feel comfortable having both a project folder and a code folder. +Their structures can be managed in parallel by using \texttt{iefolder} twice. +The project folder can be maintained in a synced location like Dropbox, +and the code folder can be maintained in a version-controlled location like GitHub. +While both are used for sharing and collaborating, +there is a sharp difference between the functionality of sync and version control. +Namely, sync forces everyone to have the same version of every file at all times +and does not support simultaneous editing well; version control does the opposite. +Keeping code in a version-controlled folder will allow you +to maintain better control of its history and functionality, +and because of the specificity with which code depends on file structure, +you will be able to enforce better practices there than in the project folder. + + \subsection{Folder management} The first thing your team will need to create is a shared folder.\sidenote{Common tools for folder sharing are Dropbox, Box, and OneDrive.} @@ -159,6 +187,7 @@ \subsection{Folder management} The code in \texttt{stata-master-dofile.do} shows how folder structure is reflected in a master do-file. + \subsection{Code management} Once you start a project's data work, @@ -240,42 +269,6 @@ \subsection{Code management} Making sure that the code is running, and that other people can understand the code is also the easiest way to ensure a smooth project handover. -\subsection{Version control} - -A \textbf{version control system} is the way you manage the changes to any computer file. -This is important, for example, for your team to be able to find the version of a presentation that you delivered to your donor, -but also to understand why the significance level of your estimates has changed. -Everyone who has ever encountered a file named something like \texttt{final\_report\_v5\_LJK\_KLE\_jun15.docx} -can appreciate how useful such a system can be. -Most file sharing solutions offer some level of version control. -These are usually enough to manage changes to binary files (such as Word and PowerPoint documents) without needing to appeal to these dreaded file names. -For code files, however, a more complex version control system is usually desirable. -We recommend using Git\sidenote{\textbf{Git:} a multi-user version control system for collaborating on and tracking changes to code as it is written.} for all plain text files. -Git tracks all the changes you make to your code, -and allows you to go back to previous versions without losing the information on changes made. 
-It also makes it possible to work on two parallel versions of the code, -so you don't risk breaking the code for other team members as you try something new, - -Increasingly, we recommend the entire data work folder -to be created and stored separately in GitHub. -Nearly all code and outputs (except datasets) are better managed this way. -Code is written in its native language, -and it's becoming more and more common for written outputs such as reports, -presentations and documentations to be written using different \textbf{literate programming} -tools such as {\LaTeX} and dynamic documents. -You should therefore feel comfortable having both a project folder and a code folder. -Their structures can be managed in parallel by using \texttt{iefolder} twice. -The project folder can be maintained in a synced location like Dropbox, -and the code folder can be maintained in a version-controlled location like GitHub. -While both are used for sharing and collaborating, -there is a sharp difference between the functionality of sync and version control. -Namely, sync forces everyone to have the same version of every file at all times -and does not support simultaneous editing well; version control does the opposite. -Keeping code in a version-controlled folder will allow you -to maintain better control of its history and functionality, -and because of the specificity with which code depends on file structure, -you will be able to enforce better practices there than in the project folder. - \subsection{Output management} Another task that needs to be discussed with your team is the best way to manage outputs. From fc19ef74d8f31f964cffd50d542fdc2fb5aa0e6d Mon Sep 17 00:00:00 2001 From: bbdaniels Date: Tue, 15 Oct 2019 18:05:19 -0400 Subject: [PATCH 064/854] Documentation and e-mail (#179) --- chapters/planning-data-work.tex | 77 +++++++++++++++++++++------------ 1 file changed, 50 insertions(+), 27 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 95b12d2a2..25f0fc869 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -98,8 +98,58 @@ \subsection{Setting up your computer} \subsection{Documenting decisions and tasks} +The first habit that many teams need to break is using e-mail for task management. +E-mail is, simply put, not a system. It is not a system for anything. +It is not structured to manage group membership or to present the same information +across a group of people, or to remind you when old information becomes relevant. +It is not structured to allow people to collaborate over a long time or to review old discussions. +It is easy to miss or lose communications when they have relevance in the future. +E-mail is for communicating ``now'' and this is what it was designed to do. +Everything that is communicated over e-mail or any other medium should +immediately be transferred into a system that is designed to keep records. +We call these systems collaboration tools, and there are several that are very useful. + +First of all, you will find a wide variety of task management tools online. +Many of them are based on an underlying system known as ``Kanban boards''.\sidenote{ + \url{https://en.wikipedia.org/wiki/Kanban_board}} +This task-oriented system allows the team to create and assign tasks, +to track progress across time, and to quickly see the project state. +These systems also link communications to specific tasks so that +the records related to decision making on those tasks is permanently recorded. 
+A common and free implementation of this system is the one found in GitHub project boards. +You may also use a system like GitHub Issues or task-assignment on Dropbox Paper, +which have a more chronological structure, if this is appropriate to your project. +What is important is that you have a system and you stick to it, +so that decisions and tasks are easily reviewable long after they are completed. + +When it comes to collaboration software,\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Collaboration_Tools}} +our team uses GitHub for code-related tasks, +and Dropbox Paper for more managerial and office tasks. +GitHub has the following features that are useful for efficient workflows: +- The Issues tab is a great tool for task management. +- It creates incentives for writing down why changes were made as they are saved, creating naturally documented code. +- It is useful also because tasks can clearly be tied to file versions. +Thus, it serves as a great tool for managing code-related tasks. +On the other hand, Dropbox Paper provides a good interface with notifications, +and is very intuitive for people with non-technical backgrounds. +It is useful because tasks can be easily linked to other documents saved in Dropbox. +Thus, it is a great tool for managing non-code-related tasks. +Neither of these tools require much technical knowledge; +they merely require an agreement and workflow design +so that the people assigning the tasks are sure to set them up in the system. + \subsection{Choosing software} +Choosing the right personal and team working environment can also make your work easier. +Let's start looking at where you write code. +If you are working in R, \textbf{RStudio} is great.\sidenote{\url{https://www.rstudio.com}} +For Stata, the built-in do-file editor is the most widely adopted code editor, +but \textbf{Atom}\sidenote{\url{https://atom.io}} and \textbf{Sublime}\sidenote{\url{https://www.sublimetext.com/}} can also be configured to run Stata code. +Opening an entire directory and loading the whole tree view in the sidebar, +which gives you access to directory management actions, is a really useful feature. +This can be done using RStudio projects in RStudio, Stata projects in Stata, and directory managers in Atom and Sublime. + \section{Organizing code and data} \subsection{Version control} @@ -348,30 +398,3 @@ \subsection{Output management} But as you start to export tables and graphs, you'll want to save separate scripts, where \texttt{descriptive\_statistics.do} creates \texttt{descriptive\_statistics.tex}. - -\subsection{Preparing for collaboration and replication} - -Choosing the right personal and team working environment can also make your work easier. -Let's start looking at where you write code. -If you are working in R, \textbf{RStudio} is great.\sidenote{\url{https://www.rstudio.com}} -For Stata, the built-in do-file editor is the most widely adopted code editor, -but \textbf{Atom}\sidenote{\url{https://atom.io}} and \textbf{Sublime}\sidenote{\url{https://www.sublimetext.com/}} can also be configured to run Stata code. -Opening an entire directory and loading the whole tree view in the sidebar, -which gives you access to directory management actions, is a really useful feature. -This can be done using RStudio projects in RStudio, Stata projects in Stata, and directory managers in Atom and Sublime. 
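One way to make the link between code and the task records discussed above concrete --
offered here as a hedged sketch rather than a required practice --
is to give each do-file a short plain-text header that points back to the task or issue it responds to.
The file names, paths, and globals below are hypothetical placeholders.

\begin{verbatim}
* descriptive-stats.do -- hypothetical example of a self-documenting header
* Purpose : descriptive statistics table for the baseline report
* Author  : (initials and date)
* Task    : responds to a task or issue in the project's tracking system
*           (reference the issue or card here)
* Input   : ${datawork}/FinalData/baseline-clean.dta   (placeholder path)
* Output  : ${datawork}/Output/descriptive-stats.tex   (placeholder path)

* Because the header is plain text, version control tracks changes to the
* documentation together with changes to the code itself.
use "${datawork}/FinalData/baseline-clean.dta", clear
summarize
\end{verbatim}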
- -%\textbf{Atom},\sidenote{\url{https://atom.io}}, which can open an entire targeted directory by writing \path{atom /path/to/directory/} in the command line (\textbf{Terminal} or \textbf{PowerShell}), after copying it from the browser. Opening the entire directory loads the whole tree view in the sidebar, and gives you access to directory management actions. You can start to manage your projects as a whole -- Atom is capable of sending code to Stata,\sidenote{\url{https://atom.io/packages/stata-exec}} writing and building \LaTeX,\sidenote{\url{https://atom.io/packages/latex}} and connecting directly with others to team code.\sidenote{\url{https://atom.io/packages/teletype}} It is highly customizable, and since it is your personal environment, there are lots of stylistic and functional options in extension packages that you can use to make your work easier and more enjoyable. - -When it comes to collaboration software,\sidenote{\url{https://dimewiki.worldbank.org/wiki/Collaboration_Tools}} -the two most common softwares in use are Dropbox and GitHub.\sidenote{ -\url{https://michaelstepner.com/blog/git-vs-dropbox/}} -GitHub has the following features that are useful for efficient workflows: -- The Issues tab is a great tool for task management. -- It creates incentives for writing down why changes were made as they are saved, creating naturally documented code. -- It is useful also because tasks can clearly be tied to file versions. Thus, it serves as a great tool for -managing code-related tasks. - -On the other hand, Dropbox Paper provides a good interface with notifications. It is useful because tasks can be easily linked to other documents saved in Dropbox. Thus, it is a great tool for managing non-code-related tasks. - -Neither of these tools require much technical knowledge; they merely require an agreement and workflow design -so that the people assigning the tasks are sure to set them up in the system. Our team uses both. From 0d8f42c87b8e62a918c4bca7c410746f120b22d7 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 30 Oct 2019 01:59:18 +0530 Subject: [PATCH 065/854] Fixes --- chapters/planning-data-work.tex | 2 +- chapters/preamble.tex | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 25f0fc869..f64fe91ab 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -87,7 +87,7 @@ \subsection{Setting up your computer} On MacOS this will be something like \path{/users/username/dropbox/project/...}, and on Windows, \path{C:/users/username/github/project/...}. We will write file paths such as \path{/Dropbox/project-title/DataWork/EncryptedData/} -using forward slashes, and mostly use only A-Z, dash (\texttt{-}), and underscore (\texttt{_}). +using forward slashes, and mostly use only A-Z, dash (\texttt{-}), and underscore (\texttt{\_}). You should \textit{always} use forward slashes (\texttt{/}) in file paths, just like an internet address, and no matter how your computer writes them, because the other type will cause your work to break many systems. 
diff --git a/chapters/preamble.tex b/chapters/preamble.tex index b3fe3defc..5ba9585d8 100644 --- a/chapters/preamble.tex +++ b/chapters/preamble.tex @@ -98,7 +98,7 @@ % subsection format \titleformat{\subsection}% -{\normalfont\large}% format applied to label+text +{\normalfont\itshape\large}% format applied to label+text {}% label {}% horizontal separation between label and title body {}% before the title body From 912fe0b34d31801f5ea4d3d1c7b1986ceda5eb91 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 30 Oct 2019 02:39:04 +0530 Subject: [PATCH 066/854] Setting up --- chapters/planning-data-work.tex | 87 +++++++++++++++++++++------------ 1 file changed, 57 insertions(+), 30 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index f64fe91ab..0cc607553 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -42,19 +42,25 @@ \section{Preparing a collaborative work environment} Some of these skills may seem elementary, but thinking about simple things from a workflow perspective can help you make marginal improvements every day you work. +These add up, and together form a collaborative workflow +that will greatly accelerate your team's ability to get tasks done +on every project you take on together. Teams often develop their workflows as they go, -solving new challenges when they appear. -However, there are a number of tasks that will always have to be completed during any project. -These include organizing folders, -collaborating on code, -controlling different versions of a file, -and reviewing each other's work. -Thinking about the best way to do these tasks ahead of time, -instead of just doing it as quickly as you can when needed, -will save your team a lot of re-working. +solving new challenges as they arise. +This is broadly okay -- but it is important to recognize +that there are a number of tasks that will always have to be completed during any project, +and that the corresponding workflows can be agreed on in advance. +These include documentation methods, software choices, +naming schema, organizing folders and outputs, collaborating on code, +managing revisions to files, and reviewing each other's work. +These tasks appear in almost every project, +and also translate well between projects. +Therefore, there are large efficiency gains to +thinking about the best way to do these tasks ahead of time, +instead of just doing it quickly as needed. This chapter will outline the main points to discuss within the team, -and point to some possible solutions. +and suggest some common solutions. \subsection{Setting up your computer} @@ -62,32 +68,34 @@ \subsection{Setting up your computer} Make sure you have fully updated the operating system, that it is in good working order, and that you have a \textbf{password-protected} login. -All machines should have their hard disks encrypted; -this should be built-in to most modern operating systems -(the service is currently called BitLocker on Windows or FileVault on MacOS). -Encryption prevents your contents from ever being accessed without the password. -Then, make sure your computer is backed up to prevent information loss. -Follow the \textbf{3-2-1 rule}: -have 3 copies of everything, on at least -2 different hardware devices you have access to, -with 1 offsite storage.\sidenote{ +All machines should have hard disk encryption enabled. 
+Disk encryption is built in to most modern operating systems; +the service is currently called BitLocker on Windows or FileVault on MacOS. +Disk encryption prevents your files from ever being accessed without the system password. +As with all critical passwords, your system password should be strong, +memorable, and backed up in a separate secure location. + +Make sure your computer is backed up to prevent information loss. +Follow the \textbf{3-2-1 rule}: maintain 3 copies of everything, +on at least 2 different hardware devices you have access to, +with 1 offsite storage method.\sidenote{ \url{https://www.backblaze.com/blog/the-3-2-1-backup-strategy/}} - One reasonable setup is having your primary computer, -a local hard drive managed with a tool like Time Machine -(or a secondary computer), -and a remote copy managed by a cloud backup service. +then a local hard drive managed with a tool like Time Machine +(alternatively, a fully synced secondary computer), +and a remote copy maintained by a cloud backup service. Dropbox and other synced files count only as local copies and never as remote backups, because other users can alter them. -Find your \textbf{home folder}. On MacOS, this will be a folder with your username. -On Windows, this will be something like ``This PC''. (It is never your desktop.) +Find your \textbf{home folder}. It is never your desktop. +On MacOS, this will be a folder with your username. +On Windows, this will be something like ``This PC''. Nearly everything we talk about will assume you are starting from here. Ensure you know how to get the \textbf{absolute file path} for any given file. On MacOS this will be something like \path{/users/username/dropbox/project/...}, and on Windows, \path{C:/users/username/github/project/...}. We will write file paths such as \path{/Dropbox/project-title/DataWork/EncryptedData/} -using forward slashes, and mostly use only A-Z, dash (\texttt{-}), and underscore (\texttt{\_}). +using forward slashes (\texttt{/}), and mostly use only A-Z, dash (\texttt{-}), and underscore (\texttt{\_}). You should \textit{always} use forward slashes (\texttt{/}) in file paths, just like an internet address, and no matter how your computer writes them, because the other type will cause your work to break many systems. @@ -96,6 +104,29 @@ \subsection{Setting up your computer} Making the structure of your files part of your workflow is really important, as is naming them correctly so you know what is where. +If you are working with others, you will most likely be using some kind +of file collaboration method. +The exact method you use will depend on your tasks, +but three methods are the most common. +\textbf{File syncing} is the most familiar method, +and is implemented by software like Dropbox and OneDrive. +Sync forces everyone to have the same version of every file at the same time, +which makes simultaneous editing difficult but other tasks easier. +\textbf{Version control} is another method, +and is implemented by tools like GitHub. +Version control allows everyone to have different versions at the same time, +making simultaneous editing easier but other tasks harder. +Finally, \textbf{server storage} is the least-used method, +because there is only one version of the materials, +and simultaneous access must be carefully regulated. +However, server storage ensures that everyone has access +to exactly the same files, and also enables +high-powered computing processes for large and complex data. 
+All three methods are used for sharing and collaborating, +and you should review the types of data work +that you are going to be doing, and plan which processes +will live in which types of locations. + \subsection{Documenting decisions and tasks} The first habit that many teams need to break is using e-mail for task management. @@ -179,10 +210,6 @@ \subsection{Version control} Their structures can be managed in parallel by using \texttt{iefolder} twice. The project folder can be maintained in a synced location like Dropbox, and the code folder can be maintained in a version-controlled location like GitHub. -While both are used for sharing and collaborating, -there is a sharp difference between the functionality of sync and version control. -Namely, sync forces everyone to have the same version of every file at all times -and does not support simultaneous editing well; version control does the opposite. Keeping code in a version-controlled folder will allow you to maintain better control of its history and functionality, and because of the specificity with which code depends on file structure, From 1f7b06367067f06cee6612450d8f2b23ac32df4d Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 30 Oct 2019 03:31:17 +0530 Subject: [PATCH 067/854] Documenting and software --- chapters/planning-data-work.tex | 83 +++++++++++++++++++++++---------- 1 file changed, 59 insertions(+), 24 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 0cc607553..524d8a164 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -129,18 +129,22 @@ \subsection{Setting up your computer} \subsection{Documenting decisions and tasks} -The first habit that many teams need to break is using e-mail for task management. +Once your technical workspace is set up, +you need to decide how you are going to communicate with your team. +The first habit that many teams need to break is using e-mail for management tasks. E-mail is, simply put, not a system. It is not a system for anything. +E-mail was developed for communicating ``now'' and this is what it does well. It is not structured to manage group membership or to present the same information across a group of people, or to remind you when old information becomes relevant. It is not structured to allow people to collaborate over a long time or to review old discussions. -It is easy to miss or lose communications when they have relevance in the future. -E-mail is for communicating ``now'' and this is what it was designed to do. +It is therefore easy to miss or lose communications from the past when they have relevance in the present. Everything that is communicated over e-mail or any other medium should immediately be transferred into a system that is designed to keep records. -We call these systems collaboration tools, and there are several that are very useful. +We call these systems collaboration tools, and there are several that are very useful.\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Collaboration_Tools}} -First of all, you will find a wide variety of task management tools online. +Many task management tools are online or web-based, +so that everyone on your team can access them simultaneously. 
Many of them are based on an underlying system known as ``Kanban boards''.\sidenote{ \url{https://en.wikipedia.org/wiki/Kanban_board}} This task-oriented system allows the team to create and assign tasks, @@ -148,38 +152,69 @@ \subsection{Documenting decisions and tasks} These systems also link communications to specific tasks so that the records related to decision making on those tasks is permanently recorded. A common and free implementation of this system is the one found in GitHub project boards. -You may also use a system like GitHub Issues or task-assignment on Dropbox Paper, -which have a more chronological structure, if this is appropriate to your project. +You may also use a system like GitHub Issues or task assignment on Dropbox Paper, +which has a more chronological structure, if this is appropriate to your project. What is important is that you have a system and you stick to it, -so that decisions and tasks are easily reviewable long after they are completed. - -When it comes to collaboration software,\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Collaboration_Tools}} -our team uses GitHub for code-related tasks, -and Dropbox Paper for more managerial and office tasks. -GitHub has the following features that are useful for efficient workflows: -- The Issues tab is a great tool for task management. -- It creates incentives for writing down why changes were made as they are saved, creating naturally documented code. -- It is useful also because tasks can clearly be tied to file versions. -Thus, it serves as a great tool for managing code-related tasks. +so that decisions, discussions, and tasks are easily reviewable long after they are completed. + +Just like we use different file sharing tools for different types of files, +we use different collaboration tools for different types of tasks. +Our team, for example, uses GitHub Issues for code-related tasks, +and Dropbox Paper for more managerial and office-related tasks. +GitHub creates incentives for writing down why changes were made +as they are saved, creating naturally documented code. +It is useful also because tasks in Issues can clearly be tied to file versions. +Thus, GitHub serves as a great tool for managing code-related tasks. On the other hand, Dropbox Paper provides a good interface with notifications, and is very intuitive for people with non-technical backgrounds. It is useful because tasks can be easily linked to other documents saved in Dropbox. Thus, it is a great tool for managing non-code-related tasks. Neither of these tools require much technical knowledge; they merely require an agreement and workflow design -so that the people assigning the tasks are sure to set them up in the system. +so that the people assigning the tasks are sure to set them up in the appropriate system. \subsection{Choosing software} Choosing the right personal and team working environment can also make your work easier. Let's start looking at where you write code. -If you are working in R, \textbf{RStudio} is great.\sidenote{\url{https://www.rstudio.com}} +This book focuses mainly on primary survey data, +so we are going to broadly assume that you are using ``small'' data +in one of the two popular desktop-based packages for that kind of work: R or Stata. +(If you are using another language, like Python, +or working with big data projects on a server installation, +you can skip this section.) +The most important part of working with code is a code editor. +This does not need to be the same program as the code runs in. 
+This can be preferable since your editor will not crash if your code does, +and may offer additional features aimed at writing code well. +If you are working in R, \textbf{RStudio} is the typical choice.\sidenote{ + \url{https://www.rstudio.com}} For Stata, the built-in do-file editor is the most widely adopted code editor, -but \textbf{Atom}\sidenote{\url{https://atom.io}} and \textbf{Sublime}\sidenote{\url{https://www.sublimetext.com/}} can also be configured to run Stata code. -Opening an entire directory and loading the whole tree view in the sidebar, -which gives you access to directory management actions, is a really useful feature. -This can be done using RStudio projects in RStudio, Stata projects in Stata, and directory managers in Atom and Sublime. +and \textbf{Atom}\sidenote{\url{https://atom.io}} and \textbf{Sublime}\sidenote{\url{https://www.sublimetext.com/}} can also be configured to run Stata code externally, while offering great accessibility features. +For example, these tools can work on an entire directory -- rather than a single file -- +which gives you access to directory views and file management actions, +such as folder management, Git integration, and simultaneous work with other types of files without leaving the editor. + +In our field of development economics, +Stata is by far the most commonly used programming language, +and the Stata do-file editor the most common editor. +We focus on Stata-specific tools and instructions in this book. +This is only in part due to its popularity. +Stata is primarily a scripting language for statistics and data, +meaning that its users often come from economics and statistics backgrounds +and understand Stata to be encoding a set of tasks as a record for the future. +We believe that this must change somewhat: +in particular, we think that practitioners of Stata +must begin to think about their workflows more as programmers do, +and that people who adopt this approach will be dramatically +more capable in their analytical ability. +This means that they will be more productive when managing teams, +and more able to focus on the challenges of experimental design +and econometric analysis, rather than spending excessive time +re-solving problems on the computer. +Stata also has relatively few resources of this type available, +and the ones that we have created and shared here +we hope will be an asset to all its users. \section{Organizing code and data} From 4fbc42505f000fc14173f3c5e11e3156edeb2757 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 30 Oct 2019 03:32:19 +0530 Subject: [PATCH 068/854] Sync security (#186) --- chapters/planning-data-work.tex | 1 + 1 file changed, 1 insertion(+) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 524d8a164..c56c21ce0 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -112,6 +112,7 @@ \subsection{Setting up your computer} and is implemented by software like Dropbox and OneDrive. Sync forces everyone to have the same version of every file at the same time, which makes simultaneous editing difficult but other tasks easier. +(They also have some security concerns which we will address later.) \textbf{Version control} is another method, and is implemented by tools like GitHub. 
Version control allows everyone to have different versions at the same time, From 3715bd5a07b23deea74098f50b74aa61c33c830a Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 30 Oct 2019 03:39:40 +0530 Subject: [PATCH 069/854] Encryption (#164) --- chapters/planning-data-work.tex | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index c56c21ce0..b896129ec 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -71,7 +71,11 @@ \subsection{Setting up your computer} All machines should have hard disk encryption enabled. Disk encryption is built in to most modern operating systems; the service is currently called BitLocker on Windows or FileVault on MacOS. -Disk encryption prevents your files from ever being accessed without the system password. +Disk encryption prevents your files from ever being accessed +without first entering the system password. +This is different from file-level encryption, +which makes individual files unreadable without a specific key. +We will address that in more detail later. As with all critical passwords, your system password should be strong, memorable, and backed up in a separate secure location. From 9abf0d8db46650b1f3567c4e520d063c5a631b56 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 30 Oct 2019 03:54:51 +0530 Subject: [PATCH 070/854] Software choice (#170) --- chapters/planning-data-work.tex | 18 ++++++++++++++++-- 1 file changed, 16 insertions(+), 2 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index b896129ec..004962f26 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -181,8 +181,22 @@ \subsection{Documenting decisions and tasks} \subsection{Choosing software} Choosing the right personal and team working environment can also make your work easier. -Let's start looking at where you write code. -This book focuses mainly on primary survey data, +It may be difficult or costly to switch halfway through a project, so +think ahead about the different software to be used. +Take into account the different levels of techiness of team members, +how important it is to access files offline constantly, +as well as the type of data you will need to access and the security needed. +Big datasets require additional infrastructure and may overburden +the traditional tools used for small datasets, +particularly if you are trying to sync or collaborate on them. +Also consider the cost of licenses, the time to learn new tools, +and the stability of the tools. +There are few strictly right or wrong answers, +but what is important is that you have a plan in advance +and understand how your tools with interact with your work. + +Next, think about how and where you write code. +The rest of this book focuses mainly on primary survey data, so we are going to broadly assume that you are using ``small'' data in one of the two popular desktop-based packages for that kind of work: R or Stata. 
(If you are using another language, like Python, From ead947cf0c819a344d75b5ae83f1dcdb32bb1e36 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 30 Oct 2019 04:03:27 +0530 Subject: [PATCH 071/854] Environment (#202) --- chapters/planning-data-work.tex | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 004962f26..20cc5334f 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -195,6 +195,21 @@ \subsection{Choosing software} but what is important is that you have a plan in advance and understand how your tools with interact with your work. +Ultimately, the goal is to ensure that you will be able to hold +your code environment constant over the life cycle of a single project. +While this means you will inevitably have different projects +with different code environments, each one will be better than the last, +and you will avoid the extremely costly process of migrating a project +into a new code enviroment. +This can be set up down to the software level: +you need to ensure that even specific versions of software +and the individual packages you use +are referenced or maintained so that they can be reproduced going forward +even if their most recent version contains changes that would break your code. +(For example, our command \texttt{ieboilstart} in the \texttt{ietoolkit} package +provides functionality to support Stata version stability.\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/ieboilstart}}) + Next, think about how and where you write code. The rest of this book focuses mainly on primary survey data, so we are going to broadly assume that you are using ``small'' data From 50edc6191327557a50afca2a8745e652e99a605e Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 30 Oct 2019 04:32:17 +0530 Subject: [PATCH 072/854] Management intro (#178) --- chapters/planning-data-work.tex | 122 +++++++++++++++++++------------- 1 file changed, 74 insertions(+), 48 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 20cc5334f..7ca490eb3 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -252,55 +252,33 @@ \subsection{Choosing software} \section{Organizing code and data} -\subsection{Version control} - -A \textbf{version control system} is the way you manage the changes to any computer file. -This is important, for example, for your team to be able to find the version of a presentation that you delivered to your donor, -but also to understand why the significance level of your estimates has changed. -Everyone who has ever encountered a file named something like \texttt{final\_report\_v5\_LJK\_KLE\_jun15.docx} -can appreciate how useful such a system can be. -Most file sharing solutions offer some level of version control. -These are usually enough to manage changes to binary files (such as Word and PowerPoint documents) without needing to appeal to these dreaded file names. -For code files, however, a more complex version control system is usually desirable. -We recommend using Git\sidenote{\textbf{Git:} a multi-user version control system for collaborating on and tracking changes to code as it is written.} for all plain text files. -Git tracks all the changes you make to your code, -and allows you to go back to previous versions without losing the information on changes made. 
-It also makes it possible to work on two parallel versions of the code, -so you don't risk breaking the code for other team members as you try something new, - -Increasingly, we recommend the entire data work folder -to be created and stored separately in GitHub. -Nearly all code and outputs (except datasets) are better managed this way. -Code is written in its native language, -and it's becoming more and more common for written outputs such as reports, -presentations and documentations to be written using different \textbf{literate programming} -tools such as {\LaTeX} and dynamic documents. -You should therefore feel comfortable having both a project folder and a code folder. -Their structures can be managed in parallel by using \texttt{iefolder} twice. -The project folder can be maintained in a synced location like Dropbox, -and the code folder can be maintained in a version-controlled location like GitHub. -Keeping code in a version-controlled folder will allow you -to maintain better control of its history and functionality, -and because of the specificity with which code depends on file structure, -you will be able to enforce better practices there than in the project folder. - - -\subsection{Folder management} - -The first thing your team will need to create is a shared folder.\sidenote{Common tools for folder sharing are Dropbox, Box, and OneDrive.} -If every team member is working on their local computers, -there will be a lot of crossed wires when collaborating on any single file, -and e-mailing one document back and forth is not efficient. -Your folder will contain all your project's documents. -It will be the living memory of your work. -The most important thing about this folder is for everyone in the team to know how to navigate it. -Creating folders with self-explanatory names will make this a lot easier. -Naming conventions may seem trivial, -but often times they only make sense to whoever created them. -It will often make sense for the person in the team who uses a folder the most to create it. +Organizing files and folders is not a trivial task. +What is intuitive to one person rarely comes naturally to another, +and searching for files and folders is everybody's least favorite task. +As often as not, you come up with the wrong one, +and then it becomes very easy to create problems that require complex resolutions later. +This section will provide basic tips on managing the folder +that will store your project's data work. + +We assume you will be working with code and data throughout your project. +We further assume you will want all your processes to be recorded +and easily findable at any point in time. +Maintaining an organized file structure for data work is the best way +to ensure that you, your teammates, and others +are able to easily work on, edit, and replicate your work in the future. +It also ensures that core automation processes like script tools +are able to interact well will your work, +whether they are yours or those of others. +File organization makes your own work easier as well as more transparent, +and plays well with tools like version control systems +that aim to cut down on the amount of repeated tasks you have to perform. +It is worth thinking in advance about how to store, name, and organize +the different types of files you will be working with, +so that there is no confusion down the line +and everyone has interoperable expectations. 
+ +\subsection{File and folder management} -For the purpose of this book, -we're mainly interested in the folder that will store the project's data work. Agree with your team on a specific folder structure, and set it up at the beginning of the research project to prevent folder re-organization that may slow down your workflow and, @@ -316,6 +294,18 @@ \subsection{Folder management} is that changing from one project to another requires less time to get acquainted with a new organization scheme. +The first thing your team will need to create is a shared folder. +If every team member is working on their local computers, +there will be a lot of crossed wires when collaborating on any single file, +and e-mailing one document back and forth is not efficient. +Your folder will contain all your project's documents. +It will be the living memory of your work. +The most important thing about this folder is for everyone in the team to know how to navigate it. +Creating folders with self-explanatory names will make this a lot easier. +Naming conventions may seem trivial, +but often times they only make sense to whoever created them. +It will often make sense for the person in the team who uses a folder the most to create it. + \texttt{iefolder} also creates master do-files.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Master\_Do-files}} Master scripts are a key element of code organization and collaboration, and we will discuss some important features soon. @@ -333,6 +323,42 @@ \subsection{Folder management} The code in \texttt{stata-master-dofile.do} shows how folder structure is reflected in a master do-file. +\subsection{Version control} + +A \textbf{version control system} is required to manage changes to any computer file. +A good version control system tracks who edited each file and when, +and additionally providers a protocol for ensuring that conflicting versions are avoided. +This is important, for example, for your team to be able to find the version of a presentation that you delivered to a donor, +and also to understand why the significance level of your estimates has changed. +Everyone who has ever encountered a file named something like \texttt{final\_report\_v5\_LJK\_KLE\_jun15.docx} +can appreciate how useful such a system can be. +Most file sharing solutions offer some kind of version control. +These are usually enough to manage changes to binary files (such as Word and PowerPoint documents) without needing to appeal to these dreaded filename-based versioning conventions. +For code files, however, a more complex version control system is usually desirable. +We recommend using Git\sidenote{\textbf{Git:} a multi-user version control system for collaborating on and tracking changes to code as it is written.} for all plain text files. +Git tracks all the changes you make to your code, +and allows you to go back to previous versions without losing the information on changes made. +It also makes it possible to work on two parallel versions of the code, +so you don't risk breaking the code for other team members as you try something new, + +Increasingly, we recommend the entire data work folder +to be created and stored separately in GitHub. +Nearly all code and outputs (except datasets) are better managed this way. +Code is written in its native language, +and it's becoming more and more common for written outputs such as reports, +presentations and documentations to be written using different \textbf{literate programming} +tools such as {\LaTeX} and dynamic documents. 
+You should therefore feel comfortable having both a project folder and a code folder. +Their structures can be managed in parallel by using \texttt{iefolder} twice. +The project folder can be maintained in a synced location like Dropbox, +and the code folder can be maintained in a version-controlled location like GitHub. +Keeping code in a version-controlled folder will allow you +to maintain better control of its history and functionality, +and because of the specificity with which code depends on file structure, +you will be able to enforce better practices there than in the project folder. + + + \subsection{Code management} From 10333adeb7061731f04ef503e35944aed47cf1ec Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 30 Oct 2019 04:35:34 +0530 Subject: [PATCH 073/854] 3-2-1 clarification (#183) --- chapters/planning-data-work.tex | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 7ca490eb3..df7b74dfe 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -87,7 +87,8 @@ \subsection{Setting up your computer} One reasonable setup is having your primary computer, then a local hard drive managed with a tool like Time Machine (alternatively, a fully synced secondary computer), -and a remote copy maintained by a cloud backup service. +and either a remote copy maintained by a cloud backup service +or all original files stored on a remote server. Dropbox and other synced files count only as local copies and never as remote backups, because other users can alter them. From 7d5bb69d3474eeb55b6eafb2288966769934b1ec Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 30 Oct 2019 04:37:49 +0530 Subject: [PATCH 074/854] Note script-do interchangeable (#191) --- chapters/planning-data-work.tex | 2 ++ 1 file changed, 2 insertions(+) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index df7b74dfe..f2b8a30b6 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -234,6 +234,8 @@ \subsection{Choosing software} Stata is by far the most commonly used programming language, and the Stata do-file editor the most common editor. We focus on Stata-specific tools and instructions in this book. +Hence, we will use the terms `script' and `do-file' +interchangeably to refer to Stata code throughout. This is only in part due to its popularity. Stata is primarily a scripting language for statistics and data, meaning that its users often come from economics and statistics backgrounds From 79667ddf4835d1b0a592da414a814d39beded9fd Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 30 Oct 2019 09:47:20 +0530 Subject: [PATCH 075/854] More details for iefolder (#83) --- chapters/planning-data-work.tex | 110 +++++++++++++++++++------------- 1 file changed, 65 insertions(+), 45 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index f2b8a30b6..edd595a20 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -1,4 +1,4 @@ -%------------------------------------------------ +% ---------------------------------------------------------------------------------------------- \begin{fullwidth} Preparation for data work begins long before you collect any data, @@ -32,7 +32,8 @@ so that you do not spend a lot of time later figuring out basic functions. 
\end{fullwidth} -%------------------------------------------------ +% ---------------------------------------------------------------------------------------------- +% ---------------------------------------------------------------------------------------------- \section{Preparing a collaborative work environment} @@ -62,6 +63,7 @@ \section{Preparing a collaborative work environment} This chapter will outline the main points to discuss within the team, and suggest some common solutions. +% ---------------------------------------------------------------------------------------------- \subsection{Setting up your computer} First things first: turn on your computer. @@ -133,6 +135,7 @@ \subsection{Setting up your computer} that you are going to be doing, and plan which processes will live in which types of locations. +% ---------------------------------------------------------------------------------------------- \subsection{Documenting decisions and tasks} Once your technical workspace is set up, @@ -179,6 +182,7 @@ \subsection{Documenting decisions and tasks} they merely require an agreement and workflow design so that the people assigning the tasks are sure to set them up in the appropriate system. +% ---------------------------------------------------------------------------------------------- \subsection{Choosing software} Choosing the right personal and team working environment can also make your work easier. @@ -253,6 +257,8 @@ \subsection{Choosing software} and the ones that we have created and shared here we hope will be an asset to all its users. +% ---------------------------------------------------------------------------------------------- +% ---------------------------------------------------------------------------------------------- \section{Organizing code and data} Organizing files and folders is not a trivial task. @@ -280,52 +286,51 @@ \section{Organizing code and data} so that there is no confusion down the line and everyone has interoperable expectations. +% ---------------------------------------------------------------------------------------------- \subsection{File and folder management} -Agree with your team on a specific folder structure, and +Agree with your team on a specific directory structure, and set it up at the beginning of the research project -to prevent folder re-organization that may slow down your workflow and, -more importantly, prevent your code files from running. -DIME Analytics created and maintains -\texttt{iefolder}\sidenote{\url{https://dimewiki.worldbank.org/wiki/iefolder}} -as a part of our \texttt{ietoolkit} suite. -This command sets up a standardized folder structure for what we call the \texttt{/DataWork/} folder.\sidenote{\url{https://dimewiki.worldbank.org/wiki/DataWork_Folder}} -It includes folders for all the steps of a typical DIME project. -However, since each project will always have its own needs, -we tried to make it as easy as possible to adapt when that is the case. +in your synced or shared top-level folder. +This will prevent folder re-organization that may slow down your workflow and, +more importantly, ensure your code files are always able to run on any machine. +To support this, DIME Analytics created and maintains \texttt{iefolder}\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/iefolder}} +as a part of our \texttt{ietoolkit} package. 
+This command sets up a pre-standardized folder structure for what we call the \texttt{/DataWork/} folder.\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/DataWork_Folder}} +The \texttt{/DataWork/} folder includes folders for all the steps of a typical project. +Since each project will always have its own needs, +we have tried to make it as easy as possible to adapt when that is the case. The main advantage of having a universally standardized folder structure is that changing from one project to another requires less time to get acquainted with a new organization scheme. - -The first thing your team will need to create is a shared folder. -If every team member is working on their local computers, -there will be a lot of crossed wires when collaborating on any single file, -and e-mailing one document back and forth is not efficient. -Your folder will contain all your project's documents. -It will be the living memory of your work. -The most important thing about this folder is for everyone in the team to know how to navigate it. -Creating folders with self-explanatory names will make this a lot easier. -Naming conventions may seem trivial, -but often times they only make sense to whoever created them. -It will often make sense for the person in the team who uses a folder the most to create it. - -\texttt{iefolder} also creates master do-files.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Master\_Do-files}} -Master scripts are a key element of code organization and collaboration, -and we will discuss some important features soon. -With regard to folder structure, it's important to keep in mind -that the master script should mimic the structure of the \texttt{/DataWork/} folder. -This is done through the creation of globals (in Stata) or string scalars (in R). -These are ``variables'' -- coding shortcuts that refer to subfolders, -so that those folders can be referenced without repeatedly writing out their complete filepaths. -Because the \texttt{/DataWork/} folder is shared by the whole team, -its structure is the same in each team member's computer. -What may differ is the path to the project folder (the highest-level shared folder). -This is reflected in the master script in such a way that -the only change necessary to run the entire code from a new computer -is to change the path to the project folder. -The code in \texttt{stata-master-dofile.do} shows how folder structure is reflected in a master do-file. - - +For our group, maintaining a single unified directory structure +across the entire portfolio of projects means that everyone +can easily move between projects without having to re-orient +themselves to how files and folders are organized. + +The DIME \texttt{iefolder} structure is not for everyone. +However, if you do not already have a standard file structure across projects, +it is intended to be an easy template to start from. +This structure operates by creating a \texttt{/DataWork/} folder at the project level, +and within that folder, it provides standardized directory structures +for each data source (in the primary data context, ``rounds'' of data collection). +For each, \texttt{iefolder} creates folders for raw encrypted data, +raw deidentified data, cleaned data, final data, outputs, and documentation. +In parallel, it creates folders for the code files +that move the data through this progression, +and for the files that manage final analytical work. 
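As a quick illustration of running the command itself -- the project path and round name below are
placeholders, and the exact syntax and options should be checked against the \texttt{iefolder}
entry on the DIME Wiki -- a new structure can be created with a few lines of Stata:

\begin{verbatim}
* Install (or update) the ietoolkit package, which contains iefolder
ssc install ietoolkit , replace

* Set up the standardized /DataWork/ folder inside a hypothetical project folder
* (the folder path here is an assumption for illustration)
iefolder new project , projectfolder("/Users/username/Dropbox/project-title")

* Add the folder structure for one hypothetical round of data collection;
* check the exact subcommand syntax on the DIME Wiki before running
iefolder new round "baseline" , projectfolder("/Users/username/Dropbox/project-title")
\end{verbatim}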
+The command also has some flexibility for the addition of +folders for non-primary data sources, although this is less well developed. +The package also includes the \texttt{iegitaddmd} command, +which can place a \texttt{README.md} file in each of these folders. +These \textbf{Markdown} files provide an easy and GitHub-compatible way +to document the contents of every folder in the structure. + + + +% ---------------------------------------------------------------------------------------------- \subsection{Version control} A \textbf{version control system} is required to manage changes to any computer file. @@ -360,9 +365,7 @@ \subsection{Version control} and because of the specificity with which code depends on file structure, you will be able to enforce better practices there than in the project folder. - - - +% ---------------------------------------------------------------------------------------------- \subsection{Code management} Once you start a project's data work, @@ -432,6 +435,22 @@ \subsection{Code management} The master script is also where all the settings are established, such as folder paths, functions and constants used throughout the project. +\texttt{iefolder} also creates master do-files.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Master\_Do-files}} +Master scripts are a key element of code organization and collaboration, +and we will discuss some important features soon. +With regard to folder structure, it's important to keep in mind +that the master script should mimic the structure of the \texttt{/DataWork/} folder. +This is done through the creation of globals (in Stata) or string scalars (in R). +These are ``variables'' -- coding shortcuts that refer to subfolders, +so that those folders can be referenced without repeatedly writing out their complete filepaths. +Because the \texttt{/DataWork/} folder is shared by the whole team, +its structure is the same in each team member's computer. +What may differ is the path to the project folder (the highest-level shared folder). +This is reflected in the master script in such a way that +the only change necessary to run the entire code from a new computer +is to change the path to the project folder. +The code in \texttt{stata-master-dofile.do} shows how folder structure is reflected in a master do-file. + Agree with your team on a plan to review code as it is written. Reading other people's code is the best way to improve your coding skills. And having another set of eyes on your code will make you more comfortable with the results you find. @@ -444,6 +463,7 @@ \subsection{Code management} Making sure that the code is running, and that other people can understand the code is also the easiest way to ensure a smooth project handover. +% ---------------------------------------------------------------------------------------------- \subsection{Output management} Another task that needs to be discussed with your team is the best way to manage outputs. 
From 025aff0f1528ff3f4eba9558b0dc01006307049b Mon Sep 17 00:00:00 2001
From: Benjamin Daniels
Date: Wed, 30 Oct 2019 09:57:07 +0530
Subject: [PATCH 076/854] Naming (#187)

---
 chapters/planning-data-work.tex | 29 +++++++++++++++++++++++++++--
 1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex
index edd595a20..94db0af0f 100644
--- a/chapters/planning-data-work.tex
+++ b/chapters/planning-data-work.tex
@@ -106,8 +106,7 @@ \subsection{Setting up your computer}
 You should \textit{always} use forward slashes (\texttt{/}) in file paths,
 just like an internet address, and no matter how your computer writes them,
 because the other type will cause your work to break many systems.
-You can use spaces in names of non-technical files, but not technical ones.\sidenote{
- \url{http://www2.stat.duke.edu/~rcs46/lectures_2015/01-markdown-git/slides/naming-slides/naming-slides.pdf}}
+You can use spaces in names of non-technical files, but not technical ones.
 Making the structure of your files part of your workflow is really important,
 as is naming them correctly so you know what is where.
 
@@ -328,6 +327,32 @@ \subsection{File and folder management}
 These \textbf{Markdown} files provide an easy and GitHub-compatible way
 to document the contents of every folder in the structure.
 
+Once the directory structure is set up,
+you should adopt a file naming convention.
+You will be working with two types of files:
+``technical'' files, which are those that are accessed by code processes,
+and ``non-technical'' files, which will not be accessed by code processes.
+The former takes precedence: an Excel file is a technical file
+even if it is a field log, because at some point it will be used by code.
+We will not give much emphasis to non-technical files here;
+but you should make sure to name them in an orderly fashion that works
+for your team.
+This will ensure you can find files within folders
+and reduce the amount of time others will spend opening files
+to find out what is inside them.
+Technical files, however, have stricter requirements.\sidenote{
+ \url{http://www2.stat.duke.edu/~rcs46/lectures_2015/01-markdown-git/slides/naming-slides/naming-slides.pdf}}
+For example, you should never use spaces in technical names;
+this can cause problems in code. (This includes all folder names.)
+One practice that takes some getting used to
+is the fact that the best names from a coding perspective
+are usually the opposite of those from an English perspective.
+For example, for a deidentified household dataset from the baseline round,
+you should prefer a name like \texttt{baseline-household-deidentified.dta}.
+This ensures that all \texttt{baseline} data stays together,
+then all \texttt{baseline-household} data,
+and then provides unique information about this one.
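To illustrate why such code-friendly names pay off, the hypothetical snippet below processes every
round of (assumed) deidentified household data with one short loop; the folder global and the round
filenames are assumptions for illustration only, following the convention just described:

\begin{verbatim}
* Assumes a global ${data} pointing to the de-identified data folder,
* with files named <round>-household-deidentified.dta as described above
foreach round in baseline midline endline {

    * Load this round's data and report how many households it contains
    use "${data}/`round'-household-deidentified.dta", clear
    count
}
\end{verbatim}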
+ % ---------------------------------------------------------------------------------------------- From e9bd94acd225132247bd183b524771cdebde17ca Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 30 Oct 2019 10:10:48 +0530 Subject: [PATCH 077/854] Folders and plaintext (#191 #192) --- chapters/planning-data-work.tex | 40 ++++++++++++++++++--------------- 1 file changed, 22 insertions(+), 18 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 94db0af0f..c5f664e81 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -309,6 +309,28 @@ \subsection{File and folder management} can easily move between projects without having to re-orient themselves to how files and folders are organized. +The \texttt{/DataWork/} folder may be created either inside +an existing project-based folder structure, or it may be created separately. +Increasingly, we recommend that you create entire data work folder +separately in a git-managed folder, and reserve the project folder +for tasks related to data collection and other project management work. +The project folder can be maintained in a synced location like Dropbox, +and the code folder can be maintained in a version-controlled location like GitHub. +(A version-controlled folder can \textit{never} be stored inside a synced folder, +because the versioning features are extremely disruptive to others +when the syncing utility operates on them.) +Nearly all code and raw outputs (not datasets) are better managed this way. +This is because code files are usually \textbf{plaintext} files, +and non-technical files are usually \textbf{binary} files. +It's also becoming more and more common for written outputs such as reports, +presentations and documentations to be written using plaintext +tools such as {\LaTeX} and dynamic documents. +Keeping such plaintext files in a version-controlled folder allows you +to maintain better control of their history and functionality. +Because of the specificity with which code files depends on file structure, +you will be able to enforce better practices there than in the project folder, +which will usually be managed by a PI, FC, or field team members. + The DIME \texttt{iefolder} structure is not for everyone. However, if you do not already have a standard file structure across projects, it is intended to be an easy template to start from. @@ -353,8 +375,6 @@ \subsection{File and folder management} then all \texttt{baseline-household} data, and then provides unique information about this one. - - % ---------------------------------------------------------------------------------------------- \subsection{Version control} @@ -374,22 +394,6 @@ \subsection{Version control} It also makes it possible to work on two parallel versions of the code, so you don't risk breaking the code for other team members as you try something new, -Increasingly, we recommend the entire data work folder -to be created and stored separately in GitHub. -Nearly all code and outputs (except datasets) are better managed this way. -Code is written in its native language, -and it's becoming more and more common for written outputs such as reports, -presentations and documentations to be written using different \textbf{literate programming} -tools such as {\LaTeX} and dynamic documents. -You should therefore feel comfortable having both a project folder and a code folder. -Their structures can be managed in parallel by using \texttt{iefolder} twice. 
-The project folder can be maintained in a synced location like Dropbox, -and the code folder can be maintained in a version-controlled location like GitHub. -Keeping code in a version-controlled folder will allow you -to maintain better control of its history and functionality, -and because of the specificity with which code depends on file structure, -you will be able to enforce better practices there than in the project folder. - % ---------------------------------------------------------------------------------------------- \subsection{Code management} From d7c8b5560a4b518bb833040267533a4df43b7e87 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 30 Oct 2019 10:12:22 +0530 Subject: [PATCH 078/854] Who (#201) --- chapters/planning-data-work.tex | 1 + 1 file changed, 1 insertion(+) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index c5f664e81..65485fb2c 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -311,6 +311,7 @@ \subsection{File and folder management} The \texttt{/DataWork/} folder may be created either inside an existing project-based folder structure, or it may be created separately. +It should always be created by the leading RA by agreement with the PI. Increasingly, we recommend that you create entire data work folder separately in a git-managed folder, and reserve the project folder for tasks related to data collection and other project management work. From 03e80afd7ae20588673cf8d1eff91271aa1e5378 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 30 Oct 2019 10:16:01 +0530 Subject: [PATCH 079/854] Style guide (#133) --- chapters/planning-data-work.tex | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 65485fb2c..3d93ac975 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -252,6 +252,11 @@ \subsection{Choosing software} and more able to focus on the challenges of experimental design and econometric analysis, rather than spending excessive time re-solving problems on the computer. +To support this goal, this book also includes +an introductory Stata Style Guide +that we use in our work, which provides +some new standards for coding so that code styles +can be harmonized across teams for easier understanding and reuse of code. Stata also has relatively few resources of this type available, and the ones that we have created and shared here we hope will be an asset to all its users. @@ -358,8 +363,8 @@ \subsection{File and folder management} The former takes precedent: an Excel file is a technical file even if it is a field log, because at some point it will be used by code. We will not give much emphasis to non-technical files here; -but you should make sure to name them in an orderly fashion that works -for your team. +but you should make sure to name them +in an orderly fashion that works for your team. This will ensure you can find files within folders and reduce the amount of time others will spend opening files to find out what is inside them. 
From 2dd180b9a553371cd0ac9534e80dcc9f41561467 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 30 Oct 2019 10:18:49 +0530 Subject: [PATCH 080/854] Naming clarification (#119) --- chapters/planning-data-work.tex | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 3d93ac975..5f3db8b77 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -106,7 +106,6 @@ \subsection{Setting up your computer} You should \textit{always} use forward slashes (\texttt{/}) in file paths, just like an internet address, and no matter how your computer writes them, because the other type will cause your work to break many systems. -You can use spaces in names of non-technical files, but not technical ones. Making the structure of your files part of your workflow is really important, as is naming them correctly so you know what is where. @@ -365,6 +364,9 @@ \subsection{File and folder management} We will not give much emphasis to non-technical files here; but you should make sure to name them in an orderly fashion that works for your team. +You can use spaces and datestamps in names of non-technical files, but not technical ones: +the former might have a name like \texttt{2019-10-30 Sampling Procedure Description.docx} +while the latter might have a name like \texttt{endline-sampling.do}. This will ensure you can find files within folders and reduce the amount of time others will spend opening files to find out what is inside them. From 786ce2be4c7f8487727b4b6b6d03de7b4b87d95e Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 30 Oct 2019 10:25:33 +0530 Subject: [PATCH 081/854] More naming (#122) --- chapters/planning-data-work.tex | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 5f3db8b77..df38570d4 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -378,7 +378,8 @@ \subsection{File and folder management} is the fact that the best names from a coding perspective are usually the opposite of those from an English perspective. For example, for a deidentified household dataset from the baseline round, -you will prefer want a name like \texttt{baseline-household-deidentified.dta}. +you should prefer a name like \texttt{baseline-household-deidentified.dta}, +rather than the opposite way around as occurs in natural language. This ensures that all \texttt{baseline} data stays together, then all \texttt{baseline-household} data, and then provides unique information about this one. From 734a6b3a40baeb7a0e80306db4e05b226cdf04f5 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 30 Oct 2019 10:37:48 +0530 Subject: [PATCH 082/854] Integrate VC and folder management --- chapters/planning-data-work.tex | 45 ++++++++++++++++++--------------- 1 file changed, 24 insertions(+), 21 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index df38570d4..8650c005e 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -336,8 +336,30 @@ \subsection{File and folder management} you will be able to enforce better practices there than in the project folder, which will usually be managed by a PI, FC, or field team members. -The DIME \texttt{iefolder} structure is not for everyone. 
-However, if you do not already have a standard file structure across projects, +Setting up the \texttt{/DataWork/} folder folder in a git-managed directory +also enabled you to use Git and GitHub for version control on your code files. +A \textbf{version control system} is required to manage changes to any technical file. +A good version control system tracks who edited each file and when, +and additionally providers a protocol for ensuring that conflicting versions are avoided. +This is important, for example, for your team to be able to find the version of a presentation that you delivered to a donor, +and also to understand why the significance level of your estimates has changed. +Everyone who has ever encountered a file named something like \texttt{final\_report\_v5\_LJK\_KLE\_jun15.docx} +can appreciate how useful such a system can be. +Most file-syncing solutions offer some kind of version control; +These are usually enough to manage changes to binary files (such as Word and PowerPoint documents) +without needing to rely on these dreaded filename-based versioning conventions. +For technical files, however, a more complex version control system is usually desirable. +We recommend using Git\sidenote{ + \textbf{Git:} a multi-user version control system for collaborating on and tracking changes to code as it is written.} +for all plaintext files. +Git tracks all the changes you make to your code, +and allows you to go back to previous versions without losing the information on changes made. +It also makes it possible to work on two parallel versions of the code, +so you don't risk breaking the code for other team members as you try something new. +The DIME \texttt{iefolder} approach is designed with this in mind. + +However, the DIME structure is not for everyone. +If you do not already have a standard file structure across projects, it is intended to be an easy template to start from. This structure operates by creating a \texttt{/DataWork/} folder at the project level, and within that folder, it provides standardized directory structures @@ -384,25 +406,6 @@ \subsection{File and folder management} then all \texttt{baseline-household} data, and then provides unique information about this one. -% ---------------------------------------------------------------------------------------------- -\subsection{Version control} - -A \textbf{version control system} is required to manage changes to any computer file. -A good version control system tracks who edited each file and when, -and additionally providers a protocol for ensuring that conflicting versions are avoided. -This is important, for example, for your team to be able to find the version of a presentation that you delivered to a donor, -and also to understand why the significance level of your estimates has changed. -Everyone who has ever encountered a file named something like \texttt{final\_report\_v5\_LJK\_KLE\_jun15.docx} -can appreciate how useful such a system can be. -Most file sharing solutions offer some kind of version control. -These are usually enough to manage changes to binary files (such as Word and PowerPoint documents) without needing to appeal to these dreaded filename-based versioning conventions. -For code files, however, a more complex version control system is usually desirable. -We recommend using Git\sidenote{\textbf{Git:} a multi-user version control system for collaborating on and tracking changes to code as it is written.} for all plain text files. 
-Git tracks all the changes you make to your code, -and allows you to go back to previous versions without losing the information on changes made. -It also makes it possible to work on two parallel versions of the code, -so you don't risk breaking the code for other team members as you try something new, - % ---------------------------------------------------------------------------------------------- \subsection{Code management} From c1c3c1e997d2e80b8fcd44bdd3520ec8d7c7bbb4 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 30 Oct 2019 10:38:48 +0530 Subject: [PATCH 083/854] Editing and rearranging --- chapters/planning-data-work.tex | 62 ++++++++++++++++----------------- 1 file changed, 31 insertions(+), 31 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 8650c005e..921c48a96 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -336,6 +336,36 @@ \subsection{File and folder management} you will be able to enforce better practices there than in the project folder, which will usually be managed by a PI, FC, or field team members. +Once the directory structure is set up, +you should adopt a file naming convention. +You will be working with two types of files: +``technical'' files, which are those that are accessed by code processes, +and ``non-technical'' files, which will not be accessed by code processes. +The former takes precedent: an Excel file is a technical file +even if it is a field log, because at some point it will be used by code. +We will not give much emphasis to non-technical files here; +but you should make sure to name them +in an orderly fashion that works for your team. +You can use spaces and datestamps in names of non-technical files, but not technical ones: +the former might have a name like \texttt{2019-10-30 Sampling Procedure Description.docx} +while the latter might have a name like \texttt{endline-sampling.do}. +This will ensure you can find files within folders +and reduce the amount of time others will spend opening files +to find out what is inside them. +Technical files, however, have stricter requirements.\sidenote{ + \url{http://www2.stat.duke.edu/~rcs46/lectures_2015/01-markdown-git/slides/naming-slides/naming-slides.pdf}} +For example, you should never use spaces in technical names; +this can cause problems in code. (This includes all folder names.) +One practice that takes some getting used to +is the fact that the best names from a coding perspective +are usually the opposite of those from an English perspective. +For example, for a deidentified household dataset from the baseline round, +you should prefer a name like \texttt{baseline-household-deidentified.dta}, +rather than the opposite way around as occurs in natural language. +This ensures that all \texttt{baseline} data stays together, +then all \texttt{baseline-household} data, +and then provides unique information about this one. + Setting up the \texttt{/DataWork/} folder folder in a git-managed directory also enabled you to use Git and GitHub for version control on your code files. A \textbf{version control system} is required to manage changes to any technical file. @@ -356,7 +386,7 @@ \subsection{File and folder management} and allows you to go back to previous versions without losing the information on changes made. It also makes it possible to work on two parallel versions of the code, so you don't risk breaking the code for other team members as you try something new. 
-The DIME \texttt{iefolder} approach is designed with this in mind. +The DIME approach is designed with this in mind. However, the DIME structure is not for everyone. If you do not already have a standard file structure across projects, @@ -376,36 +406,6 @@ \subsection{File and folder management} These \textbf{Markdown} files provide an easy and GitHub-compatible way to document the contents of every folder in the structure. -Once the directory structure is set up, -you should adopt a file naming convention. -You will be working with two types of files: -``technical'' files, which are those that are accessed by code processes, -and ``non-technical'' files, which will not be accessed by code processes. -The former takes precedent: an Excel file is a technical file -even if it is a field log, because at some point it will be used by code. -We will not give much emphasis to non-technical files here; -but you should make sure to name them -in an orderly fashion that works for your team. -You can use spaces and datestamps in names of non-technical files, but not technical ones: -the former might have a name like \texttt{2019-10-30 Sampling Procedure Description.docx} -while the latter might have a name like \texttt{endline-sampling.do}. -This will ensure you can find files within folders -and reduce the amount of time others will spend opening files -to find out what is inside them. -Technical files, however, have stricter requirements.\sidenote{ - \url{http://www2.stat.duke.edu/~rcs46/lectures_2015/01-markdown-git/slides/naming-slides/naming-slides.pdf}} -For example, you should never use spaces in technical names; -this can cause problems in code. (This includes all folder names.) -One practice that takes some getting used to -is the fact that the best names from a coding perspective -are usually the opposite of those from an English perspective. -For example, for a deidentified household dataset from the baseline round, -you should prefer a name like \texttt{baseline-household-deidentified.dta}, -rather than the opposite way around as occurs in natural language. -This ensures that all \texttt{baseline} data stays together, -then all \texttt{baseline-household} data, -and then provides unique information about this one. - % ---------------------------------------------------------------------------------------------- \subsection{Code management} From 399915572ca2885732ebf2597bf48aa702a76a34 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 30 Oct 2019 10:44:42 +0530 Subject: [PATCH 084/854] 200 lines (#120) --- chapters/planning-data-work.tex | 22 ++++++++++++++-------- 1 file changed, 14 insertions(+), 8 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 921c48a96..45cbe92dd 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -417,8 +417,8 @@ \subsection{Code management} but if the code is well-organized, they will be much easier to make. Below we discuss a few crucial steps to code organization. They all come from the principle that code is an output by itself, -not just a means to an end. -So code should be written thinking of how easy it will be for someone to read it later. +not just a means to an end, +and code should be written thinking of how easy it will be for someone to read it later. Code documentation is one of the main factors that contribute to readability, if not the main one. 
@@ -458,11 +458,12 @@ \subsection{Code management} You can then add and navigate through them using the find command. Since Stata code is harder to navigate, as you will need to scroll through the document, it's particularly important to avoid writing very long scripts. -One reasonable rule of thumb is to not write files that have more than 200 lines. -This is also true for other statistical software, -though not following it will not cause such a hassle. - -\codeexample{stata-master-dofile.do}{./code/stata-master-dofile.do} +Therefore, in Stata at least, you should also consider breaking code tasks down +into separate do-files, since there is no limit on how many you can have, +how detailed their names can be, and no advantage to writing longer files. +One reasonable rule of thumb is to not write do-files that have more than 200 lines. +This is an arbitrary limit, just like the standard restriction of each line to 80 characters: +it seems to be ``enough but not too much'' for most purposes. To bring all these smaller code files together, maintain a master script. A master script is the map of all your project's data work, @@ -476,7 +477,8 @@ \subsection{Code management} The master script is also where all the settings are established, such as folder paths, functions and constants used throughout the project. -\texttt{iefolder} also creates master do-files.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Master\_Do-files}} +\texttt{iefolder} also creates master do-files.\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Master\_Do-files}} Master scripts are a key element of code organization and collaboration, and we will discuss some important features soon. With regard to folder structure, it's important to keep in mind @@ -584,3 +586,7 @@ \subsection{Output management} But as you start to export tables and graphs, you'll want to save separate scripts, where \texttt{descriptive\_statistics.do} creates \texttt{descriptive\_statistics.tex}. + +% ---------------------------------------------------------------------------------------------- + +\codeexample{stata-master-dofile.do}{./code/stata-master-dofile.do} From ed4bc239afff4f357586615bffdd44e5e2932461 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 30 Oct 2019 10:48:07 +0530 Subject: [PATCH 085/854] Explain "absolute file path" (#184) --- chapters/planning-data-work.tex | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 45cbe92dd..3ee050279 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -99,6 +99,8 @@ \subsection{Setting up your computer} On Windows, this will be something like ``This PC''. Nearly everything we talk about will assume you are starting from here. Ensure you know how to get the \textbf{absolute file path} for any given file. +Using the absolute file path, starting from the filesystem root, +means that the computer will never accidentally load the wrong file. On MacOS this will be something like \path{/users/username/dropbox/project/...}, and on Windows, \path{C:/users/username/github/project/...}. We will write file paths such as \path{/Dropbox/project-title/DataWork/EncryptedData/} @@ -466,8 +468,8 @@ \subsection{Code management} it seems to be ``enough but not too much'' for most purposes. To bring all these smaller code files together, maintain a master script. 
-A master script is the map of all your project's data work, -a table of contents for the instructions that you code. +A master script is the map of all your project's data work +which serves as a table of contents for the instructions that you code. Anyone should be able to follow and reproduce all your work from raw data to all outputs by simply running this single script. By follow, we mean someone external to the project who has the master script can @@ -475,17 +477,17 @@ \subsection{Code management} (ii) have a general understanding of what is being done at every step, and (iii) see how codes and outputs are related. The master script is also where all the settings are established, -such as folder paths, functions and constants used throughout the project. +such as versions, folder paths, functions, and constants used throughout the project. -\texttt{iefolder} also creates master do-files.\sidenote{ +\texttt{iefolder} creates these as master do-files.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Master\_Do-files}} Master scripts are a key element of code organization and collaboration, and we will discuss some important features soon. With regard to folder structure, it's important to keep in mind that the master script should mimic the structure of the \texttt{/DataWork/} folder. This is done through the creation of globals (in Stata) or string scalars (in R). -These are ``variables'' -- coding shortcuts that refer to subfolders, -so that those folders can be referenced without repeatedly writing out their complete filepaths. +These coding shortcuts can refer to subfolders, +so that those folders can be referenced without repeatedly writing out their absolute filepaths. Because the \texttt{/DataWork/} folder is shared by the whole team, its structure is the same in each team member's computer. What may differ is the path to the project folder (the highest-level shared folder). From 997bf5d7a0241d477ca8a6c2624c44e7ef933728 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 30 Oct 2019 11:08:17 +0530 Subject: [PATCH 086/854] Code and outputs --- chapters/planning-data-work.tex | 94 ++++++++++++++++----------------- 1 file changed, 47 insertions(+), 47 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 3ee050279..44587c066 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -103,7 +103,7 @@ \subsection{Setting up your computer} means that the computer will never accidentally load the wrong file. On MacOS this will be something like \path{/users/username/dropbox/project/...}, and on Windows, \path{C:/users/username/github/project/...}. -We will write file paths such as \path{/Dropbox/project-title/DataWork/EncryptedData/} +We will write file paths such as \path{/Dropbox/project-titleDataWorkEncryptedData/} using forward slashes (\texttt{/}), and mostly use only A-Z, dash (\texttt{-}), and underscore (\texttt{\_}). You should \textit{always} use forward slashes (\texttt{/}) in file paths, just like an internet address, and no matter how your computer writes them, @@ -302,9 +302,9 @@ \subsection{File and folder management} To support this, DIME Analytics created and maintains \texttt{iefolder}\sidenote{ \url{https://dimewiki.worldbank.org/wiki/iefolder}} as a part of our \texttt{ietoolkit} package. 
-This command sets up a pre-standardized folder structure for what we call the \texttt{/DataWork/} folder.\sidenote{ +This command sets up a pre-standardized folder structure for what we call the \texttt{DataWork} folder.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/DataWork_Folder}} -The \texttt{/DataWork/} folder includes folders for all the steps of a typical project. +The \texttt{DataWork} folder includes folders for all the steps of a typical project. Since each project will always have its own needs, we have tried to make it as easy as possible to adapt when that is the case. The main advantage of having a universally standardized folder structure @@ -315,7 +315,7 @@ \subsection{File and folder management} can easily move between projects without having to re-orient themselves to how files and folders are organized. -The \texttt{/DataWork/} folder may be created either inside +The \texttt{DataWork} folder may be created either inside an existing project-based folder structure, or it may be created separately. It should always be created by the leading RA by agreement with the PI. Increasingly, we recommend that you create entire data work folder @@ -368,7 +368,7 @@ \subsection{File and folder management} then all \texttt{baseline-household} data, and then provides unique information about this one. -Setting up the \texttt{/DataWork/} folder folder in a git-managed directory +Setting up the \texttt{DataWork} folder folder in a git-managed directory also enabled you to use Git and GitHub for version control on your code files. A \textbf{version control system} is required to manage changes to any technical file. A good version control system tracks who edited each file and when, @@ -393,7 +393,7 @@ \subsection{File and folder management} However, the DIME structure is not for everyone. If you do not already have a standard file structure across projects, it is intended to be an easy template to start from. -This structure operates by creating a \texttt{/DataWork/} folder at the project level, +This structure operates by creating a \texttt{DataWork} folder at the project level, and within that folder, it provides standardized directory structures for each data source (in the primary data context, ``rounds'' of data collection). For each, \texttt{iefolder} creates folders for raw encrypted data, @@ -483,36 +483,37 @@ \subsection{Code management} \url{https://dimewiki.worldbank.org/wiki/Master\_Do-files}} Master scripts are a key element of code organization and collaboration, and we will discuss some important features soon. -With regard to folder structure, it's important to keep in mind -that the master script should mimic the structure of the \texttt{/DataWork/} folder. +The master script should mimic the structure of the \texttt{DataWork} folder. This is done through the creation of globals (in Stata) or string scalars (in R). These coding shortcuts can refer to subfolders, so that those folders can be referenced without repeatedly writing out their absolute filepaths. -Because the \texttt{/DataWork/} folder is shared by the whole team, +Because the \texttt{DataWork} folder is shared by the whole team, its structure is the same in each team member's computer. -What may differ is the path to the project folder (the highest-level shared folder). +The only difference between machines should be +the path to the project or \texttt{DataWork} folder (the highest-level shared folder). 
This is reflected in the master script in such a way that the only change necessary to run the entire code from a new computer -is to change the path to the project folder. +is to change the path to the project folder to reflect the filesystem and username. The code in \texttt{stata-master-dofile.do} shows how folder structure is reflected in a master do-file. -Agree with your team on a plan to review code as it is written. +Finally, agree with your team on a plan to review code as it is written. Reading other people's code is the best way to improve your coding skills. And having another set of eyes on your code will make you more comfortable with the results you find. It's normal (and common) to make mistakes as you write your code quickly. Reading it again to organize and comment it as you prepare it to be reviewed will help you identify them. Try to have code review scheduled frequently, as you finish writing a piece of code, or complete a small task. If you wait for a long time to have your code review, and it gets too long, -preparing it for code review and reviewing them will require more time and work, +preparation and code review will require more time and work, and that is usually the reason why this step is skipped. -Making sure that the code is running, -and that other people can understand the code is also the easiest way to ensure a smooth project handover. +Making sure that the code is running properly on other machines, +and that other people can read and understand the code easily, +is also the easiest way to ensure a smooth project handover. % ---------------------------------------------------------------------------------------------- \subsection{Output management} -Another task that needs to be discussed with your team is the best way to manage outputs. -A great number of them will be created during the course of a project, +The final task that needs to be discussed with your team is the best way to manage outputs. +A great number of outputs will be created during the course of a project, from raw outputs such as tables and graphs to final products such as presentations, papers and reports. When the first outputs are being created, agree on where to store them, what software to use, and how to keep track of them. @@ -521,20 +522,24 @@ \subsection{Output management} Decisions about storage of final outputs are made easier by technical constraints. As discussed above, Git is a great way to control for different versions of plain text files, and sync software such as Dropbox are better for binary files. -So storing raw outputs in formats like \texttt{.tex} and \texttt{.eps} in Git and -final outputs in PDF, PowerPoint or Word, makes sense. -Storing plain text outputs on Git makes it easier to identify changes that affect results. +Raw outputs in formats like \texttt{.tex} and \texttt{.eps} can be managed with Git, +while final outputs like PDF, PowerPoint, or Word, can be kept in a synced folder. +Storing plaintext outputs on Git makes it easier to identify changes that affect results. If you are re-running all of your code from the master when significant changes to the code are made, the outputs will be overwritten, and changes in coefficients and number of observations, for example, -will be highlighted. +will be highlighted for you to review. +In fact, one of the most effective ways to check code quickly +is simply to commit all your code and outputs, +then re-run the entire thing and examine any flagged changes in the directory. 
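As a hedged example of this workflow, the fragment below writes a regression table and a figure to
plaintext-friendly files in a version-controlled outputs folder; the \texttt{outputs} global is
assumed to be defined in the master script, and \texttt{esttab} is the user-written \texttt{estout}
command, so treat the exact calls as a sketch rather than a prescribed method:

\begin{verbatim}
* Uses Stata's built-in example data so the sketch is self-contained
sysuse auto, clear

* Export a regression table as a .tex fragment in the (assumed) outputs folder
reg price mpg weight
esttab using "${outputs}/price-regression.tex", se replace

* Export the corresponding figure; .eps travels well with version control
scatter price mpg
graph export "${outputs}/price-mpg.eps", replace
\end{verbatim}

Re-running the master script then regenerates these files in place, so any change in the estimates
appears as a tracked difference rather than a silent overwrite.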
% What software to use Though formatted text software such as Word and PowerPoint are still prevalent, -more and more researchers are choosing to write final outputs using -\LaTeX.\sidenote{\url{https://www.latex-project.org}} +researchers are increasinly choosing to prepare even final outputs +like documents and presentations using {\LaTeX}.\sidenote{ + \url{https://www.latex-project.org}} {\LaTeX} is a document preparation system that can create both text documents and presentations. -The main difference between them is that {\LaTeX} uses plain text, -and it's necessary to learn its markup convention to use it. +The main advantage is that {\LaTeX} uses plaintext for all formatting, +and it is necessary to learn its specific markup convention to use it. The main advantage of using {\LaTeX} is that you can write dynamic documents, that import inputs every time they are compiled. This means you can skip the copying and pasting whenever an output is updated. @@ -542,22 +547,21 @@ \subsection{Output management} Creating documents in {\LaTeX} using an integrated writing environment such as TeXstudio is great for outputs that focus mainly on text, but include small chunks of code and static code outputs. -This book, for example, was written in \LaTeX. +This book, for example, was written in {\LaTeX} and managed on GitHub. Another option is to use the statistical software's dynamic document engines. This means you can write both text (in Markdown) and code in the script, -and the result will usually be a PDF or html file including code, -text and code outputs. +and the result will usually be a PDF or \texttt{html} file including code, text, and outputs. Dynamic document tools are better for including large chunks of code and dynamically created graphs and tables, -but formatting can be trickier. -So it's great for creating appendices, -or quick document with results as you work on them, -but not for final papers and reports. +but formatting these can be much trickier and less full-featured than other editors. +So dynamic documents can be great for creating appendices +or quick documents with results as you work on them, +but are not usuall considered for final papers and reports. RMarkdown\sidenote{\url{https://rmarkdown.rstudio.com/}} is the most widely adopted solution in R. There are also different options for Markdown in Stata, -such as German Rodriguez' \texttt{markstat},\sidenote{\url{https://data.princeton.edu/stata/markdown}} +such as \texttt{markstat},\sidenote{\url{https://data.princeton.edu/stata/markdown}} Stata 15 dynamic documents,\sidenote{\url{https://www.stata.com/new-in-stata/markdown/}} -and Ben Jann's \texttt{webdoc}\sidenote{\url{http://repec.sowi.unibe.ch/stata/webdoc/index.html}} and +\texttt{webdoc},\sidenote{\url{http://repec.sowi.unibe.ch/stata/webdoc/index.html}} and \texttt{texdoc}.\sidenote{\url{http://repec.sowi.unibe.ch/stata/texdoc/index.html}} Whichever options you choose, @@ -573,21 +577,17 @@ \subsection{Output management} And anyone who has tried to recreate a graph after a few months probably knows that it can be hard to remember where you saved the code that created it. Here, naming conventions and code organization play a key role in not re-writing scripts again and again. -Use intuitive and descriptive names when you save your code. -It's often desirable to have the names of your outputs and scripts linked, -so, for example, \texttt{merge.do} creates \texttt{merged.dta}. 
-Document output creation in the Master script, -meaning before the line that runs a script there are a few lines of comments listing +It is common for teams to maintain an analyisis file or folder with ``exploratory analysis'', +which are pieces of code that are commented and written only to be found again in the future, +but not cleaned up to be included in any outputs yet. +Once you are happy with a partiular result or output, however, +it should be named and moved to a dedicated location. +It's typically desirable to have the names of outputs and scripts linked, +so, for example, \texttt{factor-analysis.do} creates \texttt{factor-analysis-f1.eps} and so on. +Document output creation in the Master script that runs these files, +so that before the line that runs a particular analysis script there are a few lines of comments listing data sets and functions that are necessary for it to run, as well as all outputs created by that script. -When performing data analysis, -it's ideal to write one script for each output, -as well as linking them through name. -This means you may have a long script with ``exploratory analysis'', -just to document everything you have tried. -But as you start to export tables and graphs, -you'll want to save separate scripts, where -\texttt{descriptive\_statistics.do} creates \texttt{descriptive\_statistics.tex}. % ---------------------------------------------------------------------------------------------- From 5c918b4c3536ca6151fbe9805197ba7ea363656f Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 30 Oct 2019 11:33:36 +0530 Subject: [PATCH 087/854] Intro updates --- chapters/planning-data-work.tex | 42 +++++++++++++++++---------------- 1 file changed, 22 insertions(+), 20 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 44587c066..57b6c1f88 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -5,21 +5,22 @@ and involves planning both the software tools you will use yourself as well as the collaboration platforms and processes for your team. In order to be prepared to work on the data you receive with a group, -you need to know what you are getting into. +you need to plan out the structure of your workflow in advance. This means knowing which data sets and output you need at the end of the process, -how they will stay organized and linked, -what different types and levels of data you'll handle, -and whether the data will require special handling due to volume or privacy considerations. +how they will stay organized, what types of data you'll handle, +and whether the data will require special handling due to size or privacy considerations. Identifying these details creates a \textbf{data map} for your project, giving you and your team a sense of how information resources should be organized. It's okay to update this map once the project is underway -- the point is that everyone knows what the plan is. -Then, you must identify and prepare your collaborative tools and workflow. -Changing software and protocols half-way through a project can be costly and time-consuming, -so it's important to think ahead about decisions that may seem of little consequence -(think: creating a new folder and moving files into it). -Similarly, having a self-documenting discussion platform +To build this plan, you will need to prepare collaborative tools and workflows. 
+Changing software or protocols half-way through a project can be costly and time-consuming, +so it's important to think ahead about decisions that may seem of little consequence. +For example, things as simple as sharing services, folder structures, and file names +can be extremely painful to alter down the line in any project. +Similarly, making sure to set up a self-documenting discussion platform +and version control processes makes working together on outputs much easier from the very first discussion. This chapter will discuss some tools and processes that will help prepare you for collaboration and replication. @@ -29,7 +30,7 @@ meaning you will become most comfortable with each tool only by using it in real-world work. Get to know them well early on, -so that you do not spend a lot of time later figuring out basic functions. +so that you do not spend a lot of time learning through trial and error. \end{fullwidth} % ---------------------------------------------------------------------------------------------- @@ -111,24 +112,25 @@ \subsection{Setting up your computer} Making the structure of your files part of your workflow is really important, as is naming them correctly so you know what is where. -If you are working with others, you will most likely be using some kind -of file collaboration method. -The exact method you use will depend on your tasks, -but three methods are the most common. +If you are working with others, you will most likely be using +some kind of active file sharing software. +The exact providers and combinations you use will depend on your tasks, +but in general, there are three file sharing paradigms that are the most common. \textbf{File syncing} is the most familiar method, and is implemented by software like Dropbox and OneDrive. Sync forces everyone to have the same version of every file at the same time, which makes simultaneous editing difficult but other tasks easier. -(They also have some security concerns which we will address later.) +They also have some security concerns which we will address later. \textbf{Version control} is another method, -and is implemented by tools like GitHub. -Version control allows everyone to have different versions at the same time, -making simultaneous editing easier but other tasks harder. +and is implemented by tools like Git and GitHub. +Version control allows everyone to access different versions at the same time, +making simultaneous editing easier but some other tasks harder. +It is also only optimized for specific types of files. Finally, \textbf{server storage} is the least-used method, because there is only one version of the materials, and simultaneous access must be carefully regulated. -However, server storage ensures that everyone has access -to exactly the same files, and also enables +Server storage ensures that everyone has access +to exactly the same files and environment, and it also enables high-powered computing processes for large and complex data. 
All three methods are used for sharing and collaborating, and you should review the types of data work From cebdbbb29c0afc0d3aeda03b2c0400ffb8dcab6d Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 30 Oct 2019 13:36:47 +0530 Subject: [PATCH 088/854] Edits and reorg file management --- chapters/planning-data-work.tex | 356 ++++++++++++++++++-------------- 1 file changed, 199 insertions(+), 157 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 57b6c1f88..068688c1f 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -1,7 +1,7 @@ % ---------------------------------------------------------------------------------------------- \begin{fullwidth} -Preparation for data work begins long before you collect any data, +Preparation for collaborative data work begins long before you collect any data, and involves planning both the software tools you will use yourself as well as the collaboration platforms and processes for your team. In order to be prepared to work on the data you receive with a group, @@ -14,10 +14,10 @@ It's okay to update this map once the project is underway -- the point is that everyone knows what the plan is. -To build this plan, you will need to prepare collaborative tools and workflows. -Changing software or protocols half-way through a project can be costly and time-consuming, +To implement this plan, you will need to prepare collaborative tools and workflows. +Changing software or protocols halfway through a project can be costly and time-consuming, so it's important to think ahead about decisions that may seem of little consequence. -For example, things as simple as sharing services, folder structures, and file names +For example, things as simple as sharing services, folder structures, and filenames can be extremely painful to alter down the line in any project. Similarly, making sure to set up a self-documenting discussion platform and version control processes @@ -26,7 +26,7 @@ will help prepare you for collaboration and replication. We will try to provide free, open-source, and platform-agnostic tools wherever possible, and point to more detailed instructions when relevant. -However, most have a learning and adaptation process, +Most have a learning and adaptation process, meaning you will become most comfortable with each tool only by using it in real-world work. Get to know them well early on, @@ -39,30 +39,31 @@ \section{Preparing a collaborative work environment} Being comfortable using your computer and having the tools you need in reach is key. -This section provides a brief introduction to key concepts and toolkits -that can help you take on the work you will be primarily responsible for. +This section provides a brief introduction to core concepts and tools +that can help you handle the work you will be primarily responsible for. Some of these skills may seem elementary, but thinking about simple things from a workflow perspective -can help you make marginal improvements every day you work. -These add up, and together form a collaborative workflow +can help you make marginal improvements every day you work +that add up to substantial gains over the course of many projects. +Together, these processes should form a collaborative workflow that will greatly accelerate your team's ability to get tasks done on every project you take on together. -Teams often develop their workflows as they go, +Teams often develop their workflows over time, solving new challenges as they arise. 
-This is broadly okay -- but it is important to recognize -that there are a number of tasks that will always have to be completed during any project, -and that the corresponding workflows can be agreed on in advance. +This is good. But it is important to recognize +that there are a number of tasks that will exist for every project, +and that their corresponding workflows can be agreed on in advance. These include documentation methods, software choices, naming schema, organizing folders and outputs, collaborating on code, managing revisions to files, and reviewing each other's work. These tasks appear in almost every project, -and also translate well between projects. -Therefore, there are large efficiency gains to -thinking about the best way to do these tasks ahead of time, -instead of just doing it quickly as needed. +and their solutions translate well between projects. +Therefore, there are large efficiency gains over time to +thinking in advance about the best way to do these tasks, +instead of throwing together a solution when the task arises. This chapter will outline the main points to discuss within the team, -and suggest some common solutions. +and suggest some common solutions for these tasks. % ---------------------------------------------------------------------------------------------- \subsection{Setting up your computer} @@ -71,115 +72,138 @@ \subsection{Setting up your computer} Make sure you have fully updated the operating system, that it is in good working order, and that you have a \textbf{password-protected} login. -All machines should have hard disk encryption enabled. -Disk encryption is built in to most modern operating systems; + \index{password protection} +All machines should have \textbf{hard disk encryption} enabled. + \index{encryption} +Disk encryption is built-in on most modern operating systems; the service is currently called BitLocker on Windows or FileVault on MacOS. Disk encryption prevents your files from ever being accessed without first entering the system password. This is different from file-level encryption, which makes individual files unreadable without a specific key. -We will address that in more detail later. +(We will address that in more detail later.) As with all critical passwords, your system password should be strong, memorable, and backed up in a separate secure location. Make sure your computer is backed up to prevent information loss. -Follow the \textbf{3-2-1 rule}: maintain 3 copies of everything, + \index{backup} +Follow the \textbf{3-2-1 rule}: maintain 3 copies of all critical data, on at least 2 different hardware devices you have access to, with 1 offsite storage method.\sidenote{ \url{https://www.backblaze.com/blog/the-3-2-1-backup-strategy/}} One reasonable setup is having your primary computer, -then a local hard drive managed with a tool like Time Machine +a local hard drive managed with a tool like Time Machine (alternatively, a fully synced secondary computer), and either a remote copy maintained by a cloud backup service or all original files stored on a remote server. Dropbox and other synced files count only as local copies and never as remote backups, -because other users can alter them. +because other users can alter or delete them. + \index{Dropbox} Find your \textbf{home folder}. It is never your desktop. On MacOS, this will be a folder with your username. On Windows, this will be something like ``This PC''. -Nearly everything we talk about will assume you are starting from here. 
-Ensure you know how to get the \textbf{absolute file path} for any given file. -Using the absolute file path, starting from the filesystem root, +Ensure you know how to get the \textbf{absolute filepath} for any given file. +Using the absolute filepath, starting from the filesystem root, means that the computer will never accidentally load the wrong file. + \index{filepaths} On MacOS this will be something like \path{/users/username/dropbox/project/...}, and on Windows, \path{C:/users/username/github/project/...}. -We will write file paths such as \path{/Dropbox/project-titleDataWorkEncryptedData/} -using forward slashes (\texttt{/}), and mostly use only A-Z, dash (\texttt{-}), and underscore (\texttt{\_}). -You should \textit{always} use forward slashes (\texttt{/}) in file paths, -just like an internet address, and no matter how your computer writes them, -because the other type will cause your work to break many systems. -Making the structure of your files part of your workflow is really important, -as is naming them correctly so you know what is where. - -If you are working with others, you will most likely be using -some kind of active file sharing software. -The exact providers and combinations you use will depend on your tasks, +We will write filepaths such as \path{/Dropbox/project-titleDataWorkEncryptedData/}, +assuming the ``Dropbox'' folder lives inside your home folder. +Filepaths will use forward slashes (\texttt{/}) to indicate folders, +and typically use only A-Z (the 26 English characters), +dashes (\texttt{-}), and underscores (\texttt{\_}) in folder names and filenames. +You should \textit{always} use forward slashes (\texttt{/}) in filepaths in code, +just like an internet address, and no matter how your computer provides them, +because the other type will cause your code to break on many systems. +Making the structure of your directories a core part of your workflow is very important, +since otherwise you will not be able to reliably transfer the instructions +for replicating or carrying out your analytical work. + +When you are working with others, you will most likely be using +some kind of \textbf{file sharing} software. + \index{file sharing} +The exact services you use will depend on your tasks, but in general, there are three file sharing paradigms that are the most common. \textbf{File syncing} is the most familiar method, and is implemented by software like Dropbox and OneDrive. + \index{file syncing} Sync forces everyone to have the same version of every file at the same time, which makes simultaneous editing difficult but other tasks easier. They also have some security concerns which we will address later. \textbf{Version control} is another method, -and is implemented by tools like Git and GitHub. -Version control allows everyone to access different versions at the same time, +commonly implemented by tools like Git and GitHub. + \index{version control} +Version control allows everyone to access different versions of files at the same time, making simultaneous editing easier but some other tasks harder. It is also only optimized for specific types of files. -Finally, \textbf{server storage} is the least-used method, +Finally, \textbf{server storage} is the least-common method, because there is only one version of the materials, and simultaneous access must be carefully regulated. 
+	\index{server storage}
Server storage ensures that everyone has access
to exactly the same files and environment, and it also enables
high-powered computing processes for large and complex data.
-All three methods are used for sharing and collaborating,
+All three file sharing methods are used for collaborative workflows,
and you should review the types of data work
-that you are going to be doing, and plan which processes
-will live in which types of locations.
+that you are going to be doing, and plan which types of files
+will live in which types of sharing services.
+It is important to note that they are, in general, not interoperable:
+you cannot have version-controlled files inside a syncing service,
+or vice versa, without setting up complex workarounds,
+and you cannot shift files between them without losing historical information.
+Therefore, choosing the correct sharing service at the outset is essential.

% ----------------------------------------------------------------------------------------------
\subsection{Documenting decisions and tasks}

-Once your technical workspace is set up,
+Once your technical and sharing workspace is set up,
you need to decide how you are going to communicate with your team.
-The first habit that many teams need to break is using e-mail for management tasks.
-E-mail is, simply put, not a system. It is not a system for anything.
-E-mail was developed for communicating ``now'' and this is what it does well.
-It is not structured to manage group membership or to present the same information
+The first habit that many teams need to break
+is using instant communication for management and documentation.
+Email is, simply put, not a system. It is not a system for anything. Neither is WhatsApp.
+	\index{email} \index{WhatsApp}
+These tools are developed for communicating ``now'' and this is what they do well.
+These tools are not structured to manage group membership or to present the same information
across a group of people, or to remind you when old information becomes relevant.
-It is not structured to allow people to collaborate over a long time or to review old discussions.
+They are not structured to allow people to collaborate over a long time or to review old discussions.
It is therefore easy to miss or lose communications from the past when they have relevance in the present.
Everything that is communicated over e-mail or any other instant medium should
immediately be transferred into a system that is designed to keep records.
We call these systems collaboration tools,
and there are several that are very useful.\sidenote{
\url{https://dimewiki.worldbank.org/wiki/Collaboration_Tools}}
+	\index{collaboration tools}

-Many task management tools are online or web-based,
-so that everyone on your team can access them simultaneously.
-Many of them are based on an underlying system known as ``Kanban boards''.\sidenote{
+Many collaboration tools are web-based
+so that everyone on your team can access them simultaneously
+and have live discussions about tasks and processes.
+Many are based on an underlying system known as ``Kanban''.\sidenote{
\url{https://en.wikipedia.org/wiki/Kanban_board}}
This task-oriented system allows the team to create and assign tasks,
-to track progress across time, and to quickly see the project state.
-These systems also link communications to specific tasks so that
-the records related to decision making on those tasks is permanently recorded.
-A common and free implementation of this system is the one found in GitHub project boards.
-You may also use a system like GitHub Issues or task assignment on Dropbox Paper,
-which has a more chronological structure, if this is appropriate to your project.
-What is important is that you have a system and you stick to it,
+carry out discussions related to single tasks,
+track task progress across time, and quickly see the overall project state.
+These systems therefore link communication to specific tasks so that
+the records related to decision making on those tasks are permanently recorded
+and easy to find in the future when questions about that task come up.
+One popular and free implementation of this system is the one found in GitHub project boards.
+Other systems which offer similar features (but are not explicitly Kanban-based)
+are GitHub Issues and Dropbox Paper, the latter of which has a more chronological structure.
+What is important is that your team chooses its system and sticks to it,
so that decisions, discussions, and tasks are easily reviewable long after they are completed.
Just like we use different file sharing tools for different types of files,
-we use different collaboration tools for different types of tasks.
+we can use different collaboration tools for different types of tasks.
Our team, for example, uses GitHub Issues for code-related tasks,
and Dropbox Paper for more managerial and office-related tasks.
GitHub creates incentives for writing down why changes were made
-as they are saved, creating naturally documented code.
+in response to specific discussions
+as they are completed, creating naturally documented code.
It is useful also because tasks in Issues can clearly be tied to file versions.
-Thus, GitHub serves as a great tool for managing code-related tasks.
-On the other hand, Dropbox Paper provides a good interface with notifications,
+On the other hand, Dropbox Paper provides a clean interface with task notifications,
and is very intuitive for people with non-technical backgrounds.
It is useful because tasks can be easily linked to other documents saved in Dropbox.
-Thus, it is a great tool for managing non-code-related tasks.
+Therefore, it is a better tool for managing non-code-related tasks.
Neither of these tools requires much technical knowledge;
they merely require an agreement and workflow design so that
the people assigning the tasks are sure to set them up in the appropriate system.
@@ -187,7 +211,8 @@ \subsection{Documenting decisions and tasks}

% ----------------------------------------------------------------------------------------------
\subsection{Choosing software}

-Choosing the right personal and team working environment can also make your work easier.
+Choosing the right working environments can make your work significantly easier.
+	\index{software environments}
It may be difficult or costly to switch halfway through a project,
so think ahead about the different software to be used.
Take into account the different levels of technical skill among team members,
@@ -198,43 +223,51 @@ \subsection{Choosing software}
particularly if you are trying to sync or collaborate on them.
Also consider the cost of licenses, the time to learn new tools,
and the stability of the tools.
-There are few strictly right or wrong answers,
+There are few strictly right or wrong choices for software,
but what is important is that you have a plan in advance
-and understand how your tools with interact with your work.
Ultimately, the goal is to ensure that you will be able to hold -your code environment constant over the life cycle of a single project. +your code environment constant over the lifecycle of a single project. While this means you will inevitably have different projects with different code environments, each one will be better than the last, and you will avoid the extremely costly process of migrating a project -into a new code enviroment. +into a new code enviroment while it is still ongoing. This can be set up down to the software level: -you need to ensure that even specific versions of software +you should ensure that even specific versions of software and the individual packages you use are referenced or maintained so that they can be reproduced going forward -even if their most recent version contains changes that would break your code. +even if their most recent releases contain changes that would break your code. + \index{software versions} (For example, our command \texttt{ieboilstart} in the \texttt{ietoolkit} package provides functionality to support Stata version stability.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/ieboilstart}}) -Next, think about how and where you write code. -The rest of this book focuses mainly on primary survey data, -so we are going to broadly assume that you are using ``small'' data -in one of the two popular desktop-based packages for that kind of work: R or Stata. +Next, think about how and where you write and execute code. +This book focuses mainly on primary survey data, +so we are going to broadly assume that you are using ``small'' datasets +in one of the two most popular desktop-based packages: R or Stata. (If you are using another language, like Python, or working with big data projects on a server installation, -you can skip this section.) -The most important part of working with code is a code editor. -This does not need to be the same program as the code runs in. -This can be preferable since your editor will not crash if your code does, +many of the same principles apply but the specifics will be different.) +The most visible part of working with code is a code editor, +since most of your time will be spent writing and re-writing your code. +This does not need to be the same program as the code runs in, +and the various members of your team do not need to use the same editor. +Using an external editor can be preferable since your editor will not crash if your code does, and may offer additional features aimed at writing code well. If you are working in R, \textbf{RStudio} is the typical choice.\sidenote{ \url{https://www.rstudio.com}} For Stata, the built-in do-file editor is the most widely adopted code editor, -and \textbf{Atom}\sidenote{\url{https://atom.io}} and \textbf{Sublime}\sidenote{\url{https://www.sublimetext.com/}} can also be configured to run Stata code externally, while offering great accessibility features. -For example, these tools can work on an entire directory -- rather than a single file -- +but \textbf{Atom}\sidenote{\url{https://atom.io}} and +\textbf{Sublime}\sidenote{\url{https://www.sublimetext.com/}} +can also be configured to run Stata code externally, +while offering great code accessibility and quality features. +(We recommend setting up and becoming comfortable with one of these.) 
+For example, these editors can access an entire directory -- rather than a single file -- which gives you access to directory views and file management actions, -such as folder management, Git integration, and simultaneous work with other types of files without leaving the editor. +such as folder management, Git integration, +and simultaneous work with other types of files, without leaving the editor. In our field of development economics, Stata is by far the most commonly used programming language, @@ -281,12 +314,12 @@ \section{Organizing code and data} and easily findable at any point in time. Maintaining an organized file structure for data work is the best way to ensure that you, your teammates, and others -are able to easily work on, edit, and replicate your work in the future. -It also ensures that core automation processes like script tools -are able to interact well will your work, +are able to easily advance, edit, and replicate your work in the future. +It also ensures that automated processes from code and scripting tools +are able to interact well with your work, whether they are yours or those of others. File organization makes your own work easier as well as more transparent, -and plays well with tools like version control systems +and interacts well with tools like version control systems that aim to cut down on the amount of repeated tasks you have to perform. It is worth thinking in advance about how to store, name, and organize the different types of files you will be working with, @@ -296,17 +329,20 @@ \section{Organizing code and data} % ---------------------------------------------------------------------------------------------- \subsection{File and folder management} -Agree with your team on a specific directory structure, and -set it up at the beginning of the research project -in your synced or shared top-level folder. -This will prevent folder re-organization that may slow down your workflow and, -more importantly, ensure your code files are always able to run on any machine. -To support this, DIME Analytics created and maintains \texttt{iefolder}\sidenote{ +Agree with your team on a specific directory structure, +and set it up at the beginning of the research project +in your top-level shared folder (the one over which you can control access permissions). +This will prevent future folder reorganizations that may slow down your workflow and, +more importantly, ensure that your code files are always able to run on any machine. +To support consistent folder organization, DIME Analytics maintains \texttt{iefolder}\sidenote{ \url{https://dimewiki.worldbank.org/wiki/iefolder}} as a part of our \texttt{ietoolkit} package. -This command sets up a pre-standardized folder structure for what we call the \texttt{DataWork} folder.\sidenote{ + \index{\texttt{iefolder}} \index{\texttt{ietoolkit}} +This Stata command sets up a pre-standardized folder structure +for what we call the \texttt{DataWork} folder.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/DataWork_Folder}} The \texttt{DataWork} folder includes folders for all the steps of a typical project. + \index{\texttt{DataWork} folder} Since each project will always have its own needs, we have tried to make it as easy as possible to adapt when that is the case. The main advantage of having a universally standardized folder structure @@ -314,53 +350,99 @@ \subsection{File and folder management} time to get acquainted with a new organization scheme. 
For our group, maintaining a single unified directory structure across the entire portfolio of projects means that everyone -can easily move between projects without having to re-orient +can easily move between projects without having to reorient themselves to how files and folders are organized. +The DIME file structure is not for everyone. +But if you do not already have a standard file structure across projects, +it is intended to be an easy template to start from. +This system operates by creating a \texttt{DataWork} folder at the project level, +and within that folder, it provides standardized directory structures +for each data source (in the primary data context, ``rounds'' of data collection). +For each, \texttt{iefolder} creates folders for raw encrypted data, +raw deidentified data, cleaned data, final data, outputs, and documentation. +In parallel, it creates folders for the code files +that move the data through this progression, +and for the files that manage final analytical work. +The command also has some flexibility for the addition of +folders for non-primary data sources, although this is less well developed. +The package also includes the \texttt{iegitaddmd} command, +which can place a \texttt{README.md} file in each of these folders. +These \textbf{Markdown} files provide an easy and git-compatible way +to document the contents of every folder in the structure. + \index{Markdown} + The \texttt{DataWork} folder may be created either inside an existing project-based folder structure, or it may be created separately. It should always be created by the leading RA by agreement with the PI. -Increasingly, we recommend that you create entire data work folder -separately in a git-managed folder, and reserve the project folder -for tasks related to data collection and other project management work. -The project folder can be maintained in a synced location like Dropbox, -and the code folder can be maintained in a version-controlled location like GitHub. -(A version-controlled folder can \textit{never} be stored inside a synced folder, +Increasingly, our recommendation is to create the \texttt{DataWork} folder +separately from the project management materials, +reserving the ``project folder'' for data collection and other management work. + \index{project folder} +This is so the project folder can be maintained in a synced location like Dropbox, +while the code folder can be maintained in a version-controlled location like GitHub. +(Remember, a version-controlled folder can \textit{never} be stored inside a synced folder, because the versioning features are extremely disruptive to others -when the syncing utility operates on them.) -Nearly all code and raw outputs (not datasets) are better managed this way. +when the syncing utility operates on them, and vice versa.) +Nearly all code files and raw outputs (not datasets) are best managed this way. This is because code files are usually \textbf{plaintext} files, and non-technical files are usually \textbf{binary} files. + \index{plaintext} \index{binary files} It's also becoming more and more common for written outputs such as reports, presentations and documentations to be written using plaintext tools such as {\LaTeX} and dynamic documents. + \index{{\LaTeX}} \index{dynamic documents} Keeping such plaintext files in a version-controlled folder allows you to maintain better control of their history and functionality. 
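To make the folder setup described above concrete, here is a minimal sketch of the \texttt{ietoolkit} commands mentioned in this section. The project path and round name are placeholders, and the exact options may differ between package versions, so check the DIME Wiki entries cited above before running anything like this:

    * A sketch only: the path, round name, and options are illustrative placeholders
    ssc install ietoolkit

    * Create the standardized DataWork folder inside the shared project folder
    iefolder new project , projectfolder("C:/Users/username/Dropbox/project-title")

    * Add the standardized subfolders for one round of data collection
    iefolder new round baseline , projectfolder("C:/Users/username/Dropbox/project-title")

    * Place README.md placeholders so each folder's contents can be documented
    iegitaddmd , folder("C:/Users/username/Dropbox/project-title/DataWork")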
-Because of the specificity with which code files depends on file structure,
-you will be able to enforce better practices there than in the project folder,
+Because of the high degree to which code files depend on file structure,
+you will be able to enforce better practices in a separate folder than in the project folder,
which will usually be managed by a PI, FC, or field team members.

-Once the directory structure is set up,
+Setting up the \texttt{DataWork} folder in a version-controlled directory
+also enables you to use Git and GitHub for version control on your code files.
+A \textbf{version control system} is required to manage changes to any technical file.
+A good version control system tracks who edited each file and when,
+and additionally provides a protocol for ensuring that conflicting versions are avoided.
+This is important, for example, for your team
+to be able to find the version of a presentation that you delivered to a donor,
+or to understand why the significance level of your estimates has changed.
+Everyone who has ever encountered a file named something like \texttt{final\_report\_v5\_LJK\_KLE\_jun15.docx}
+can appreciate how useful such a system can be.
+Most syncing services offer some kind of rudimentary version control;
+these are usually enough to manage changes to binary files (such as Word and PowerPoint documents)
+without needing to rely on dreaded filename-based versioning conventions.
+For technical files, however, a more detailed version control system is usually desirable.
+We recommend using Git\sidenote{
+	\textbf{Git:} a multi-user version control system for collaborating on and tracking changes to code as it is written.}
+for all plaintext files.
+Git tracks all the changes you make to your code,
+and allows you to go back to previous versions without losing the information on changes made.
+It also makes it possible to work on multiple parallel versions of the code,
+so you don't risk breaking the code for other team members as you try something new.
+The DIME file management and organization approach is designed with this in mind.
+
+Once the \texttt{DataWork} folder's directory structure is set up,
you should adopt a file naming convention.
-You will be working with two types of files:
+You will generally be working with two types of files:
``technical'' files, which are those that are accessed by code processes,
and ``non-technical'' files, which will not be accessed by code processes.
The former takes precedence: an Excel file is a technical file even if it is a field log,
because at some point it will be used by code.
We will not give much emphasis to non-technical files here;
-but you should make sure to name them
-in an orderly fashion that works for your team.
-You can use spaces and datestamps in names of non-technical files, but not technical ones:
-the former might have a name like \texttt{2019-10-30 Sampling Procedure Description.docx}
-while the latter might have a name like \texttt{endline-sampling.do}.
-This will ensure you can find files within folders
+but you should make sure to name them in an orderly fashion that works for your team.
+These rules will ensure you can find files within folders
and reduce the amount of time others will spend opening files to find out what is inside them.
-Technical files, however, have stricter requirements.\sidenote{
+Some of the differences between the two file types are major and may be new to you.
For example, +you can use spaces and datestamps in names of non-technical files, but not technical ones: +the former might have a name like \texttt{2019-10-30 Sampling Procedure Description.docx} +while the latter might have a name like \texttt{endline-sampling.do}. +Technical files have stricter requirements than non-technical ones.\sidenote{ \url{http://www2.stat.duke.edu/~rcs46/lectures_2015/01-markdown-git/slides/naming-slides/naming-slides.pdf}} For example, you should never use spaces in technical names; this can cause problems in code. (This includes all folder names.) -One practice that takes some getting used to +Similarly, technical files should never include capital letters. +One organizational practice that takes some getting used to is the fact that the best names from a coding perspective are usually the opposite of those from an English perspective. For example, for a deidentified household dataset from the baseline round, @@ -368,47 +450,7 @@ \subsection{File and folder management} rather than the opposite way around as occurs in natural language. This ensures that all \texttt{baseline} data stays together, then all \texttt{baseline-household} data, -and then provides unique information about this one. - -Setting up the \texttt{DataWork} folder folder in a git-managed directory -also enabled you to use Git and GitHub for version control on your code files. -A \textbf{version control system} is required to manage changes to any technical file. -A good version control system tracks who edited each file and when, -and additionally providers a protocol for ensuring that conflicting versions are avoided. -This is important, for example, for your team to be able to find the version of a presentation that you delivered to a donor, -and also to understand why the significance level of your estimates has changed. -Everyone who has ever encountered a file named something like \texttt{final\_report\_v5\_LJK\_KLE\_jun15.docx} -can appreciate how useful such a system can be. -Most file-syncing solutions offer some kind of version control; -These are usually enough to manage changes to binary files (such as Word and PowerPoint documents) -without needing to rely on these dreaded filename-based versioning conventions. -For technical files, however, a more complex version control system is usually desirable. -We recommend using Git\sidenote{ - \textbf{Git:} a multi-user version control system for collaborating on and tracking changes to code as it is written.} -for all plaintext files. -Git tracks all the changes you make to your code, -and allows you to go back to previous versions without losing the information on changes made. -It also makes it possible to work on two parallel versions of the code, -so you don't risk breaking the code for other team members as you try something new. -The DIME approach is designed with this in mind. - -However, the DIME structure is not for everyone. -If you do not already have a standard file structure across projects, -it is intended to be an easy template to start from. -This structure operates by creating a \texttt{DataWork} folder at the project level, -and within that folder, it provides standardized directory structures -for each data source (in the primary data context, ``rounds'' of data collection). -For each, \texttt{iefolder} creates folders for raw encrypted data, -raw deidentified data, cleaned data, final data, outputs, and documentation. 
-In parallel, it creates folders for the code files -that move the data through this progression, -and for the files that manage final analytical work. -The command also has some flexibility for the addition of -folders for non-primary data sources, although this is less well developed. -The package also includes the \texttt{iegitaddmd} command, -which can place a \texttt{README.md} file in each of these folders. -These \textbf{Markdown} files provide an easy and GitHub-compatible way -to document the contents of every folder in the structure. +and finally provides unique information about this specific file. % ---------------------------------------------------------------------------------------------- \subsection{Code management} From c1600e829426d02288e0cb632fabbe5d46db6a62 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 30 Oct 2019 13:56:43 +0530 Subject: [PATCH 089/854] Edits and reorg output management --- chapters/planning-data-work.tex | 157 ++++++++++++++++++-------------- 1 file changed, 87 insertions(+), 70 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 068688c1f..6de1e2f87 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -108,7 +108,7 @@ \subsection{Setting up your computer} means that the computer will never accidentally load the wrong file. \index{filepaths} On MacOS this will be something like \path{/users/username/dropbox/project/...}, -and on Windows, \path{C:/users/username/github/project/...}. +and on Windows, \path{C:/users/username/Github/project/...}. We will write filepaths such as \path{/Dropbox/project-titleDataWorkEncryptedData/}, assuming the ``Dropbox'' folder lives inside your home folder. Filepaths will use forward slashes (\texttt{/}) to indicate folders, @@ -327,7 +327,7 @@ \section{Organizing code and data} and everyone has interoperable expectations. % ---------------------------------------------------------------------------------------------- -\subsection{File and folder management} +\subsection{Organizing files and folder structures} Agree with your team on a specific directory structure, and set it up at the beginning of the research project @@ -366,9 +366,9 @@ \subsection{File and folder management} and for the files that manage final analytical work. The command also has some flexibility for the addition of folders for non-primary data sources, although this is less well developed. -The package also includes the \texttt{iegitaddmd} command, +The package also includes the \texttt{ieGitaddmd} command, which can place a \texttt{README.md} file in each of these folders. -These \textbf{Markdown} files provide an easy and git-compatible way +These \textbf{Markdown} files provide an easy and Git-compatible way to document the contents of every folder in the structure. \index{Markdown} @@ -438,7 +438,7 @@ \subsection{File and folder management} the former might have a name like \texttt{2019-10-30 Sampling Procedure Description.docx} while the latter might have a name like \texttt{endline-sampling.do}. Technical files have stricter requirements than non-technical ones.\sidenote{ - \url{http://www2.stat.duke.edu/~rcs46/lectures_2015/01-markdown-git/slides/naming-slides/naming-slides.pdf}} + \url{http://www2.stat.duke.edu/~rcs46/lectures_2015/01-markdown-Git/slides/naming-slides/naming-slides.pdf}} For example, you should never use spaces in technical names; this can cause problems in code. (This includes all folder names.) 
Similarly, technical files should never include capital letters. @@ -453,7 +453,7 @@ \subsection{File and folder management} and finally provides unique information about this specific file. % ---------------------------------------------------------------------------------------------- -\subsection{Code management} +\subsection{Documenting and organizing code} Once you start a project's data work, the number of scripts, datasets, and outputs that you have to manage will grow very quickly. @@ -466,42 +466,48 @@ \subsection{Code management} not just a means to an end, and code should be written thinking of how easy it will be for someone to read it later. -Code documentation is one of the main factors that contribute to readability, -if not the main one. -There are two types of comments that should be included in code. -The first one describes what is being done. -This should be easy to understand from the code itself if you know the language well enough and the code is clear. -But writing plain English (or whichever language you communicate with your team on) -will make it easier for everyone to read. -The second type of comment is what differentiates commented code from well-commented code: -it explains why the code is performing a task in a particular way. -As you are writing code, you are making a series of decisions that -(hopefully) make perfect sense to you at the time. -However, you will probably not remember how they were made in a couple of weeks. -So write them down in your code. -There are other ways to document decisions -(GitHub offers a lot of different documentation options, for example), -but information that is relevant to understand the code should always be written in the code itself. - -Code organization is the next level. -Start by adding a code header. -This should include simple things such as stating the purpose of the script and the name of the person who wrote it. +Code documentation is one of the main factors that contribute to readability. +Start by adding a code header to every file. +This should include simple things such as the purpose of the script and the name of the person who wrote it. If you are using a version control software, the last time a modification was made and the person who made it will be recorded by that software. Otherwise, you should include it in the header. -Finally, and more importantly, use it to track the inputs and outputs of the script. -When you are trying to track down which code creates a data set, this will be very helpful. - -Breaking your code into readable steps is also good practice on code organization. +Finally, use the header to track the inputs and outputs of the script. +When you are trying to track down which code creates which data set, this will be very helpful. +While there are other ways to document decisions related to creating code +(GitHub offers a lot of different documentation options, for example), +the information that is relevant to understand the code should always be written in the code file. + +Mixed among the code itself, are two types of comments that should be included. +The first type of comment describes what is being done. +This might be easy to understand from the code itself +if you know the language well enough and the code is clear, +but often it is still a great deal of work to reverse-engineer the code's intent. +Writing the task in plain English (or whichever language you communicate with your team on) +will make it easier for everyone to read and understand the code's purpose. 
+The second type of comment explains why the code is performing a task in a particular way. +As you are writing code, you are making a series of decisions that +(hopefully) make perfect sense to you at the time. +These are often highly specialized and may exploit functionality +that is not obvious or has not been seen by others before. +Even you will probably not remember the exact choices that were made in a couple of weeks. +Therefore, you must document your precise processes in your code. + +Code organization means keeping each piece of code in an easily findable location. + \index{code organization} +Breaking your code into independently readable ``chunks'' is one good practice on code organization, +because it ensures each component does not depend on a complex program state +created by other chunks that are not obvious from the immediate context. One way to do this is to create sections where a specific task is completed. So, for example, if you want to find the line in your code where a variable was created, you can go straight to \texttt{PART 2: Create new variables}, -instead of reading line by line of the code. -RStudio makes it very easy to create sections, and compiles them into an interactive script index. +instead of reading line by line through the entire code. +RStudio, for example makes it very easy to create sections, +and it compiles them into an interactive script index for you. In Stata, you can use comments to create section headers, -though they're just there to make the reading easier. -Adding a code index to the header by copying and pasting section titles is the easiest way to create a code map. -You can then add and navigate through them using the find command. +though they're just there to make the reading easier and don't have functionality. +Adding an index to the header by copying and pasting section titles is the easiest way to create a code map. +You can then add and navigate through them using the \texttt{find} command. Since Stata code is harder to navigate, as you will need to scroll through the document, it's particularly important to avoid writing very long scripts. Therefore, in Stata at least, you should also consider breaking code tasks down @@ -511,7 +517,8 @@ \subsection{Code management} This is an arbitrary limit, just like the standard restriction of each line to 80 characters: it seems to be ``enough but not too much'' for most purposes. -To bring all these smaller code files together, maintain a master script. +To bring all these smaller code files together, you must maintain a master script. + \index{master do-file} A master script is the map of all your project's data work which serves as a table of contents for the instructions that you code. Anyone should be able to follow and reproduce all your work from @@ -540,54 +547,81 @@ \subsection{Code management} is to change the path to the project folder to reflect the filesystem and username. The code in \texttt{stata-master-dofile.do} shows how folder structure is reflected in a master do-file. -Finally, agree with your team on a plan to review code as it is written. +In order to maintain these practices and ensure they are functioning well, +you should agree with your team on a plan to review code as it is written. + \index{code review} Reading other people's code is the best way to improve your coding skills. And having another set of eyes on your code will make you more comfortable with the results you find. -It's normal (and common) to make mistakes as you write your code quickly. 
+It's normal (and common) to make mistakes as you write your code.
Reading it again to organize and comment it
as you prepare it to be reviewed will help you identify them.
-Try to have code review scheduled frequently, as you finish writing a piece of code, or complete a small task.
-If you wait for a long time to have your code review, and it gets too long,
+Try to have a code review scheduled frequently,
+every time you finish writing a piece of code or complete a small task.
+If you wait for a long time to have your code review, and it gets too complex,
preparation and code review will require more time and work,
and that is usually the reason why this step is skipped.
Making sure that the code is running properly on other machines,
and that other people can read and understand the code easily,
-is also the easiest way to ensure a smooth project handover.
+is also the easiest way to be prepared in advance for a smooth project handover.

% ----------------------------------------------------------------------------------------------
\subsection{Output management}

-The final task that needs to be discussed with your team is the best way to manage outputs.
+The final task that needs to be discussed with your team is the best way to manage output files.
A great number of outputs will be created during the course of a project,
-from raw outputs such as tables and graphs to final products such as presentations, papers and reports.
+and these will include both raw outputs such as tables and graphs
+and final products such as presentations, papers and reports.
When the first outputs are being created, agree on where to store them,
-what software to use, and how to keep track of them.
+what software and formats to use, and how to keep track of them.

% Where to store outputs
-Decisions about storage of final outputs are made easier by technical constraints.
-As discussed above, Git is a great way to control for different versions of
-plain text files, and sync software such as Dropbox are better for binary files.
-Raw outputs in formats like \texttt{.tex} and \texttt{.eps} can be managed with Git,
-while final outputs like PDF, PowerPoint, or Word, can be kept in a synced folder.
+Decisions about storage of outputs are made easier by technical constraints.
+As discussed above, version control systems like Git are a great way to manage
+plaintext files, and sync services such as Dropbox are better for binary files.
+Outputs will similarly come in these two formats, depending on your software.
+Binary outputs like Excel files, PDFs, PowerPoints, or Word documents can be kept in a synced folder.
+Raw outputs in plaintext formats like \texttt{.tex} and \texttt{.eps}
+can be created from most analytical software and managed with Git.
Storing plaintext outputs on Git makes it easier to identify changes that affect results.
If you are re-running all of your code from the master when significant changes to the code are made,
the outputs will be overwritten,
and changes in coefficients and number of observations, for example,
will be highlighted for you to review.
In fact, one of the most effective ways to check code quickly
-is simply to commit all your code and outputs,
+is simply to commit all your code and outputs using Git,
then re-run the entire thing and examine any flagged changes in the directory.
+No matter what choices you make,
+you will need to make updates to your outputs quite frequently.
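As a brief, hedged illustration of the plaintext-output workflow described above, the following sketch uses Stata's built-in \texttt{auto} dataset; the \texttt{outputs} global is assumed to be defined in your master script, and \texttt{esttab} comes from the user-written \texttt{estout} package:

    * A sketch of producing Git-friendly plaintext outputs (paths are placeholders)
    ssc install estout
    sysuse auto, clear

    * Export a figure as a vector .eps file that Git can track line by line
    scatter price mpg
    graph export "${outputs}/price-mpg.eps", replace

    * Export a regression table as a .tex fragment for a dynamic document
    regress price mpg weight
    esttab using "${outputs}/price-regression.tex", se label replace

Committing these files alongside the code that produced them makes any later change in the results immediately visible in the version history.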
+And anyone who has tried to recreate a graph after a few months probably knows
+that it can be hard to remember where you saved the code that created it.
+Here, naming conventions and code organization play a key role
+in not re-writing scripts again and again.
+It is common for teams to maintain one analysis file or folder with ``exploratory analysis'',
+which are pieces of code that are stored only to be found again in the future,
+but not cleaned up to be included in any outputs yet.
+Once you are happy with a particular result or output,
+it should be named and moved to a dedicated location.
+It's typically desirable to have the names of outputs and scripts linked,
+so, for example, \texttt{factor-analysis.do} creates \texttt{factor-analysis-f1.eps} and so on.
+Document output creation in the Master script that runs these files,
+so that before the line that runs a particular analysis script
+there are a few lines of comments listing
+data sets and functions that are necessary for it to run,
+as well as all outputs created by that script.
+
% What software to use
+Compiling the raw outputs from your statistical software into useful formats
+is the final step in producing research outputs for public consumption.
Though formatted text software such as Word and PowerPoint are still prevalent,
-researchers are increasinly choosing to prepare even final outputs
+researchers are increasingly choosing to prepare final outputs
like documents and presentations using {\LaTeX}.\sidenote{
-	\url{https://www.latex-project.org}}
+	\url{https://www.latex-project.org}} \index{{\LaTeX}}
{\LaTeX} is a document preparation system that can create both text documents and presentations.
It is distinctive in that {\LaTeX} uses plaintext for all formatting,
and it is necessary to learn its specific markup convention to use it.
The main advantage of using {\LaTeX} is that you can write dynamic documents
that import inputs every time they are compiled.
This means you can skip the copying and pasting whenever an output is updated.
-Because it's written in plain text, it's also easier to control and document changes using Git.
+Because it's written in plaintext, it's also easier to control and document changes using Git.
Creating documents in {\LaTeX} using an integrated writing environment
such as TeXstudio is great for outputs that focus mainly on text,
but include small chunks of code and static code outputs.
@@ -615,23 +649,6 @@ \subsection{Output management}
keep in mind that learning how to use a new tool may require some time investment
upfront that will be paid off as your project advances.

-% Keeping track of outputs
-Finally, no matter what choices you make regarding software and folder organization,
-you will need to make changes to your outputs quite frequently.
-And anyone who has tried to recreate a graph after a few months probably knows
-that it can be hard to remember where you saved the code that created it.
-Here, naming conventions and code organization play a key role in not re-writing scripts again and again.
-It is common for teams to maintain an analyisis file or folder with ``exploratory analysis'',
-which are pieces of code that are commented and written only to be found again in the future,
-but not cleaned up to be included in any outputs yet.
-Once you are happy with a partiular result or output, however,
-it should be named and moved to a dedicated location.
-It's typically desirable to have the names of outputs and scripts linked, -so, for example, \texttt{factor-analysis.do} creates \texttt{factor-analysis-f1.eps} and so on. -Document output creation in the Master script that runs these files, -so that before the line that runs a particular analysis script there are a few lines of comments listing -data sets and functions that are necessary for it to run, -as well as all outputs created by that script. % ---------------------------------------------------------------------------------------------- From 992db2df091f66cfed49ad47c67ecdb02ca47074 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 30 Oct 2019 13:57:59 +0530 Subject: [PATCH 090/854] Git language --- chapters/planning-data-work.tex | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 6de1e2f87..813649f91 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -366,7 +366,7 @@ \subsection{Organizing files and folder structures} and for the files that manage final analytical work. The command also has some flexibility for the addition of folders for non-primary data sources, although this is less well developed. -The package also includes the \texttt{ieGitaddmd} command, +The package also includes the \texttt{iegitaddmd} command, which can place a \texttt{README.md} file in each of these folders. These \textbf{Markdown} files provide an easy and Git-compatible way to document the contents of every folder in the structure. @@ -387,11 +387,11 @@ \subsection{Organizing files and folder structures} Nearly all code files and raw outputs (not datasets) are best managed this way. This is because code files are usually \textbf{plaintext} files, and non-technical files are usually \textbf{binary} files. - \index{plaintext} \index{binary files} + \index{plaintext}\index{binary files} It's also becoming more and more common for written outputs such as reports, presentations and documentations to be written using plaintext tools such as {\LaTeX} and dynamic documents. - \index{{\LaTeX}} \index{dynamic documents} + \index{{\LaTeX}}\index{dynamic documents} Keeping such plaintext files in a version-controlled folder allows you to maintain better control of their history and functionality. Because of the high degree with which code files depend on file structure, @@ -581,7 +581,7 @@ \subsection{Output management} Binary outputs like Excel files, PDFs, PowerPoints, or Word documents can be kept in a synced folder. Raw outputs in plaintext formats like \texttt{.tex} and \texttt{.eps} can be created from most analytical software and managed with Git. -Storing plaintext outputs on Git makes it easier to identify changes that affect results. +Tracking plaintext outputs with Git makes it easier to identify changes that affect results. If you are re-running all of your code from the master when significant changes to the code are made, the outputs will be overwritten, and changes in coefficients and number of observations, for example, will be highlighted for you to review. 
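Before turning to research design, it may help to pull several of this chapter's recommendations together in one place. The chapter refers to \texttt{stata-master-dofile.do} without reproducing it here, so the following is only a rough sketch of what a master script can look like; all folder names, file names, usernames, and the version number are placeholders rather than the book's actual template:

    * master.do (a minimal sketch of a master script; placeholder names throughout)

    * PART 0: Hold the code environment constant across users and machines
    ieboilstart , version(13.1)
    `r(version)'

    * PART 1: Set the root folder once per user, using forward slashes
    if c(username) == "ra_name" global projectfolder "C:/Users/ra_name/Dropbox/project-title"
    if c(username) == "pi_name" global projectfolder "/Users/pi_name/Dropbox/project-title"

    * PART 2: Define the DataWork folder structure from the root
    global datawork "${projectfolder}/DataWork"
    global outputs  "${datawork}/Output"

    * PART 3: Run the project's do-files in order
    do "${datawork}/Code/cleaning.do"      // inputs: raw data; outputs: cleaned data set
    do "${datawork}/Code/construction.do"  // inputs: cleaned data; outputs: analysis data set
    do "${datawork}/Code/analysis.do"      // inputs: analysis data; outputs: tables and figures

Anyone on the team should then be able to reproduce the full project by adding a single line with their own username and running this one file.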
From 6679036cc48b25bd0fc5a3170d8f5e7befe85ee6 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Thu, 31 Oct 2019 10:26:46 +0530 Subject: [PATCH 091/854] Intro and organization --- chapters/research-design.tex | 179 ++++++++++++++++++++++------------- manuscript.tex | 2 +- 2 files changed, 114 insertions(+), 67 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 35fc02d17..907acf506 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -1,4 +1,4 @@ -%------------------------------------------------ +%----------------------------------------------------------------------------------------------- \begin{fullwidth} Research design is the process of structuring field work @@ -6,49 +6,89 @@ that will answer a specific research question. You don't need to be an expert in this, and there are lots of good resources out there -that focus on designing interventions and evaluations. -This section will present a very brief overview -of the most common methods that are used, -so that you can have an understanding of -how to construct appropriate counterfactuals, -data structures, and the corresponding code tools -as you are setting up your data structure -before going to the field to collect data. - -You can categorize most research questions into one of two main types. -There are \textbf{cross-sectional}, descriptive, and observational analyses, -which seek only to describe something for the first time, -such as the structure or variation of a population. -We will not describe these here, -because there are endless possibilities -and they tend to be sector-specific. -For all sectors, however, there are also causal research questions, -both experimental and quasi-experimental, -which rely on establishing \textbf{exogenous variation} in some input -to draw a conclusion about the impact of its effect -on various outcomes of interest. -We'll focus on these \textbf{causal designs}, since the literature -offers a standardized set of approaches, with publications -and code tools available to support your work. +that focus on designing interventions and evaluations +as well as on econometric approaches. +This section will present a brief overview +of the most common methods that are used in development research. +Specifically, we will introduce you to several ``causal inference'' methods +that are frequently used to understand the impact of real development programs. +The intent is for you to obtain an understanding of +the way in which each method constructs treatment and control groups, +the data structures needed to estimate the corresponding effects, +and any available code tools that will assist you in this process. + +This is important to understand before going into the field for several reasons. +If you do not understand how to calculate the correct estimator for your study, +you will not be able to assess the power of your research design. +You will also be unable to make tradeoffs in the field +when you inevitable have to allocate scarce resources +between tasks like maximizing sample size +and ensuring follow-up with specific individuals. +You will save a lot of time by understanding the way +your data needs to be organized and set up as it comes in +before you will be able to calculate meaningful results. 
+Just as importantly, understanding each of these approaches +will allow you to keep your eyes open for research opportunities: +many of the most interesting projects occur because people in the field +recognize the opportunity to implement one of these methods on the fly +in response to an unexpected event in the field. +While somewhat more conceptual than practical, +a basic understanding of your project's chosen approach will make you +much more effective at the analytical part of your work. \end{fullwidth} -%------------------------------------------------ - -\section{Counterfactuals and treatment effects} - -In causal analysis, a researcher is attempting to obtain estimates -of a specific \textbf{treatment effect}, or the change in outcomes -\index{treatment effect} -caused by a change in exposure to some intervention or circumstance.\cite{abadie2018econometric} -In the potential outcomes framework, -we can never observe this directly: -we never see the same person in both their treated and untreated state.\sidenote{\url{http://www.stat.columbia.edu/~cook/qr33.pdf}} -Instead, we make inferences from samples: -we try to devise a comparison group that evidence suggests -would be identical to the treated group had they not been treated. +%----------------------------------------------------------------------------------------------- +%----------------------------------------------------------------------------------------------- + +\section{Causality, inference, and identification} + +The primary goal of research design is to establish \textbf{identification} +for a parameter of interest -- that is, to demonstrate +a source of variation in a particular input that has no other possible channel +to alter a particular outcome, in order to assert that some change in that outcome +was caused by that change in the input. + \index{identification} +When we are discussing the types of inputs commonly referred to as +``programs'' or ``interventions'', we are typically attempting to obtain estimates +of a program-specific \textbf{treatment effect}, or the change in outcomes +directly attributable to exposure to what we call the \textbf{treatment}.\cite{abadie2018econometric} + \index{treatment effect} +You can categorize most research designs into one of two main types: +\textbf{experimental} designs, in which the research team +is directly responsible for creating the variation in treatment, +and \textbf{quasi-experimental} designs, in which the team +identifies a ``natural'' source of variation and uses it for identification. +Nearly all methods can fall into either category. + + +%----------------------------------------------------------------------------------------------- +\subsection{Defining treatment and control groups} + +The key assumption behind estimating treatment effects is that every +person, facility, village, or whatever the unit of intervention is +has two possible states: their outcome if they do not recieve the treatment +and their outcome if they do recieve the treatment. +Each unit's treatment effect is the difference between these two states, +and the true \textbf{average treatment effect} is the average of all of +these differences across the potentially treated population. 
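In the standard potential-outcomes notation (a hedged aside added here for clarity; these symbols are not used elsewhere in the chapter), each unit $i$ has a treated outcome $Y_i(1)$ and an untreated outcome $Y_i(0)$, so that the quantities just described are
\[
\tau_i = Y_i(1) - Y_i(0)
\qquad \textrm{and} \qquad
ATE = E\big[\,Y_i(1) - Y_i(0)\,\big].
\]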
+ \index{average treatment effect} + +In reality, we never see the same unit in both their treated and untreated state simultaneously, +so measuring and averaging these effects directly is impossible.\sidenote{ + \url{http://www.stat.columbia.edu/~cook/qr33.pdf}} +Instead, we typically make inferences from samples. +\textbf{Causal inference} methods are those in which we are able to estimate the +average treatment effect without observing individual-level effects, +but can obtain it from some comparison of averages with a \textbf{control} group. + \index{causal inference}\index{control group} +Each method is based around a way of comparing another set of observations -- +the ``control'' observations -- with the treatment group, +which would have identical to the treated group in the absence of the treatment. +Therefore, almost all of these designs can be accurately described +as a series of between-group comparisons.\sidenote{ + \url{http://nickchk.com/econ305.html}} This \textbf{control group} serves as a counterfactual to the treatment group, -\index{control group}\index{treatment group} and we compare the distributions of outcomes within each to make a computation of how different the groups are from each other. \textit{Causal Inference} and \textit{Causal Inference: The Mixtape} @@ -74,17 +114,14 @@ \section{Counterfactuals and treatment effects} such as stratification and clustering, and ensuring that time trends are handled sensibly. We aren't even going to get into regression models here. -Almost all experimental designs can be accurately described -as a series of between-group comparisons.\sidenote{\url{http://nickchk.com/econ305.html}} The models you will construct and estimate are intended to do two things: to express the intention of your research design, and to help you group the potentially endless concepts of field reality into intellectually tractable categories. In other words, these models tell the story of your research design. -%------------------------------------------------ - -\section{Experimental research designs} +%----------------------------------------------------------------------------------------------- +\subsection{Experimental research designs} Experimental research designs explicitly allow the research team to change the condition of the populations being studied,\sidenote{\url{https://dimewiki.worldbank.org/wiki/Experimental_Methods}} @@ -101,6 +138,32 @@ \section{Experimental research designs} \textbf{difference-in-difference} (``panel-data'' studies), and \textbf{regression discontinuity} (``cutoff'' studies). +%----------------------------------------------------------------------------------------------- +\subsection{Quasi-experimental research designs} + +\textbf{Quasi-experimental} research designs,\sidenote{ +\url{https://dimewiki.worldbank.org/wiki/Quasi-Experimental_Methods}} +are inference methods based on methods other than explicit experimentation. +Instead, they rely on ``experiments of nature'', +in which natural variation can be argued to approximate +the type of exogenous variations in circumstances +that a researcher would attempt to create with an experiment.\cite{dinardo2016natural} + +Unlike with planned experimental designs, +quasi-experimental designs typically require the extra luck +of having data collected at the right times and places +to exploit events that occurred in the past. 
+Therefore, these methods often use either secondary data, +or use primary data in a cross-sectional retrospective method, +applying additional corrections as needed to make +the treatment and comparison groups plausibly identical. + +%----------------------------------------------------------------------------------------------- +%----------------------------------------------------------------------------------------------- +\section{Working with specific research designs} + + +%----------------------------------------------------------------------------------------------- \subsection{Cross-sectional RCTs} \textbf{Cross-sectional RCTs} are the simplest possible study design: @@ -133,6 +196,7 @@ \subsection{Cross-sectional RCTs} may reduce the variance of estimates, but there is debate on the importance of these tests and corrections. +%----------------------------------------------------------------------------------------------- \subsection{Differences-in-differences} \textbf{Differences-in-differences}\sidenote{ @@ -168,6 +232,7 @@ \subsection{Differences-in-differences} and it is the \textit{interaction} of treatment and time indicators that we interpret as the differential effect of the treatment assignment. +%----------------------------------------------------------------------------------------------- \subsection{Regression discontinuity} \textbf{Regression discontinuity (RD)} designs differ from other RCTs @@ -198,27 +263,7 @@ \subsection{Regression discontinuity} has to be decided and tested against various options for robustness. The rest of the model depends largely on the design and execution of the experiment. -%------------------------------------------------ - -\section{Quasi-experimental designs} - -\textbf{Quasi-experimental} research designs,\sidenote{ -\url{https://dimewiki.worldbank.org/wiki/Quasi-Experimental_Methods}} -are inference methods based on methods other than explicit experimentation. -Instead, they rely on ``experiments of nature'', -in which natural variation can be argued to approximate -the type of exogenous variations in circumstances -that a researcher would attempt to create with an experiment.\cite{dinardo2016natural} - -Unlike with planned experimental designs, -quasi-experimental designs typically require the extra luck -of having data collected at the right times and places -to exploit events that occurred in the past. -Therefore, these methods often use either secondary data, -or use primary data in a cross-sectional retrospective method, -applying additional corrections as needed to make -the treatment and comparison groups plausibly identical. - +%----------------------------------------------------------------------------------------------- \subsection{Instrumental variables} Instrumental variables designs utilize variation in an @@ -246,6 +291,7 @@ \subsection{Instrumental variables} usually those backed by extensive qualitative analysis, are acceptable as high-quality evidence. +%----------------------------------------------------------------------------------------------- \subsection{Matching estimators} \textbf{Matching} estimators rely on the assumption that, @@ -272,6 +318,7 @@ \subsection{Matching estimators} One solution, as with the experimental variant of 2SLS proposed above, is to incorporate matching models into explicitly experimental designs. 
+%-----------------------------------------------------------------------------------------------
 \subsection{Synthetic controls}
 
 \textbf{Synthetic controls} methods\cite{abadie2015comparative}
diff --git a/manuscript.tex b/manuscript.tex
index d2ec49a8b..f8cfffc15 100644
--- a/manuscript.tex
+++ b/manuscript.tex
@@ -50,7 +50,7 @@ \chapter{Chapter 2: Planning data work before going to field}
 % CHAPTER 3
 %----------------------------------------------------------------------------------------
 
-\chapter{Chapter 3: Designing research for causal inference}
+\chapter{Chapter 3: Structuring data for causal inference}
 \label{ch:3}
 
 \input{chapters/research-design.tex}
 
From c839831877fe8785e63aee14e5d4827fbb0c036a Mon Sep 17 00:00:00 2001
From: Benjamin Daniels
Date: Thu, 31 Oct 2019 11:01:35 +0530
Subject: [PATCH 092/854] Estimating treatment effects

---
 chapters/research-design.tex | 94 +++++++++++++++++++-----------------
 manuscript.tex               |  2 +-
 2 files changed, 52 insertions(+), 44 deletions(-)

diff --git a/chapters/research-design.tex b/chapters/research-design.tex
index 907acf506..1771a97ee 100644
--- a/chapters/research-design.tex
+++ b/chapters/research-design.tex
@@ -60,20 +60,34 @@ \section{Causality, inference, and identification}
 identifies a ``natural'' source of variation and uses it for identification.
 Nearly all methods can fall into either category.
 
-
 %-----------------------------------------------------------------------------------------------
-\subsection{Defining treatment and control groups}
+\subsection{Estimating treatment effects using control groups}
 
 The key assumption behind estimating treatment effects is that every
-person, facility, village, or whatever the unit of intervention is
-has two possible states: their outcome if they do not recieve the treatment
-and their outcome if they do recieve the treatment.
-Each unit's treatment effect is the difference between these two states,
-and the true \textbf{average treatment effect} is the average of all of
+person, facility, or village (or whatever the unit of intervention is)
+has two possible states: their outcomes if they do not receive some treatment
+and their outcomes if they do receive that treatment.
+Each unit's treatment effect is the individual difference between these two states,
+and the \textbf{average treatment effect (ATE)} is the average of all of
 these differences across the potentially treated population.
 	\index{average treatment effect}
+This is the most common parameter that research designs aim to estimate.
+In most designs, the goal is to establish a ``counterfactual scenario'' for the treatment group
+with which outcomes can be directly compared.
+There are several resources that provide more or less mathematically intensive
+approaches to understanding how various methods do this.
+\textit{Causal Inference} and \textit{Causal Inference: The Mixtape} +provides a detailed practical introduction to and history of +each of these methods.\sidenote{ + \url{https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/} + \\ \noindent \url{http://scunning.com/cunningham_mixtape.pdf}} +\textit{Mostly Harmless Econometrics} and \textit{Mastering Metrics} +are canonical treatments of the mathematics behind all econometric approaches.\sidenote{ + \url{https://www.researchgate.net/publication/51992844\_Mostly\_Harmless\_Econometrics\_An\_Empiricist's\_Companion} + \\ \noindent \url{http://assets.press.princeton.edu/chapters/s10363.pdf}} + +Intuitively, the problem is as follows: we can never observe the same unit +in both their treated and untreated states simultaneously, so measuring and averaging these effects directly is impossible.\sidenote{ \url{http://www.stat.columbia.edu/~cook/qr33.pdf}} Instead, we typically make inferences from samples. @@ -81,44 +95,38 @@ \subsection{Defining treatment and control groups} average treatment effect without observing individual-level effects, but can obtain it from some comparison of averages with a \textbf{control} group. \index{causal inference}\index{control group} -Each method is based around a way of comparing another set of observations -- -the ``control'' observations -- with the treatment group, -which would have identical to the treated group in the absence of the treatment. -Therefore, almost all of these designs can be accurately described +Every research design is based around a way of comparing another set of observations -- +the ``control'' observations -- against the treatment group. +They all work to establish that the control observations would have been +identical \textit{on average} to the treated group in the absence of the treatment. +Then, the mathematical properties of averages implies that the calculated +difference in averages is equivalent to the average difference: +exactly the parameter we are seeking to estimate. +Therefore, almost all designs can be accurately described as a series of between-group comparisons.\sidenote{ \url{http://nickchk.com/econ305.html}} -This \textbf{control group} serves as a counterfactual to the treatment group, -and we compare the distributions of outcomes within each -to make a computation of how different the groups are from each other. -\textit{Causal Inference} and \textit{Causal Inference: The Mixtape} -provides a detailed practical introduction to and history of -each of these methods, so we will only introduce you to -them very abstractly in this chapter.\sidenote{\url{https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/} -\\ \noindent \url{http://scunning.com/cunningham_mixtape.pdf}} -Each of the methods described in this chapter -relies on some variant of this basic strategy. -In counterfactual causal analysis, -the econometric models and estimating equations +Most of the methods that you will encounter rely on some variant of this strategy, +which is designed to maximize the ability to estimate the effect +of an average unit being offered the treatment being evaluated. +The focus on identification of the treatment effect, however, +means there are several essential features to this approach +that are not common in other types of statistical and data science work. 
+First, the econometric models and estimating equations used do not attempt to create a predictive or comprehensive model -of how the outcome of interest is generated -- -typically we do not care about measures of fit or predictive accuracy -like R-squared values or root mean square errors. -Instead, the econometric models desribed here aim to -correctly describe the experimental design being used, -so that the correct estimate of the difference -between the treatment and control groups is obtained -and can be interpreted as the effect of the treatment on outcomes. - -Correctly describing the experiment means accounting for design factors -such as stratification and clustering, and -ensuring that time trends are handled sensibly. -We aren't even going to get into regression models here. -The models you will construct and estimate are intended to do two things: -to express the intention of your research design, -and to help you group the potentially endless concepts of field reality -into intellectually tractable categories. -In other words, these models tell the story of your research design. +of how the outcome of interest is generated. +Typically, these designs are not interested in predictive accuracy, +and the estimates and predictions that these models produce +will not be as good at predicting outcomes or fitting the data as other models. +Additionally, when control variables or other variables are used in estimation, +there is no guarantee that those parameters are marginal effects. +They can only be interpreted as correlative averages, +unless the experimenter has additional sources of identification for them. +The models you will construct and estimate are intended to do exactly one thing: +to express the intention of your project's research design, +and to accurately estimate the effect of the treatment it is evaluating. +In other words, these models tell the story of the research design +in a way that clarifies the exact comparison being made between control and treatment. %----------------------------------------------------------------------------------------------- \subsection{Experimental research designs} diff --git a/manuscript.tex b/manuscript.tex index f8cfffc15..bdd777bb4 100644 --- a/manuscript.tex +++ b/manuscript.tex @@ -50,7 +50,7 @@ \chapter{Chapter 2: Planning data work before going to field} % CHAPTER 3 %---------------------------------------------------------------------------------------- -\chapter{Chapter 3: Structuring data for causal inference} +\chapter{Chapter 3: Evaluating impacts of research designs} \label{ch:3} \input{chapters/research-design.tex} From c99af818a621c22a888edfa534e294cd07e9f448 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Thu, 31 Oct 2019 11:43:40 +0530 Subject: [PATCH 093/854] Experiments --- chapters/research-design.tex | 90 +++++++++++++++++++++++++----------- 1 file changed, 64 insertions(+), 26 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 1771a97ee..3b63182a7 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -129,42 +129,80 @@ \subsection{Estimating treatment effects using control groups} in a way that clarifies the exact comparison being made between control and treatment. 
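+
+To make the comparison-of-averages logic above concrete, a minimal sketch
+in standard potential-outcomes notation (the symbols here are generic,
+not specific to any one study) writes $\tau$ for the average treatment effect
+and $Y_i(1), Y_i(0)$ for unit $i$'s treated and untreated potential outcomes:
+\[
+\tau \;=\; E[Y_i(1) - Y_i(0)] \;=\; E[Y_i(1)] - E[Y_i(0)],
+\]
+and when treatment status $D_i$ is assigned independently of potential outcomes,
+as in a randomized experiment,
+\[
+E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0] \;=\; E[Y_i(1)] - E[Y_i(0)] \;=\; \tau,
+\]
+so a simple difference in group means recovers the average treatment effect.
+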
%----------------------------------------------------------------------------------------------- -\subsection{Experimental research designs} +\subsection{Experimental and quasi-experimental research designs} Experimental research designs explicitly allow the research team -to change the condition of the populations being studied,\sidenote{\url{https://dimewiki.worldbank.org/wiki/Experimental_Methods}} -in the form of NGO programs, government regulations, +to change the condition of the populations being studied,\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Experimental_Methods}} +often in the form of NGO programs, government regulations, information campaigns, and many more types of interventions.\cite{banerjee2009experimental} -The classic method is the \textbf{randomized control trial (RCT)}.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Randomized_Control_Trials}} -(Not everyone agrees this is the best way to do research.\sidenote{\url{https://www.nber.org/papers/w14690.pdf}}) -\index{randomized control trial} -There treatment and control groups are drawn from the same underlying population -so that the strong condition of statistical equality -in the absence of the experiment can be assumed. -Three RCT-based methods are discussed here: -\textbf{cross-sectional randomization} (``endline-only'' studies), -\textbf{difference-in-difference} (``panel-data'' studies), -and \textbf{regression discontinuity} (``cutoff'' studies). - -%----------------------------------------------------------------------------------------------- -\subsection{Quasi-experimental research designs} +The classic method is the \textbf{randomized control trial (RCT)}.\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Randomized_Control_Trials}} + \index{randomized control trials} +In randomized control trials, the control group is randomized -- +that is, from an eligible population, +a random subset of units are not given access to the treatment, +so that they may serve as a counterfactual for those who are. +A randomized control group, intuitively, is meant to represent +how things would have turned out for the treated group +if they had not been treated, and it is particularly effective at doing so +as evidenced by its broad credibility in fields ranging from clinical medicine to development. +However, there are many types of treatments that are impractical or unethical +to effectively approach using an experimental strategy, +and therefore many limitations to accessing ``big questions'' +through RCT approaches.\sidenote{ + \url{https://www.nber.org/papers/w14690.pdf}} + +Randomized designs all share several major statistical concerns. +The first is the fact that it is always possible to select a control group, +by chance, which was not in fact going to be very similar to the treatment group. +This feature is called randomization noise, and all RCTs share the need to understand +how randomization noise may impact the estimates that are obtained. +Second, takeup and implementation fidelity are extremely important, +since programs will by definition have no effect +if they are not in fact accepted by or delivered to +the people who are supposed to recieve them. 
+Unfortunately, these effects kick in very quickly and are highly nonlinear: +70\% takeup or efficacy doubles the required sample, and 50\% quadruples it.\sidenote{ + \url{https://blogs.worldbank.org/impactevaluations/power-calculations-101-dealing-with-incomplete-take-up}} +Such effects are also very hard to correct ex post, +since they require strong assumptions about the randomness or non-randomness of takeup. +Therefore a large amount of field time and descriptive work +must be dedicated to understanding how these effects played out in a given study, +and often overshadow the effort put into the econometric design itself. \textbf{Quasi-experimental} research designs,\sidenote{ -\url{https://dimewiki.worldbank.org/wiki/Quasi-Experimental_Methods}} -are inference methods based on methods other than explicit experimentation. + \url{https://dimewiki.worldbank.org/wiki/Quasi-Experimental_Methods}} +by contrast, are inference methods based on events not controlled by the research team. Instead, they rely on ``experiments of nature'', in which natural variation can be argued to approximate -the type of exogenous variations in circumstances +the type of exogenous variation in treatment availability that a researcher would attempt to create with an experiment.\cite{dinardo2016natural} - -Unlike with planned experimental designs, +Unlike with carefully planned experimental designs, quasi-experimental designs typically require the extra luck -of having data collected at the right times and places -to exploit events that occurred in the past. +of having access to data collected at the right times and places +to exploit events that occurred in the past, +or having the ability to collect data in a time and place +dictated by the availability of identification. Therefore, these methods often use either secondary data, -or use primary data in a cross-sectional retrospective method, -applying additional corrections as needed to make -the treatment and comparison groups plausibly identical. +or use primary data in a cross-sectional retrospective method. + +Quasi-experimental designs therefore can access a much broader range of questions, +and with much less effort in terms of executing an intervention. +However, they require in-depth understanding of the precise events +the researcher wishes to address in order to know what data to collect +and how to model the underlying natural experiment. +Additionally, because the population who will have been exposed +to such events is limited by the scale of the event, +quasi-experimental designs are often power-constrained. +There is nothing the research team can do to increase power +by providing treatment to more people or expanding the control group: +instead, power is typically maximized by ensuring +that sampling is carried out effectively +and that attrition from the sampled groups is dealt with effectively. +Sampling noise and survey non-response are therefore analogous +to the randomization noise and implementation failures +that can be observed in RCT designs, and have similar implications for field work. 
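+
+As a rough sketch of how sharply takeup erodes power, this back-of-the-envelope
+arithmetic can be checked with Stata's built-in \texttt{power} command
+(the 0.2 standard-deviation effect size here is purely illustrative):
+\begin{verbatim}
+* Required sample for a 0.2 SD effect at 80% power...
+power twomeans 0 0.20, sd(1) power(0.8)
+* ...for the same program with 70% takeup (diluted effect 0.7 x 0.2 = 0.14)
+power twomeans 0 0.14, sd(1) power(0.8)
+* ...and with 50% takeup (diluted effect 0.5 x 0.2 = 0.10)
+power twomeans 0 0.10, sd(1) power(0.8)
+\end{verbatim}
+Because the required sample scales with the inverse square of the detectable effect,
+the second and third calculations return roughly two and four times
+the first sample size, matching the intuition above.
+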
%----------------------------------------------------------------------------------------------- %----------------------------------------------------------------------------------------------- From 41510d5d6fd295303e680a698a5994e5161f9525 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 1 Nov 2019 12:07:11 +0530 Subject: [PATCH 094/854] RCTs --- chapters/research-design.tex | 82 ++++++++++++++++++++++++------------ manuscript.tex | 2 +- 2 files changed, 55 insertions(+), 29 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 3b63182a7..3713e05d1 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -206,41 +206,67 @@ \subsection{Experimental and quasi-experimental research designs} %----------------------------------------------------------------------------------------------- %----------------------------------------------------------------------------------------------- -\section{Working with specific research designs} +\section{Obtaining treatment effects from specific research designs} %----------------------------------------------------------------------------------------------- -\subsection{Cross-sectional RCTs} +\subsection{Cross-sectional randomized control trials (RCTs)} \textbf{Cross-sectional RCTs} are the simplest possible study design: a program is implemented, surveys are conducted, and data is analyzed. -The randomization process -draws the treatment and control groups from the same underlying population. -This implies the groups' outcome means would be identical in expectation -before intervention, and would have been identical at measurement -- -therefore, differences are due to the effect of the intervention. -Cross-sectional data is simple because -for research teams do not need track individuals over time, -or analyze attrition and follow-up other than non-response. -Cross-sectional designs can have a time dimension; -they are then called ``repeated cross-sections'', -but do not imply a panel structure for individual observations. - -Typically, the cross-sectional model is developed -only with controls for the research design. -\textbf{Balance checks}\sidenote{\url{https://dimewiki.worldbank.org/wiki/iebaltab}} can be utilized, but an effective experiment -can use \textbf{stratification} (sometimes called blocking) aggressively\sidenote{\url{https://blogs.worldbank.org/impactevaluations/impactevaluations/how-randomize-using-many-baseline-variables-guest-post-thomas-barrios}} to ensure balance -before data is collected.\cite{athey2017econometrics} -\index{balance} -Stratification disaggregates a single experiment to a collection -of smaller experiments by conducting randomization within -``sufficiently similar'' strata groups. -Adjustments for balance variables are never necessary in RCTs, +The randomization process constructs the control group at random +from the population that is eligible to recieve each treatment. +Therefore, by construction, each unit's receipt of the treatment +is unrelated to any of its other characteristics +and the ordinary least squares (OLS) regression +of outcome on treatment, without any control variables, +is an unbiased estimate of the average treatment effect. +Cross-sectional data is simple to handle because +for research teams do not need track anything over time. +A cross-section is simply a representative set of observations +taken at a single point in time. 
+If this point in time is after a treatment has been fully delivered,
+then the outcome values at that point in time
+already reflect the effect of the treatment.
+
+What needs to be carefully maintained in data for cross-sectional RCTs
+is the treatment randomization process itself,
+as well as detailed field data about differences
+in data quality and loss to follow-up across groups.\cite{athey2017econometrics}
+Only these details are needed to construct the appropriate estimator:
+clustering of the estimate is required at the level
+at which the treatment is assigned to observations,
+and controls are required for variables which
+were used to stratify the treatment (in the form of strata fixed effects).\sidenote{
+	\url{https://blogs.worldbank.org/impactevaluations/impactevaluations/how-randomize-using-many-baseline-variables-guest-post-thomas-barrios}}
+\textbf{Randomization inference} can be used
+to estimate the underlying variability in the randomization process
+(more on this in the next chapter).
+\textbf{Balance checks} are typically reported as evidence of an effective randomization,
+and are particularly important when the design is quasi-experimental
+(since then the randomization process cannot be simulated explicitly).
+However, controls for balance variables are usually unnecessary in RCTs,
+because it is certain that the true data-generating process
+has no correlation between the treatment and the balance factors.\sidenote{
+	\url{https://blogs.worldbank.org/impactevaluations/should-we-require-balance-t-tests-baseline-observables-randomized-experiments}}
+
+Analysis is typically straightforward with a strong understanding of the randomization.
+A typical analysis will include a description of the sampling and randomization process,
+summary statistics for the eligible population,
+balance checks for randomization and sample selection,
+a primary regression specification (with multiple hypotheses appropriately adjusted),
+additional specifications with adjustments for attrition, balance, and other potential contamination,
+and randomization-inference analysis or other placebo regression approaches.
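+
+As a minimal sketch of such a primary specification -- here with simulated data
+and hypothetical variable names, not a template from this book's packages --
+the regression with strata fixed effects and clustered standard errors might look like:
+\begin{verbatim}
+* Simulated data purely for illustration
+clear
+set seed 650
+set obs 1000
+gen strata     = ceil(runiform()*10)
+gen cluster_id = ceil(runiform()*50)
+gen treatment  = runiform() < 0.5
+gen outcome    = 0.2*treatment + rnormal()
+
+* Primary specification: strata fixed effects, with standard errors
+* clustered at the level at which treatment was assigned
+regress outcome i.treatment i.strata, vce(cluster cluster_id)
+\end{verbatim}
+In a real design the treatment would of course be assigned within strata
+and at the cluster level, rather than drawn independently as in this simulation.
+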
+There are a number of tools that are available +to help with the complete process of data collection,\sidenote{ + \url{https://toolkit.povertyactionlab.org/resource/coding-resources-randomized-evaluations}} +to analyze balance,\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/iebaltab}} +and to visualize treatment effects.\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/iegraph}} +Tools and methods for analyzing selective attrition are available.\sidenote{ + \url{https://blogs.worldbank.org/impactevaluations/dealing-attrition-field-experiments}} + %----------------------------------------------------------------------------------------------- \subsection{Differences-in-differences} diff --git a/manuscript.tex b/manuscript.tex index bdd777bb4..7ef175dc4 100644 --- a/manuscript.tex +++ b/manuscript.tex @@ -50,7 +50,7 @@ \chapter{Chapter 2: Planning data work before going to field} % CHAPTER 3 %---------------------------------------------------------------------------------------- -\chapter{Chapter 3: Evaluating impacts of research designs} +\chapter{Chapter 3: Evaluating impact through research design} \label{ch:3} \input{chapters/research-design.tex} From b51b3677cdab7157ea3d5523ae5696ffd04d4f13 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Sat, 2 Nov 2019 08:23:29 +0530 Subject: [PATCH 095/854] Diff-in-diff --- bibliography.bib | 11 ++++ chapters/research-design.tex | 108 ++++++++++++++++++++++++----------- 2 files changed, 86 insertions(+), 33 deletions(-) diff --git a/bibliography.bib b/bibliography.bib index 3a2272318..741246b65 100644 --- a/bibliography.bib +++ b/bibliography.bib @@ -1,3 +1,14 @@ +@article{schulz2010consort, + title={CONSORT 2010 statement: updated guidelines for reporting parallel group randomised trials}, + author={Schulz, Kenneth F and Altman, Douglas G and Moher, David}, + journal={BMC medicine}, + volume={8}, + number={1}, + pages={18}, + year={2010}, + publisher={BioMed Central} +} + @article{blischak2016quick, title={A quick introduction to version control with {Git} and {GitHub}}, author={Blischak, John D and Davenport, Emily R and Wilson, Greg}, diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 3713e05d1..3b434e800 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -210,12 +210,13 @@ \section{Obtaining treatment effects from specific research designs} %----------------------------------------------------------------------------------------------- -\subsection{Cross-sectional randomized control trials (RCTs)} +\subsection{Cross-sectional designs} -\textbf{Cross-sectional RCTs} are the simplest possible study design: +\textbf{Cross-sectional} surveys are the simplest possible study design: a program is implemented, surveys are conducted, and data is analyzed. -The randomization process constructs the control group at random +When it is an RCT, a randomization process constructs the control group at random from the population that is eligible to recieve each treatment. +When it is observational, we present other evidence that a similar equivalence holds. 
 Therefore, by construction, each unit's receipt of the treatment
 is unrelated to any of its other characteristics
 and the ordinary least squares (OLS) regression
@@ -268,42 +268,83 @@ \subsection{Cross-sectional randomized control trials (RCTs)}
 Tools and methods for analyzing selective attrition are available.\sidenote{
 	\url{https://blogs.worldbank.org/impactevaluations/dealing-attrition-field-experiments}}
 
 %-----------------------------------------------------------------------------------------------
 \subsection{Differences-in-differences}
 
-\textbf{Differences-in-differences}\sidenote{
-\url{https://dimewiki.worldbank.org/wiki/Difference-in-Differences}}
-\index{differences-in-differences}
-(abbreviated as DD, DiD, diff-in-diff, and other variants)
-deals with the construction of controls differently:
-it uses a panel data structure to additionally use each
-unit in the pre-treatment phase as an additional control for itself post-treatment (the first difference),
-then comparing that mean change with the control group (the second difference).\cite{mckenzie2012beyond}
-Therefore, rather than relying entirely on treatment-control balance for identification,
-this class of designs intends to test whether \textit{changes}
-in outcomes over time were different in the treatment group than the control group.\sidenote{\url{https://blogs.worldbank.org/impactevaluations/often-unspoken-assumptions-behind-difference-difference-estimator-practice}}
-The primary identifying assumption for diff-in-diff is \textbf{parallel trends},
-the idea that the change in all groups over time would have been identical
-in the absence of the treatment.
-
-Diff-in-diff experiments therefore require substantially more effort
-in the field work portion, so that the \textbf{panel} of observations is well-constructed.\sidenote{\url{https://www.princeton.edu/~otorres/Panel101.pdf}}
+Where cross-sectional designs draw their estimates of treatment effects
+from differences in outcome levels in a single measurement,
+\textbf{differences-in-differences}\sidenote{
+	\url{https://dimewiki.worldbank.org/wiki/Difference-in-Differences}}
+designs (abbreviated as DD, DiD, diff-in-diff, and other variants)
+estimate treatment effects from \textit{changes} in outcomes
+between two or more rounds of measurement.
+	\index{differences-in-differences}
+In these designs, three control groups are used --
+the baseline level of treatment units,
+the baseline level of non-treatment units,
+and the endline level of non-treatment units.\sidenote{
+	\url{https://www.princeton.edu/~otorres/DID101.pdf}}
+The estimated treatment effect is the excess change
+of units that receive the treatment, as they receive it:
+calculating that value is equivalent to taking
+the difference in means at endline and subtracting
+the difference in means at baseline
+(giving the name of a ``difference-in-differences'').\cite{mckenzie2012beyond}
+The regression model includes a control variable for treatment assignment,
+and a control variable for the measurement round,
+but the treatment effect estimate corresponds to
+an interaction variable for treatment and round:
+the group of observations for which the treatment is active.
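+
+A minimal sketch of this interaction specification, again using simulated data
+and hypothetical variable names rather than any package from this book, is:
+\begin{verbatim}
+* Simulated two-round panel purely for illustration
+clear
+set seed 215
+set obs 500
+gen id    = _n
+gen treat = runiform() < 0.5
+expand 2
+bysort id: gen post = _n - 1
+gen y = 1 + 0.5*post + 0.2*treat + 0.4*treat*post + rnormal()
+
+* Difference-in-differences: the coefficient on the interaction term
+* (1.treat#1.post) is the estimated treatment effect
+regress y i.treat##i.post, vce(cluster id)
+\end{verbatim}
+The main effects absorb the group difference and the common time trend,
+so only the interaction term is interpreted as the effect of the treatment.
+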
+This model critically depends on the assumption that,
+in the absence of the treatment,
+the two groups would have changed performance at the same rate over time,
+typically referred to as the \textbf{parallel trends} assumption.\sidenote{
+	\url{https://blogs.worldbank.org/impactevaluations/often-unspoken-assumptions-behind-difference-difference-estimator-practice}}
+
+There are two main types of data structures for differences-in-differences:
+\textbf{repeated cross-sections} and \textbf{panel data}.
+In repeated cross-sections, each round contains a random sample
+of observations from the treated and untreated groups;
+as in cross-sectional designs, both the randomization and sampling processes,
+as well as their execution in the field,
+are critically important to maintain alongside the survey results.
+In panel data structures, we attempt to observe the exact same units
+in the repeated rounds, so that we see the same individuals
+both before and after they have received treatment (or not).\sidenote{
+	\url{https://blogs.worldbank.org/impactevaluations/what-are-we-estimating-when-we-estimate-difference-differences}}
+This allows each unit's baseline outcome to be used
+as an additional control for its endline outcome,
+a \textbf{fixed effects} design often referred to as an ANCOVA model,
+which can provide large increases in power and robustness.\sidenote{
+	\url{https://blogs.worldbank.org/impactevaluations/another-reason-prefer-ancova-dealing-changes-measurement-between-baseline-and-follow}}
+When tracking individuals over rounds for this purpose,
+maintaining sampling and tracking records is especially important,
+because attrition and loss-to-follow-up will remove that unit's information
+from all rounds of observation, not just the one they are unobserved in.
+Panel-style experiments therefore require substantially more effort
+in the field work portion.\sidenote{
+	\url{https://www.princeton.edu/~otorres/Panel101.pdf}}
 Since baseline and endline data collection may be far apart,
 it is important to create careful records during the first round
 so that follow-ups can be conducted with the same subjects,
-and \textbf{attrition} across rounds can be properly taken into account.\sidenote{\url{http://blogs.worldbank.org/impactevaluations/dealing-attrition-field-experiments}}
-Depending on the distribution of results,
-estimates may become completely uninformative
-with relatively little loss to follow-up.
-
-The diff-in-diff model is a four-way comparison.\sidenote{\url{https://dimewiki.worldbank.org/wiki/ieddtab}}
-The experimental design intends to compare treatment to control,
-after taking out the pre-levels for both.\sidenote{\url{https://www.princeton.edu/~otorres/DID101.pdf}}
-Therefore the model includes a time period indicator,
-a treatment group indicator (the pre-treatment control is the base level),
-and it is the \textit{interaction} of treatment and time indicators
-that we interpret as the differential effect of the treatment assignment.
+and attrition across rounds can be properly taken into account.\sidenote{
+	\url{http://blogs.worldbank.org/impactevaluations/dealing-attrition-field-experiments}}
+
+As with cross-sectional designs, this set of study designs is widespread.
+Therefore there exist a large number of standardized tools for analysis.
+Our \texttt{ietoolkit} package includes the \texttt{ieddtab} command +which produces standardized tables for reporting results.\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/ieddtab}} +For more complicated versions of the model +(and they can get quite complicated quite quickly), +you can use an online dashboard to simulate counterfactual results.\sidenote{ + \url{https://blogs.worldbank.org/impactevaluations/econometrics-sandbox-event-study-designs-co}} +As in cross-sectional designs, these main specifications +will always be accompanied by balance checks (using baseline values), +as well as randomization, selection, and attrition analysis. +In trials of this type, reporting experimental design and execution +using the CONSORT style is common in many disciplines +and will help you to track your data over time.\cite{schulz2010consort} %----------------------------------------------------------------------------------------------- \subsection{Regression discontinuity} From b648242693129fb29abd51a8b81df5cf0ce1e912 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Sat, 2 Nov 2019 09:30:10 +0530 Subject: [PATCH 096/854] RD --- bibliography.bib | 21 +++++++++ chapters/research-design.tex | 85 ++++++++++++++++++++++++------------ 2 files changed, 79 insertions(+), 27 deletions(-) diff --git a/bibliography.bib b/bibliography.bib index 741246b65..68cca0540 100644 --- a/bibliography.bib +++ b/bibliography.bib @@ -1,3 +1,24 @@ +@article{calonico2019regression, + title={Regression discontinuity designs using covariates}, + author={Calonico, Sebastian and Cattaneo, Matias D and Farrell, Max H and Titiunik, Rocio}, + journal={Review of Economics and Statistics}, + volume={101}, + number={3}, + pages={442--451}, + year={2019}, + publisher={MIT Press} +} + +@article{hausman2018regression, + title={Regression discontinuity in time: Considerations for empirical applications}, + author={Hausman, Catherine and Rapson, David S}, + journal={Annual Review of Resource Economics}, + volume={10}, + pages={533--552}, + year={2018}, + publisher={Annual Reviews} +} + @article{schulz2010consort, title={CONSORT 2010 statement: updated guidelines for reporting parallel group randomised trials}, author={Schulz, Kenneth F and Altman, Douglas G and Moher, David}, diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 3b434e800..e48ba3176 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -349,33 +349,64 @@ \subsection{Differences-in-differences} %----------------------------------------------------------------------------------------------- \subsection{Regression discontinuity} -\textbf{Regression discontinuity (RD)} designs differ from other RCTs -\index{regression discontinuity} -in that the treatment group is not directly randomly assigned, -even though it is often applied in the context of a specific experiment.\sidenote{ -\url{https://dimewiki.worldbank.org/wiki/Regression_Discontinuity}} -(In practice, many RDs are quasi-experimental, but this section -will treat them as though they are designed by the researcher.) -In an RD design, there is a \textbf{running variable} -which gives eligible people access to some program, -and a strict cutoff determines who is included.\cite{lee2010regression} -This is usually justified by budget limitations. -The running variable should not be the outcome of interest, -and while it can be time, that may require additional modeling assumptions. 
-Those who qualify are given the intervention and those who don't are not;
-this process substitutes for explicit randomization.\sidenote{\url{http://blogs.worldbank.org/impactevaluations/regression-discontinuity-porn}}
-
-For example, imagine that there is a strict income cutoff created
-for a program that subsidizes some educational resources.
-Here, income is the running variable.
-The intuition is that the people who are ``barely eligible''
-should not in reality be very different from those who are ``barely ineligible'',
-and that resulting differences between them at measurement
-are therefore due to the intervention or program.\cite{imbens2008regression}
-For the modeling component, the \textbf{bandwidth},
-or the size of the window around the cutoff to use,
-has to be decided and tested against various options for robustness.
-The rest of the model depends largely on the design and execution of the experiment.
+\textbf{Regression discontinuity (RD)} designs exploit sharp breaks or limits
+in policy designs to separate a group of potentially eligible recipients
+into comparable groups of individuals who do and do not receive a treatment.\sidenote{
+	\url{https://dimewiki.worldbank.org/wiki/Regression_Discontinuity}}
+These types of designs differ from cross-sectional and diff-in-diff designs
+in that the group eligible to receive treatment is not defined directly,
+but instead created during the process of the treatment implementation.
+	\index{regression discontinuity}
+In an RD design, there is typically some program or event
+which has limited availability due to practical considerations or policy choices
+and is therefore made available only to individuals who meet a certain threshold requirement.
+The intuition of this design is that there is an underlying \textbf{running variable}
+which serves as the sole determinant of access to the program,
+and a strict cutoff determines the value of this variable at which eligibility stops.\cite{imbens2008regression}
+Common examples are test score thresholds, income thresholds, and some types of lotteries.\sidenote{
+	\url{http://blogs.worldbank.org/impactevaluations/regression-discontinuity-porn}}
+The idea is that individuals who are just on the receiving side of the threshold
+will be very nearly indistinguishable from those on the non-receiving side,
+and their post-treatment outcomes are therefore directly comparable.\cite{lee2010regression}
+The key assumption here is that the running variable cannot be directly manipulated
+by the potential recipients; if the running variable is time, there are special considerations.\cite{hausman2018regression}
+
+Regression discontinuity designs are, once implemented,
+very similar in analysis to cross-sectional or differences-in-differences designs.
+Depending on the data that is available,
+the analytical approach will center on the comparison of individuals
+who are narrowly on the inclusion side of the discontinuity,
+compared against those who are narrowly on the exclusion side.
+The regression model will be identical to the matching research designs
+(ie, contingent whether data has one or more rounds
+and whether the same units are known to be observed repeatedly).
+The treatment effect will be identified, however, by the addition of a control
+for the running variable -- meaning that the treatment effect variable
+will only be applicable for observations in a small window around the cutoff.
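+
+A minimal sketch of a sharp RD estimate, using simulated data and the
+user-written \texttt{rdrobust} package (one of the packages referenced below;
+install it with \texttt{ssc install rdrobust}), might look like:
+\begin{verbatim}
+* Simulated sharp RD purely for illustration:
+* units at or above the cutoff (zero) receive the treatment
+clear
+set seed 349
+set obs 1000
+gen score = rnormal()
+gen treat = score >= 0
+gen y     = 1 + 0.8*score + 0.5*treat + rnormal()
+
+* Local-polynomial estimate with data-driven bandwidth selection
+rdrobust y score, c(0)
+
+* Visualize the discontinuity
+rdplot y score, c(0)
+\end{verbatim}
+The bandwidth and polynomial choices reported by the command are exactly
+the tuning parameters discussed next.
+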
+In the RD model, the functional form of that control and the size of that window +(often referred to as the choice of \textbf{bandwidth} for the design) +are the critical parameters for the result.\cite{calonico2019regression} +Therefore, RD analysis often includes extensive robustness checking +using a variety of both functional forms and bandwidths, +as well as placebo testing for non-realized locations of the cutoff +(conceptually similar to the idea of randomization inference). + +In the analytical stage, regression discontinuity designs +often include a large component of visual evidence presentation.\sidenote{ + \url{http://faculty.smu.edu/kyler/courses/7312/presentations/baumer/Baumer\_RD.pdf}} +These presentations help to suggest both the functional form +of the underlying relationship and the type of change observed at the discontinuity, +and help to avoid pitfalls in modelling that are difficult to detect with hypothesis tests.\sidenote{ + \url{http://econ.lse.ac.uk/staff/spischke/ec533/RD.pdf}} +Because these designs are so flexible compared to others, +there is an extensive set of commands that help assess +the efficacy and results from these designs under various assumptions.\sidenote{ + \url{https://sites.google.com/site/rdpackages/}} +These packages support the testing and reporting +of robust plotting and estimation procedures, +tests for manipulation of the running variable, +and tests for power, sample size, and randomization inference approaches +that will complement the main regression approach used for point estimates. %----------------------------------------------------------------------------------------------- \subsection{Instrumental variables} From ca43e574ce3655777b43f091f3c93a556e8e7ce4 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Sat, 2 Nov 2019 09:34:53 +0530 Subject: [PATCH 097/854] RD --- chapters/research-design.tex | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index e48ba3176..1b2d88848 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -376,7 +376,8 @@ \subsection{Regression discontinuity} Depending on the data that is available, the analytical approach will center on the comparison of individuals who are narrowly on the inclusion side of the discontinuity, -compared against those who are narrowly on the exclusion side. +compared against those who are narrowly on the exclusion side.\sidenote{ + \url{https://cattaneo.princeton.edu/books/Cattaneo-Idrobo-Titiunik_2019_CUP-Vol1.pdf}} The regression model will be identical to the matching research designs (ie, contingent whether data has one or more rounds and whether the same units are known to be observed repeatedly). 
From 4d76c5e518584d309e768bbcb30f1548c0c43856 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Sat, 2 Nov 2019 09:35:30 +0530 Subject: [PATCH 098/854] Typo --- chapters/research-design.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 1b2d88848..404e3f1b9 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -377,7 +377,7 @@ \subsection{Regression discontinuity} the analytical approach will center on the comparison of individuals who are narrowly on the inclusion side of the discontinuity, compared against those who are narrowly on the exclusion side.\sidenote{ - \url{https://cattaneo.princeton.edu/books/Cattaneo-Idrobo-Titiunik_2019_CUP-Vol1.pdf}} + \url{https://cattaneo.princeton.edu/books/Cattaneo-Idrobo-Titiunik_2019\_CUP-Vol1.pdf}} The regression model will be identical to the matching research designs (ie, contingent whether data has one or more rounds and whether the same units are known to be observed repeatedly). From 89ab2a166c331ce0615de23a2901b3e105e20056 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Sat, 2 Nov 2019 09:59:22 +0530 Subject: [PATCH 099/854] Identification --- chapters/research-design.tex | 25 ++++++++++++++++++++++--- 1 file changed, 22 insertions(+), 3 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 404e3f1b9..c62832a1d 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -53,12 +53,28 @@ \section{Causality, inference, and identification} of a program-specific \textbf{treatment effect}, or the change in outcomes directly attributable to exposure to what we call the \textbf{treatment}.\cite{abadie2018econometric} \index{treatment effect} -You can categorize most research designs into one of two main types: +When identification is believed, then we can say with confidence +that our estimate of the treatment effect would, +with an infinite amount of data, +give us a precise estimate of that treatment effect. +Under this condition, we can proceed to draw evidence from the limited samples we have access to, +using statistical techniques to express the uncertainty of not having infinite data. +Without identification, we cannot say that the estimate would be accurate, +even with unlimited data, and therefore cannot associate it to the treatment +in the small samples that we typically have access to. +Conversely, more data is not a substitute for a well-identified experimental design. +Therefore it is important to understand how exactly your study +identifies its estimate of treatment effects, +so you can calculate and interpret those estimates appropriately. +All the study designs we discuss here use the \textbf{potential outcomes} framework +to compare the a group that recieved some treatment to another, counterfactual group. +Each of these types of approaches can be used in two contexts: \textbf{experimental} designs, in which the research team is directly responsible for creating the variation in treatment, and \textbf{quasi-experimental} designs, in which the team identifies a ``natural'' source of variation and uses it for identification. -Nearly all methods can fall into either category. +Neither type of approach is implicitly better or worse, +and both are capable of achieving effect identification under different contexts. 
%----------------------------------------------------------------------------------------------- \subsection{Estimating treatment effects using control groups} @@ -412,7 +428,10 @@ \subsection{Regression discontinuity} %----------------------------------------------------------------------------------------------- \subsection{Instrumental variables} -Instrumental variables designs utilize variation in an +Instrumental variables designs, unlike the previous set, begin by assuming +that the treatment delivered in the study in question is +inextricably linked to the outcomes and therefore not directly identifiable. +utilize variation in an otherwise-unrelated predictor of exposure to a treatment condition as an ``instrument'' for the treatment condition itself.\sidenote{\url{https://dimewiki.worldbank.org/wiki/instrumental_variables}} \index{instrumental variables} From 447ed759f0b263acfbac93bccf559b5c2bd08519 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Sun, 3 Nov 2019 11:53:56 +0530 Subject: [PATCH 100/854] IV --- chapters/research-design.tex | 65 +++++++++++++++++++++++------------- 1 file changed, 41 insertions(+), 24 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index c62832a1d..8317a409c 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -428,33 +428,50 @@ \subsection{Regression discontinuity} %----------------------------------------------------------------------------------------------- \subsection{Instrumental variables} -Instrumental variables designs, unlike the previous set, begin by assuming -that the treatment delivered in the study in question is +\textbf{Instrumental variables (IV)} designs, unlike the previous approaches, +begins by assuming that the treatment delivered in the study in question is inextricably linked to the outcomes and therefore not directly identifiable. -utilize variation in an -otherwise-unrelated predictor of exposure to a treatment condition -as an ``instrument'' for the treatment condition itself.\sidenote{\url{https://dimewiki.worldbank.org/wiki/instrumental_variables}} -\index{instrumental variables} -The simplest example is actually experimental -- -in a randomization design, we can use instrumental variables -based on an \textit{offer} to join some program, -rather than on the actual inclusion in the program.\cite{angrist2001instrumental} -The reason for doing this is that the \textbf{second stage} -of actual program takeup may be severely self-selected, -making the group of program participants in fact -wildly different from the group of non-participants.\sidenote{\url{http://www.rebeccabarter.com/blog/2018-05-23-instrumental_variables/}} -The corresponding \textbf{two-stage-least-squares (2SLS)} estimator\sidenote{\url{http://www.nuff.ox.ac.uk/teaching/economics/bond/IV\%20Estimation\%20Using\%20Stata.pdf}} -solves this by conditioning on only the random portion of takeup -- -in this case, the randomized offer of enrollment in the program. 
- -Unfortunately, instrumental variables designs are known -to have very high variances relative to \textbf{ordinary least squares}.\cite{young2017consistency} +Instead, similar to regression discontinuity designs, +it attempts to focus on a subset of the variation in treatment uptake +and assesses that limited window of variation that can be argued +to be unrelated to other factors.\cite{angrist2001instrumental} +To so so, the IV approach selects an \textbf{instrument} +for the treatment status -- an otherwise-unrelated predictor of exposure to treatment +that affects the uptake status of an individual.\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/instrumental_variables}} +Whereas regression discontinuity designs are ``sharp'' -- +treatment status is completely determined by which side of a cutoff an individual is on -- +IV designs are ``fuzzy'', meaning that they do not completely determine +the treatment status but instead influence the \textit{probability} of treatment. + +As in regression discontinuity designs, +the fundamental form of the regression +is similar to either cross-sectional or differences-in-differences designs. +However, instead of controlling for the running variable directly, +the IV approach typically uses the \textbf{two-stage-least-squares (2SLS)} estimator.\sidenote{ + \url{http://www.nuff.ox.ac.uk/teaching/economics/bond/IV\%20Estimation\%20Using\%20Stata.pdf}} +This estimator forms a prediction of the probability that the unit recieves treatment +based on a regression against the instrumental variable. +That prediction will, by assumption, be the portion of the actual treatment +that is due to the instrument and not any other source, +and since the instrument is unrelated to all other factors, +this portion of the treatment can be used to assess its effects. +Unfortunately, these estimators are known +to have very high variances relative other methods, +particularly when the relationship between the intrument and the treatment is small.\cite{young2017consistency} IV designs furthermore rely on strong but untestable assumptions about the relationship between the instrument and the outcome.\cite{bound1995problems} -Therefore IV designs face special scrutiny, -and only the most believable designs, -usually those backed by extensive qualitative analysis, -are acceptable as high-quality evidence. +Therefore IV designs face intense scrutiny on the strength and exogeneity of the instrument, +and tests for sensitivity to alternative specifications and samples +are usually required with an instrumental variables analysis. +However, the method has special experimental cases that are significantly easier to assess: +for example, a randomized treatment \textit{assignment} can be used as an instrument +for the eventual uptake of the treatment itself, +especially in cases where uptake is expected to be low, +or in circumstances where the treatment is available +to those who are not specifically assigned to it (``encouragement designs''). 
+ +In practice, there %----------------------------------------------------------------------------------------------- \subsection{Matching estimators} From 4e29565611ccae7d7d163328bf092357aeab266e Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 5 Nov 2019 18:53:20 +0530 Subject: [PATCH 101/854] IV analysis --- bibliography.bib | 13 +++++++++++++ chapters/research-design.tex | 18 +++++++++++++++++- 2 files changed, 30 insertions(+), 1 deletion(-) diff --git a/bibliography.bib b/bibliography.bib index 68cca0540..49a37dcb3 100644 --- a/bibliography.bib +++ b/bibliography.bib @@ -1,3 +1,16 @@ +@inbook {stock2005weak, + title = {Testing for Weak Instruments in Linear IV Regression}, + booktitle = {Identification and Inference for Econometric Models}, + year = {2005}, + pages = {80-108}, + publisher = {Cambridge University Press}, + organization = {Cambridge University Press}, + address = {New York}, + url = {http://www.economics.harvard.edu/faculty/stock/files/TestingWeakInstr_Stock\%2BYogo.pdf}, + author = {James Stock and Motohiro Yogo}, + editor = {Donald W.K. Andrews} +} + @article{calonico2019regression, title={Regression discontinuity designs using covariates}, author={Calonico, Sebastian and Cattaneo, Matias D and Farrell, Max H and Titiunik, Rocio}, diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 8317a409c..7298251b9 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -471,7 +471,23 @@ \subsection{Instrumental variables} or in circumstances where the treatment is available to those who are not specifically assigned to it (``encouragement designs''). -In practice, there +In practice, there are a variety of packages that can be used +to analyse data and report results from instrumental variables designs. +While the built-in command \texttt{ivregress} will often be used +to create the final results, these are not sufficient on their own. +The \textbf{first stage} of the design should be extensively tested, +to demonstrate the strength of the relationship between +the instrument and the treatment variable being instrumented.\cite{stock2005weak} +This can be done using the \texttt{weakiv} and \texttt{weakivtest} commands.\sidenote{ + \url{https://www.carolinpflueger.com/WangPfluegerWeakivtest_20141202.pdf}} +Additionally, tests should be run that identify and exclude individual +observations or clusters that have extreme effects on the estimator, +using customized bootstrap or leave-one-out approaches. 
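A minimal sketch of that workflow, assuming hypothetical variables \texttt{y}, \texttt{takeup}, an instrument \texttt{z}, controls \texttt{x1 x2}, and clusters defined by \texttt{village}, might look like the following; the user-written \texttt{weakivtest} command can additionally be run after estimation for a more robust weak-instrument test.

    * Sketch only: variable names are hypothetical
    ivregress 2sls y x1 x2 (takeup = z), vce(cluster village)
    estat firststage                      // built-in first-stage diagnostics

    * A simple leave-one-out sensitivity check across clusters
    levelsof village, local(villages)
    foreach v of local villages {
        quietly ivregress 2sls y x1 x2 (takeup = z) if village != `v', vce(cluster village)
        display "Excluding village `v': b = " %9.3f _b[takeup]
    }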
+Finally, bounds can be constructed allowing for imperfections +in the exogeneity of the instrument using loosened assumptions, +particularly when the underlying instrument is not directly randomized.\sidenote{ + \url{http://www.damianclarke.net/research/papers/practicalIV-CM.pdf}} + %----------------------------------------------------------------------------------------------- \subsection{Matching estimators} From 5c11de53d75864ea5a94df5971328dc4d7df53bc Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 5 Nov 2019 20:49:10 +0530 Subject: [PATCH 102/854] Matching --- chapters/research-design.tex | 72 ++++++++++++++++++++++++++---------- 1 file changed, 52 insertions(+), 20 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 7298251b9..a4c15e9d6 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -490,31 +490,63 @@ \subsection{Instrumental variables} %----------------------------------------------------------------------------------------------- -\subsection{Matching estimators} - -\textbf{Matching} estimators rely on the assumption that, -\index{matching} -conditional on some observable characteristics, -untreated units can be compared to treated units, -as if the treatment had been fully randomized.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Matching}} -In other words, they assert that differential takeup -is sufficiently predictable by observed characteristics. -These assertions are somewhat testable,\sidenote{\url{https://dimewiki.worldbank.org/wiki/iematch}} -and there are a large number of ``treatment effect'' -packages devoted to standardizing reporting of various tests.\sidenote{\url{http://fmwww.bc.edu/repec/usug2016/drukker_uksug16.pdf}} - +\subsection{Matching} + +\textbf{Matching} methods use observable characteristics of individuals +to directly construct treatment and control groups as similar as possible +to each other, either before a randomization process +or after the collection of non-randomized data.\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Matching}} + \index{matching} +Matching observations may be one-to-one or many-to-many; +in any case, the result of a matching process +is similar in concept to the use of randomization strata +in simple randomized control trials. +In this way, the method can be conceptualized +as averaging across the results of a large number of ``micro-experiments'' +in which the randomized units are verifiably similar aside from the treatment. + +When matching is performed before a randomization process, +it can be done on any observable characteristics, +including outcomes, if they are available. +The randomization should then record an indicator for the matching group. +This approach is stratification taken to its most extreme: +it reduces the number of potential randomizations dramatically +from the possible number that would be available +if the matching was not conducted, +and therefore reduces the variance caused by the study design. +When matching is done ex post in order to substitute for randomization, +it is based on the assertion that within the matched groups, +the assignment of treatment is as good as random. However, since most matching models rely on a specific linear model, such as the typical \textbf{propensity score matching} estimator, they are open to the criticism of ``specification searching'', meaning that researchers can try different models of matching -until one, by chance, leads to the final result that was desired. 
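One simple sketch of matching ahead of a randomization, assuming a hypothetical baseline variable \texttt{baseline\_score} and a unique \texttt{id}, is to pair adjacent observations and assign one unit per pair to treatment, keeping the pair indicator for use as a stratum control later:

    * Illustrative sketch only: variable names and the seed are hypothetical
    sort baseline_score id                  // id breaks ties so the order is reproducible
    generate pair_id = ceil(_n/2)           // adjacent observations form a matched pair
    set seed 215780                         // example seed; draw your own at random
    generate double rand = runiform()
    bysort pair_id (rand): generate treatment = (_n == 1)   // one treated unit per pair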
+until one, by chance, leads to the final result that was desired; +analytical approaches have shown that the better the fit of the matching model, +the more likely it is that it has arisen by chance and is therefore biased. Newer methods, such as \textbf{coarsened exact matching},\cite{iacus2012causal} -are designed to remove some of the modelling, -such that simple differences between matched observations -are sufficient to estimate treatment effects -given somewhat weaker assumptions on the structure of that effect. -One solution, as with the experimental variant of 2SLS proposed above, -is to incorporate matching models into explicitly experimental designs. +are designed to remove some of the dependence on linearity. +In all ex-post cases, pre-specification of the exact matching model +can prevent some of the potential criticisms on this front, +but ex-post matching in general is not regarded as a strong approach. + +Analysis of data from matching designs is relatively straightforward; +the simplest design only requires controls (indicator variables) for each group +or, in the case of propensity scoring and similar approaches, +weighting the data appropriately in order to balance the analytical samples on the selected variables. +The \texttt{teffects} suite in stata provides a wide variety +of estimators and analytical tools for various designs.\sidenote{ + \url{https://ssc.wisc.edu/sscc/pubs/stata_psmatch.htm}} +The coarsened exact matching (`cem`) package applies the nonparametric approach.\sidenote{ + \url{https://gking.harvard.edu/files/gking/files/cem-stata.pdf}} +DIME's \texttt{iematch} package produces matchings based on a single continuous matching variable.\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/iematch}} +In any of these cases, detailed reporting of the matching model is required, +including the resulting effective weights of observations, +since in some cases the lack of overlapping supports for treatment and control +mean that a large number of observations will be weighted near zero +and the estimated effect will be generated based on a subset of the data. %----------------------------------------------------------------------------------------------- \subsection{Synthetic controls} From 40892c41982bf6cced0e157bedf0c4334539aedb Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 5 Nov 2019 21:10:00 +0530 Subject: [PATCH 103/854] Synth --- chapters/research-design.tex | 44 ++++++++++++++++++++++++++---------- 1 file changed, 32 insertions(+), 12 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index a4c15e9d6..55002ecd1 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -551,15 +551,35 @@ \subsection{Matching} %----------------------------------------------------------------------------------------------- \subsection{Synthetic controls} -\textbf{Synthetic controls} methods\cite{abadie2015comparative} -\index{synthetic controls} -are designed for a particularly interesting situation: -one where useful controls for an intervention simply do not exist. 
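As a hedged sketch of the estimation commands mentioned above, with hypothetical variables \texttt{y} (outcome), \texttt{treat} (treatment), and matching covariates \texttt{x1 x2}:

    * Sketch only: variable names are hypothetical
    teffects psmatch (y) (treat x1 x2), atet      // propensity-score matching estimate of the ATT
    teffects nnmatch (y x1 x2) (treat), atet      // nearest-neighbor matching on covariates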
-Canonical examples are policy changes at state or national levels,
-since at that scope there are no other units quite like
-the one that was affected by the policy change
-(much less sufficient \textit{N} for a regression estimation).\cite{gobillon2016regional}
-In this method, \textbf{time series data} is almost always required,
-and the control comparison is contructed by creating
-a linear combination of other units such that pre-treatment outcomes
-for the treated unit are best approximated by that specific combination.
+\textbf{Synthetic control} is a relatively new method
+for the case when appropriate counterfactual individuals
+do not exist in reality and there are very few treatment units (often only one).\cite{abadie2015comparative}
+	\index{synthetic controls}
+For example, state- or national-level policy changes
+are typically very difficult to find valid comparators for,
+since the set of potential comparators is usually small and diverse
+and therefore there are no close matches to the treated unit.
+Intuitively, the synthetic control method works
+by constructing a counterfactual version of the treated unit
+using an average of the other units available.
+This is a particularly effective approach
+when the lower-level components of the units would be directly comparable:
+people, households, business, and so on in the case of states and countries;
+os passengers or cargo shipments in the case of transport corridors, for example.\cite{gobillon2016regional}
+This is because in those situations the average of the untreated units
+can be thought of as balancing by matching the composition of the treated unit.
+
+To construct this estimator, the synthetic controls method requires
+a significant amount of retrospective data on the treatment unit and possible comparators,
+including historical data on the outcome of interest for all units.
+The counterfactual blend is chosen by optimizing the prediction of past outcomes
+based on the potential input characteristics,
+and typically selects a small set of comparators to weight into the final analysis.
+These datasets therefore may not have a large number of variables or observations,
+but the extent of the time series both before and after the implementation
+of the treatment are the key sources of power for the estimate.
+Visualizations are often excellent demonstrations of these results.
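A hedged sketch of how such an estimate is typically set up with the \texttt{synth} command discussed below, assuming yearly panel data declared with \texttt{tsset}, a single treated unit coded \texttt{id == 3} and first treated in 1989, and hypothetical predictors \texttt{x1 x2}:

    * Sketch only: unit codes, years, and predictors are hypothetical
    * ssc install synth                    // user-written package
    tsset id year
    synth y x1 x2 y(1985) y(1988), trunit(3) trperiod(1989) fig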
+The `synth` package provides functionality for use in Stata, +although since there are a large number of possible parameters +and implementations of the design it can be complex to operate.s\sidenote{ + \url{https://web.stanford.edu/~jhain/synthpage.html}} From 6223fb4c60731d2e2febb92b6309acc06bcde72f Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 8 Nov 2019 13:47:37 +0530 Subject: [PATCH 104/854] Abadie citation (#262) --- bibliography.bib | 11 +++++++++++ chapters/research-design.tex | 2 +- 2 files changed, 12 insertions(+), 1 deletion(-) diff --git a/bibliography.bib b/bibliography.bib index 49a37dcb3..bf6935128 100644 --- a/bibliography.bib +++ b/bibliography.bib @@ -1,3 +1,14 @@ +@article{abadie2010synthetic, + title={Synthetic control methods for comparative case studies: Estimating the effect of California’s tobacco control program}, + author={Abadie, Alberto and Diamond, Alexis and Hainmueller, Jens}, + journal={Journal of the American statistical Association}, + volume={105}, + number={490}, + pages={493--505}, + year={2010}, + publisher={Taylor \& Francis} +} + @inbook {stock2005weak, title = {Testing for Weak Instruments in Linear IV Regression}, booktitle = {Identification and Inference for Econometric Models}, diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 55002ecd1..621a7580d 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -561,7 +561,7 @@ \subsection{Synthetic controls} and therefore there are no close matches to the treated unit. Intuitively, the synthetic control method works by constructing a counterfactual version of the treated unit -using an average of the other units available. +using an average of the other units available.\cite{abadie2010synthetic} This is a particularly effective approach when the lower-level components of the units would be directly comparable: people, households, business, and so on in the case of states and countries; From 932c6e2f5a7b08f4676d00cf67b924d3ff4cf8f9 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 8 Nov 2019 13:50:18 +0530 Subject: [PATCH 105/854] Spatial RD (#220) --- chapters/research-design.tex | 2 ++ 1 file changed, 2 insertions(+) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 621a7580d..0979001a3 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -400,6 +400,8 @@ \subsection{Regression discontinuity} The treatment effect will be identified, however, by the addition of a control for the running variable -- meaning that the treatment effect variable will only be applicable for observations in a small window around the cutoff. 
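For estimation, a commonly used user-written package is \texttt{rdrobust}, which implements data-driven bandwidth selection; a hedged sketch with hypothetical variables \texttt{y} (outcome) and \texttt{score} (running variable, cutoff at zero) follows.

    * Sketch only: variable names and the cutoff are hypothetical
    * ssc install rdrobust                 // user-written package (also provides rdplot)
    rdplot y score, c(0)                   // visualize outcomes around the cutoff
    rdrobust y score, c(0)                 // local polynomial estimate with a data-driven bandwidth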
+(Spatial discontinuity designs are handled a bit differently due to their multidimensionality.\sidenote{ + \url{https://blogs.worldbank.org/impactevaluations/spatial-jumps}}) In the RD model, the functional form of that control and the size of that window (often referred to as the choice of \textbf{bandwidth} for the design) are the critical parameters for the result.\cite{calonico2019regression} From 99e635048e073a13e67ed8d03a94b0700fc051b8 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 8 Nov 2019 13:55:14 +0530 Subject: [PATCH 106/854] Clarify pre-trends under randomization (#218) --- chapters/research-design.tex | 3 +++ 1 file changed, 3 insertions(+) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 0979001a3..f38c563d1 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -316,6 +316,9 @@ \subsection{Differences-in-differences} the two groups would have changed performance at the same rate over time, typically referred to as the \textbf{parallel trends} assumption.\sidenote{ \url{https://blogs.worldbank.org/impactevaluations/often-unspoken-assumptions-behind-difference-difference-estimator-practice}} +Experimental approaches satisfy this requirement in expectation, +but a given randomization should still be checked for pre-trends +as an extension of balance checking. There are two main types of data structures for differences-in-differences: \textbf{repeated cross-sections} and \textbf{panel data}. From bdb7ec7bce71a2d75e598876d8130f4a754ed505 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 8 Nov 2019 13:56:21 +0530 Subject: [PATCH 107/854] Intuition of DD (#217) --- chapters/research-design.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index f38c563d1..b21773427 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -300,8 +300,8 @@ \subsection{Differences-in-differences} the baseline level of non-treatment units, and the endline level of non-treatment units.\sidenote{ \url{https://www.princeton.edu/~otorres/DID101.pdf}} -The estimated treatment effect is the excess change -of units that recieve the treatment, as they recieve it: +The estimated treatment effect is the excess growth +of units that recieve the treatment, in the period they recieve it: calculating that value is equivalent to taking the difference in means at endline and subtracting the difference in means at baseline From 8a2819155fd54062e32af768a717179dd92288dd Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 8 Nov 2019 13:57:58 +0530 Subject: [PATCH 108/854] Soften balance language (#216) --- chapters/research-design.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index b21773427..7f8c72ca0 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -259,7 +259,7 @@ \subsection{Cross-sectional designs} \textbf{Randomization inference} can be used to esetimate the underlying variability in the randomization process (more on this in the next chapter). -\textbf{Balance checks} are typically reported as evidence of an effective randomization, +\textbf{Balance checks} are often reported as evidence of an effective randomization, and are particularly important when the design is quasi-experimental (since then the randomization process cannot be simulated explicitly). 
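As a hedged sketch, a balance table for hypothetical baseline covariates \texttt{age}, \texttt{income}, and \texttt{educ} across a \texttt{treatment} indicator might be produced with DIME's \texttt{iebaltab} command (the options shown are illustrative and should be checked against the help file):

    * Sketch only: variable names are hypothetical
    * ssc install ietoolkit                // provides iebaltab
    iebaltab age income educ, grpvar(treatment) save("balance_table.xlsx") replace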
However, controls for balance variables are usually unnecessary in RCTs, From 91e9e3c047deb2988798a12ec449b3cbdfe36719 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 8 Nov 2019 14:05:25 +0530 Subject: [PATCH 109/854] Causal emphasis for RCTs (#89) --- chapters/research-design.tex | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 7f8c72ca0..288b9f720 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -163,6 +163,8 @@ \subsection{Experimental and quasi-experimental research designs} how things would have turned out for the treated group if they had not been treated, and it is particularly effective at doing so as evidenced by its broad credibility in fields ranging from clinical medicine to development. +Therefore RCTs are very popular tools for determining the causal impact +of specific prorgrams or policy interventions. However, there are many types of treatments that are impractical or unethical to effectively approach using an experimental strategy, and therefore many limitations to accessing ``big questions'' @@ -272,7 +274,7 @@ \subsection{Cross-sectional designs} summary statistics for the eligible population, balance checks for randomization and sample selection, a primary regression specification (with multiple hypotheses appropriately adjusted), -additional specifications with adjustments for attrition, balance, and other potential contamination, +additional specifications with adjustments for non-response, balance, and other potential contamination, and randomization-inference analysis or other placebo regression approaches. There are a number of tools that are available to help with the complete process of data collection,\sidenote{ @@ -281,7 +283,7 @@ \subsection{Cross-sectional designs} \url{https://dimewiki.worldbank.org/wiki/iebaltab}} and to visualize treatment effects.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/iegraph}} -Tools and methods for analyzing selective attrition are available.\sidenote{ +Tools and methods for analyzing selective non-response are available.\sidenote{ \url{https://blogs.worldbank.org/impactevaluations/dealing-attrition-field-experiments}} %----------------------------------------------------------------------------------------------- From d78c164a51f998ad1d667deda5edf1c3bbcc9f62 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 8 Nov 2019 14:10:34 +0530 Subject: [PATCH 110/854] Add IE in Practice (#87) --- chapters/research-design.tex | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 288b9f720..f03261086 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -92,13 +92,14 @@ \subsection{Estimating treatment effects using control groups} with which outcomes can be directly compared. There are several resources that provide more or less mathematically intensive approaches to understanding how various methods to his. 
+\textit{Impact Evaluation in Practice} is a strong general guide to these methods.\sidenote{ + \url{https://www.worldbank.org/en/programs/sief-trust-fund/publication/impact-evaluation-in-practice}} \textit{Causal Inference} and \textit{Causal Inference: The Mixtape} -provides a detailed practical introduction to and history of -each of these methods.\sidenote{ +provides more detailed mathematical approaches fo the tools.\sidenote{ \url{https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/} \\ \noindent \url{http://scunning.com/cunningham_mixtape.pdf}} \textit{Mostly Harmless Econometrics} and \textit{Mastering Metrics} -are canonical treatments of the mathematics behind all econometric approaches.\sidenote{ +are canonical treatments of the statistical principles behind all econometric approaches.\sidenote{ \url{https://www.researchgate.net/publication/51992844\_Mostly\_Harmless\_Econometrics\_An\_Empiricist's\_Companion} \\ \noindent \url{http://assets.press.princeton.edu/chapters/s10363.pdf}} From 2a736f0f911345a10f31ccc628e7d63756fddb37 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 8 Nov 2019 15:18:10 +0530 Subject: [PATCH 111/854] Typos --- chapters/research-design.tex | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index f03261086..1d92e4ba6 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -295,7 +295,7 @@ \subsection{Differences-in-differences} \textbf{differences-in-differences}\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Difference-in-Differences}} designs (abbreviated as DD, DiD, diff-in-diff, and other variants) -estimate treatment effects from /textit{changes} in outcomes +estimate treatment effects from \textit{changes} in outcomes between two or more rounds of measurement. \index{differences-in-differences} In these designs, three control groups are used – @@ -308,7 +308,7 @@ \subsection{Differences-in-differences} calculating that value is equivalent to taking the difference in means at endline and subtracting the difference in means at baseline -(giving the name of a ``difference-in-differences'').\cite{mckenzie2012beyond} +(hence the singular ``difference-in-differences'').\cite{mckenzie2012beyond} The regression model includes a control variable for treatment assignment, and a control variable for the measurement round, but the treatment effect estimate corresponds to @@ -343,7 +343,7 @@ \subsection{Differences-in-differences} maintaining sampling and tracking records is especially important, because attrition and loss-to-follow-up will remove that unit's information from all rounds of observation, not just the one they are unobserved in. -Panel-stype experiments therefore require substantially more effort +Panel-style experiments therefore require substantially more effort in the field work portion.\sidenote{ \url{https://www.princeton.edu/~otorres/Panel101.pdf}} Since baseline and endline data collection may be far apart, @@ -380,7 +380,7 @@ \subsection{Regression discontinuity} but instead created during the process of the treatment implementation. \index{regression discontinuity} In an RD design, there is typically some program or event -which has limited availability dye to practical considerations or poicy choices +which has limited availability due to practical considerations or poicy choices and is therefore made available only to individuals who meet a certain threshold requirement. 
The intuition of this design is that there is an underlying \textbf{running variable} which serves as the sole determinant of access to the program, @@ -437,13 +437,13 @@ \subsection{Regression discontinuity} \subsection{Instrumental variables} \textbf{Instrumental variables (IV)} designs, unlike the previous approaches, -begins by assuming that the treatment delivered in the study in question is +begin by assuming that the treatment delivered in the study in question is inextricably linked to the outcomes and therefore not directly identifiable. Instead, similar to regression discontinuity designs, -it attempts to focus on a subset of the variation in treatment uptake +IV attempts to focus on a subset of the variation in treatment uptake and assesses that limited window of variation that can be argued to be unrelated to other factors.\cite{angrist2001instrumental} -To so so, the IV approach selects an \textbf{instrument} +To do so, the IV approach selects an \textbf{instrument} for the treatment status -- an otherwise-unrelated predictor of exposure to treatment that affects the uptake status of an individual.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/instrumental_variables}} @@ -546,7 +546,7 @@ \subsection{Matching} The \texttt{teffects} suite in stata provides a wide variety of estimators and analytical tools for various designs.\sidenote{ \url{https://ssc.wisc.edu/sscc/pubs/stata_psmatch.htm}} -The coarsened exact matching (`cem`) package applies the nonparametric approach.\sidenote{ +The coarsened exact matching (\texttt{cem}) package applies the nonparametric approach.\sidenote{ \url{https://gking.harvard.edu/files/gking/files/cem-stata.pdf}} DIME's \texttt{iematch} package produces matchings based on a single continuous matching variable.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/iematch}} @@ -573,7 +573,7 @@ \subsection{Synthetic controls} This is a particularly effective approach when the lower-level components of the units would be directly comparable: people, households, business, and so on in the case of states and countries; -os passengers or cargo shipments in the case of transport corridors, for example.\cite{gobillon2016regional} +or passengers or cargo shipments in the case of transport corridors, for example.\cite{gobillon2016regional} This is because in those situations the average of the untreated units can be thought of as balancing by matching the composition of the treated unit. @@ -587,7 +587,7 @@ \subsection{Synthetic controls} but the extent of the time series both before and after the implementation of the treatment are the key sources of power for the estimate. Visualizations are often excellent demonstrations of these results. 
-The `synth` package provides functionality for use in Stata, +The \texttt{synth} package provides functionality for use in Stata, although since there are a large number of possible parameters -and implementations of the design it can be complex to operate.s\sidenote{ +and implementations of the design it can be complex to operate.\sidenote{ \url{https://web.stanford.edu/~jhain/synthpage.html}} From d6f5bd7355f8984d27439746ad8335061f2b1f9b Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 8 Nov 2019 15:57:39 +0530 Subject: [PATCH 112/854] Broad outline --- chapters/sampling-randomization-power.tex | 232 ++++++++++++---------- 1 file changed, 123 insertions(+), 109 deletions(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index c48292bcc..24f068323 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -1,4 +1,4 @@ -%------------------------------------------------ +%----------------------------------------------------------------------------------------------- \begin{fullwidth} Sampling, randomization, and power calculations are the core elements of experimental design. @@ -27,8 +27,11 @@ and maximizes the likelihood that reported effect sizes are accurate. \end{fullwidth} -%------------------------------------------------ -\section{Reproducibility in sampling, randomization, and power calculation} +%----------------------------------------------------------------------------------------------- + +\section{General principles of random processes} + +\subsection{Reproducibility in sampling, randomization, and power calculation} Reproducibility in statistical programming is absolutely essential.\cite{orozco2018make} This is especially true when simulating or analyzing random processes, @@ -76,8 +79,6 @@ \section{Reproducibility in sampling, randomization, and power calculation} for any random process. Note the three components: versioning, sorting, and seeding. Why are \texttt{check1} and \texttt{check3} the same? Why is \texttt{check2} different? -\codeexample{replicability.do}{./code/replicability.do} - Commands like \texttt{bys:} and \texttt{merge} will re-sort your data as part of their execution, To reiterate: any process that includes a random component is a random process, including sampling, randomization, power calculation, @@ -92,10 +93,11 @@ \section{Reproducibility in sampling, randomization, and power calculation} If there are any differences, the process has not reproduced, and \texttt{cf} will return an error, as shown here. -\codeexample{randomization-cf.do}{./code/randomization-cf.do} +%----------------------------------------------------------------------------------------------- + +\section{Simple sampling and randomization} -%------------------------------------------------ -\section{Sampling} +\subsection{Sampling} \textbf{Sampling} is the process of randomly selecting units of observation \index{sampling} @@ -122,7 +124,93 @@ \section{Sampling} instead of creating a \texttt{treatment} variable, create a \texttt{sample} variable. -\codeexample{simple-sample.do}{./code/simple-sample.do} +\section{Randomization} + +\textbf{Randomization} is the process of assigning units to some kind of treatment program. +Many of the Stata techniques shown here can also be used for sampling, +by understanding ``being included in the sample'' as a treatment in itself. 
+Randomization is used to assign treatment programs in development research +because it guarantees that, \textit{on average}, +the treatment will not be correlated with anything it did not cause.\cite{duflo2007using} +However, as you have just seen, +any random process induces noise: so does randomization. +You may get unlucky and create important correlations by chance -- +in fact, you can almost always identify some treatment assignment that +creates the appearance of statistical relationships that are not really there. +This section will show you how to assess and control this \textbf{randomization noise}. + +To do that, we create a randomization \textbf{program}, which +\index{programming} +allows us to re-run the randomization method many times +and assess the amount of randomization noise correctly.\sidenote{\url{https://data.princeton.edu/stata/programming}} +Storing the randomization code as a program allows us to access the whole code block +with a single line of code, so we can tinker with the randomization process +separately from its application to the data. +Programming takes a few lines of code that may be new to you, +but getting used to this structure is very useful. +A simple randomization program is shown below. +This code randomizes observations into two groups by combining +\texttt{xtile} and \texttt{recode}, +which can be extended to any proportions for any number of arms. + +%----------------------------------------------------------------------------------------------- + +\section{Clustering and stratification} + +To control randomization noise, we often use techniques +\index{clustering}\index{stratification} +that reduce the likelihood of a ``bad draw''.\cite{athey2017econometrics} +These techniques can be used in any random process, +including sampling; their implementation is nearly identical in code. + +\subsection{Clustering} + +Many studies collect data at a different level of observation than the randomization unit.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Unit_of_Observation}} +For example, a policy may only be able to affect an entire village, +but you are interested in household behavior. +This type of structure is called \textbf{clustering},\sidenote{\url{https://dimewiki.worldbank.org/wiki/Multi-stage_(Cluster)_Sampling}} +because the units are assigned to treatment in clusters. +Because the treatments are assigned in clusters, however, +there are in reality fewer randomized groups than the number of units in the data. +Therefore, standard errors for clustered designs must also be clustered, +at the level at which the treatment was assigned.\sidenote{\url{https://blogs.worldbank.org/impactevaluations/when-should-you-cluster-standard-errors-new-wisdom-econometrics-oracle}} + +Clustered randomization must typically be implemented manually; +it typically relies on subsetting the data intelligently +to the desired assignment levels. +We demonstrate here. + +\subsection{Stratification} + + +We mean this in a specific way: we want to exclude +randomizations with certain correlations, +or we want to improve the \textbf{balance} +of the average randomization draw.\cite{bruhn2009pursuit} +The most common is \textbf{stratification},\sidenote{\url{https://dimewiki.worldbank.org/wiki/Stratified_Random_Sample}} +which splits the sampling frame into ``similar'' groups -- \textbf{strata} -- +and randomizes \textit{within} each of these groups. 
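A hedged sketch of a stratified assignment using the user-written \texttt{randtreat} command discussed further below, assuming strata defined by a hypothetical \texttt{district} variable (option names follow the package documentation and should be confirmed against its help file):

    * Sketch only: the strata variable and seed are hypothetical
    * ssc install randtreat                // user-written package
    set seed 636584                        // example seed; draw your own at random
    randtreat, generate(treatment) strata(district) misfits(global)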
+It is important to preserve the overall likelihood for each unit to be included, +otherwise statistical corrections become necessary. +For a simple stratified randomization design, +it is necessary to include strata \textbf{fixed effects}, +or an indicator variable for each strata, in any final regression. +This accounts for the fact that randomizations were conducted within the strata, +by comparing each unit to the others within its own strata. + +However, manually implementing stratified randomization +is prone to error: in particular, it is difficult to precisely account +for the interaction of group sizes and multiple treatment arms, +particularly when a given strata can contain a small number of clusters, +and when there are a large number of treatment arms.\cite{carril2017dealing} +The user-written \texttt{randtreat} command +properly implements stratification, +with navigable options for handling common pitfalls. +We demonstrate the use of this command here. + +%----------------------------------------------------------------------------------------------- + +\section{Power calculation and randomization inference} The fundamental contribution of sampling to the power of a research design is this: if you randomly sample a set number of observations from a set frame, @@ -134,6 +222,8 @@ \section{Sampling} \textbf{sampling noise}, the uncertainty in statistical estimates caused by sampling. \index{sampling noise} +\subsection{Sampling error and randomization noise} + Portions of this noise can be reduced through design choices such as clustering and stratification. In general, all sampling requires \textbf{inverse probability weights}. @@ -163,8 +253,6 @@ \section{Sampling} give estimation results far from the true value, and others give results close to it. -\codeexample{sample-noise.do}{./code/sample-noise.do} - The output of the code is a distribution of means in sub-populations of the overall data. This distribution is centered around the true population mean, but its dispersion depends on the exact structure of the population. @@ -182,38 +270,6 @@ \section{Sampling} would lead to parameter estimates in the indicated range. This approach says nothing about the truth or falsehood of any hypothesis. -%------------------------------------------------ -\section{Randomization} - -\textbf{Randomization} is the process of assigning units to some kind of treatment program. -Many of the Stata techniques shown here can also be used for sampling, -by understanding ``being included in the sample'' as a treatment in itself. -Randomization is used to assign treatment programs in development research -because it guarantees that, \textit{on average}, -the treatment will not be correlated with anything it did not cause.\cite{duflo2007using} -However, as you have just seen, -any random process induces noise: so does randomization. -You may get unlucky and create important correlations by chance -- -in fact, you can almost always identify some treatment assignment that -creates the appearance of statistical relationships that are not really there. -This section will show you how to assess and control this \textbf{randomization noise}. 
- -To do that, we create a randomization \textbf{program}, which -\index{programming} -allows us to re-run the randomization method many times -and assess the amount of randomization noise correctly.\sidenote{\url{https://data.princeton.edu/stata/programming}} -Storing the randomization code as a program allows us to access the whole code block -with a single line of code, so we can tinker with the randomization process -separately from its application to the data. -Programming takes a few lines of code that may be new to you, -but getting used to this structure is very useful. -A simple randomization program is shown below. -This code randomizes observations into two groups by combining -\texttt{xtile} and \texttt{recode}, -which can be extended to any proportions for any number of arms. - -\codeexample{randomization-program-1.do}{./code/randomization-program-1.do} - With this program created and executed, the next part of the code, shown below, can set up for reproducibility. @@ -237,61 +293,7 @@ \section{Randomization} that randomization can spuriously produce between \texttt{price} and \texttt{treatment}. -\codeexample{randomization-program-2.do}{./code/randomization-program-2.do} - -\subsection{Clustering and stratification} - -To control randomization noise, we often use techniques -\index{clustering}\index{stratification} -that reduce the likelihood of a ``bad draw''.\cite{athey2017econometrics} -These techniques can be used in any random process, -including sampling; their implementation is nearly identical in code. -We mean this in a specific way: we want to exclude -randomizations with certain correlations, -or we want to improve the \textbf{balance} -of the average randomization draw.\cite{bruhn2009pursuit} -The most common is \textbf{stratification},\sidenote{\url{https://dimewiki.worldbank.org/wiki/Stratified_Random_Sample}} -which splits the sampling frame into ``similar'' groups -- \textbf{strata} -- -and randomizes \textit{within} each of these groups. -It is important to preserve the overall likelihood for each unit to be included, -otherwise statistical corrections become necessary. -For a simple stratified randomization design, -it is necessary to include strata \textbf{fixed effects}, -or an indicator variable for each strata, in any final regression. -This accounts for the fact that randomizations were conducted within the strata, -by comparing each unit to the others within its own strata. - -However, manually implementing stratified randomization -is prone to error: in particular, it is difficult to precisely account -for the interaction of group sizes and multiple treatment arms, -particularly when a given strata can contain a small number of clusters, -and when there are a large number of treatment arms.\cite{carril2017dealing} -The user-written \texttt{randtreat} command -properly implements stratification, -with navigable options for handling common pitfalls. -We demonstrate the use of this command here. - -\codeexample{randtreat-strata.do}{./code/randtreat-strata.do} - -Many studies collect data at a different level of observation than the randomization unit.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Unit_of_Observation}} -For example, a policy may only be able to affect an entire village, -but you are interested in household behavior. -This type of structure is called \textbf{clustering},\sidenote{\url{https://dimewiki.worldbank.org/wiki/Multi-stage_(Cluster)_Sampling}} -because the units are assigned to treatment in clusters. 
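One hedged sketch of that subsetting approach, assuming a household-level dataset with a hypothetical \texttt{village} identifier:

    * Sketch only: assigns treatment at the village level, then applies it to households
    preserve
        keep village
        duplicates drop
        isid village, sort
        set seed 280382                    // example seed; draw your own at random
        generate double rand = runiform()
        sort rand
        generate treatment = (_n <= _N/2)  // treat half of the clusters
        tempfile clusters
        save `clusters'
    restore
    merge m:1 village using `clusters', nogenerate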
-Because the treatments are assigned in clusters, however, -there are in reality fewer randomized groups than the number of units in the data. -Therefore, standard errors for clustered designs must also be clustered, -at the level at which the treatment was assigned.\sidenote{\url{https://blogs.worldbank.org/impactevaluations/when-should-you-cluster-standard-errors-new-wisdom-econometrics-oracle}} - -Clustered randomization must typically be implemented manually; -it typically relies on subsetting the data intelligently -to the desired assignment levels. -We demonstrate here. - -\codeexample{randtreat-clusters.do}{./code/randtreat-clusters.do} - -%------------------------------------------------ -\section{Power calculations} +\subsection{Power calculations} When we have decided on a practical sampling and randomization design, we next assess its \textbf{power}.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Power_Calculations_in_Stata}} @@ -324,8 +326,6 @@ \section{Power calculations} but they will not answer most of the practical questions that complex experimental designs require. -\subsection{Minimum detectable effect} - To determine the \textbf{minimum detectable effect}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Minimum_Detectable_Effect}} \index{minimum detectable effect} -- the smallest true effect that your design can detect -- @@ -362,11 +362,6 @@ \subsection{Minimum detectable effect} lets us calibrate whether the experiment we propose is realistic given the constraints of the amount of data we can collect. -\codeexample{minimum-detectable-effect.do}{./code/minimum-detectable-effect.do} - - -\subsection{Minimum sample size} - Another way to think about the power of a design is to figure out how many observations you need to include to test various hypotheses -- the \textbf{minimum sample size}. @@ -379,9 +374,6 @@ \subsection{Minimum sample size} and report significance across those groups instead of across variation in the size of the effect. -\codeexample{minimum-sample-size.do}{./code/minimum-sample-size.do} - - Using the concepts of minimum detectable effect and minimum sample size in tandem can help answer a key question that typical approaches to power often do not. @@ -402,4 +394,26 @@ \subsection{Minimum sample size} simulation ensures you will have understood the key questions well enough to report standard measures of power once your design is decided. 
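As a hedged illustration, Stata's built-in \texttt{power} command handles the simplest two-group comparisons (all values shown are hypothetical); simulation remains the more flexible approach for the complex designs described above.

    * Sketch only: effect sizes and sample sizes are hypothetical
    power twomeans 0 0.25, sd(1) power(0.8)     // minimum sample size to detect a 0.25 SD effect
    power twomeans 0, sd(1) n(800) power(0.8)   // minimum detectable effect with 800 observations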
-%------------------------------------------------ +\subsection{Randomization inference} + +% code + +\codeexample{replicability.do}{./code/replicability.do} + +\codeexample{randomization-cf.do}{./code/randomization-cf.do} + +\codeexample{simple-sample.do}{./code/simple-sample.do} + +\codeexample{sample-noise.do}{./code/sample-noise.do} + +\codeexample{randomization-program-1.do}{./code/randomization-program-1.do} + +\codeexample{randomization-program-2.do}{./code/randomization-program-2.do} + +\codeexample{randtreat-strata.do}{./code/randtreat-strata.do} + +\codeexample{randtreat-clusters.do}{./code/randtreat-clusters.do} + +\codeexample{minimum-detectable-effect.do}{./code/minimum-detectable-effect.do} + +\codeexample{minimum-sample-size.do}{./code/minimum-sample-size.do} From 3cfb595461cf3e1aaae12f0d1ded21aa8fd5395b Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 8 Nov 2019 18:08:02 +0530 Subject: [PATCH 113/854] Randomness in Stata --- chapters/sampling-randomization-power.tex | 140 +++++++++++++--------- 1 file changed, 84 insertions(+), 56 deletions(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 24f068323..f8ba7d1dd 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -29,69 +29,97 @@ %----------------------------------------------------------------------------------------------- -\section{General principles of random processes} - -\subsection{Reproducibility in sampling, randomization, and power calculation} - -Reproducibility in statistical programming is absolutely essential.\cite{orozco2018make} -This is especially true when simulating or analyzing random processes, -and sampling, randomization, and power calculation -are the prime examples of these sorts of tasks. -This section is a short introduction to ensuring that code -which generates randomized outputs is reproducible. -There are three key inputs to assuring reproducibility in these processes: +\section{Random processes in Stata} + +Most experimental designs rely directly on random processes, +particularly sampling and randomization, to be executed in code. +The fundamental econometrics behind impact evaluation +depends on establishing that the observations in the sample +and any experimental treatment assignment processes are truly random. +Therefore, understanding and programming for sampling and randomization +is essential to ensuring that planned experiments +are correctly implemented in the field, so that the results +can be interpreted according to the experimental design. +(Note that there are two distinct concepts referred to here by ``randomization'': +the conceptual process of assigning units to treatment arms, +and the technical process of assigning random numbers in statistical software, +which is a part of all tasks that include a random component.\sidenote{ + \url{https://blog.stata.com/2016/03/10/how-to-generate-random-numbers-in-stata/}}) + +Randomization is challenging. It is deeply unintuitive for the human brain. +``True'' randomization is also nearly impossible to achieve for computers, +which are inherently deterministic. There are plenty of sources to read about this.\sidenote{ + \url{https://www.random.org/randomness/}} +For our purposes, we will focus on what you need to understand +in order to produce truly random results for your project using Stata, +and how you can make sure you can get those exact results again in the future. 
This takes a combination of strict rules, solid understanding, and careful programming.
This section introduces the strict rules: these are non-negotiable (but thankfully simple).
The second section provides basic introductions to the tasks of sampling and randomization,
and the third introduces common varieties encountered in the field.
The fourth section discusses more advanced topics that are used
to analyze the random processes directly in order to understand their properties.
However, the needs you will encounter in the field will inevitably
be more complex than anything we present here,
and you will need to recombine these lessons to match your project's needs.

\subsection{Reproducibility in random Stata processes}

Reproducibility in statistical programming means that random results
can be re-obtained at a future time.
All random methods should be reproducible.\cite{orozco2018make}
Stata, like most statistical software, uses a \textbf{pseudo-random number generator}.
Basically, it has a really long ordered list of numbers with the property that
knowing the previous one gives you precisely zero information about the next one.
Stata uses one of these numbers every time it has a task that is non-deterministic.
In ordinary use, it will cycle through these numbers starting from a fixed point
every time you restart Stata, and by the time you get to any given script,
the current state and the subsequent states will be as good as random.\sidenote{
	\url{https://www.stata.com/manuals14/rsetseed.pdf}}
However, for true reproducible randomization, we need two additional properties:
we need to be able to fix the starting point so we can come back to it later;
and we need that starting point to be independently random from our process.
In Stata, this is accomplished through three command concepts:
\textbf{versioning}, \textbf{sorting}, and \textbf{seeding}.

\textbf{Versioning} means using the same version of the software.
If anything is different, the underlying randomization algorithms may have changed,
and it will be impossible to recover the original result.
In Stata, the \texttt{version} command ensures that the software algorithm is fixed.
We recommend using \texttt{version 13.1} for back-compatibility;
the algorithm was changed after Stata 14 but its improvements do not matter in practice.
(Note that you will \textit{never} be able to transfer a randomization to another software such as R.)
The \texttt{ieboilstart} command in \texttt{ietoolkit} provides functionality to support this requirement.\sidenote{
	\url{https://dimewiki.worldbank.org/wiki/ieboilstart}}
We recommend you use \texttt{ieboilstart} at the beginning of your master do-file.\sidenote{
	\url{https://dimewiki.worldbank.org/wiki/Master_Do-files}}
However, note that testing your do-files without running them
via the master do-file may produce different results:
Stata's \texttt{version} expires after execution just like a \texttt{local}.

\textbf{Sorting} means that the actual data that the random process is run on is fixed;
because numbers are assigned to each observation in sequence,
changing their order will change the result of the process.
A corollary is that the underlying data must be unchanged between runs:
you must make a fixed final copy of the data when you run a randomization for fieldwork.
-In Stata, \texttt{isid [id\_variable], sort} will ensure that order is fixed over repeat runs. - -\textbf{Seeding} means manually setting the start-point of the underlying randomization algorithm. -You can draw a standard seed randomly by visiting \url{http://bit.ly/stata-random}. -You will see in the code below that we include the timestamp for verification. -Note that there are two distinct concepts referred to here by ``randomization'': -the conceptual process of assigning units to treatment arms, -and the technical process of assigning random numbers in statistical software, -which is a part of all tasks that include a random component.\sidenote{\url{https://blog.stata.com/2016/03/10/how-to-generate-random-numbers-in-stata/}} -If the randomization seed for the statistical software is not set, -then its pseudorandom algorithm will pick up where it left off. -By setting the seed, you force it to restart from the same point. -In Stata, \texttt{set seed [seed]} will accomplish this. - -The code below code loads and sets up the \texttt{auto.dta} dataset -for any random process. Note the three components: versioning, sorting, and seeding. -Why are \texttt{check1} and \texttt{check3} the same? Why is \texttt{check2} different? - -Commands like \texttt{bys:} and \texttt{merge} will re-sort your data as part of their execution, -To reiterate: any process that includes a random component -is a random process, including sampling, randomization, power calculation, -and many algorithms like bootstrapping. -and other commands may alter the seed without you realizing it.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Randomization_in_Stata}} -Any of these things will cause the output to fail to replicate. -Therefore, each random process should be independently executed -to ensure that these three rules are followed. -Before shipping the results of any random process, +We recommend using \texttt{version 13.1} for back-compatibility; +the algorithm was changed after Stata 14 but its improvements do not matter in practice. +(Note that you will \textit{never} be able to transfer a randomization to another software such as R.) +The \texttt{ieboilstart} command in \texttt{ietoolkit} provides functionality to support this requirement.\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/ieboilstart}} +We recommend, you use \texttt{ieboilstart} at the beginning of your master do-file.\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Master_Do-files}} +However, note that testing your do-files without running them +via the master do-file may produce different resuls: +Stata's \texttt{version} expires after execution just like a \texttt{local}. + +\textbf{Sorting} means that the actual data that the random process is run on is fixed; +because numbers are assigned to each observation in sequence, +changing their order will change the result of the process. +A corollary is that the underlying data must be unchanged between runs: +you must make a fixed final copy of the data when you run a randomization for fieldwork. +In Stata, the only way to guarantee a unique sorting order is to use\texttt{isid [id\_variable], sort}. (The \texttt{sort , stable} command is insufficient.) +You can additional use the \texttt{datasignature} commannd to make sure the data is fixed. + +\textbf{Seeding} means manually setting the start-point of the randomization algorithm. +You can draw a six-digit seed randomly by visiting \url{http://bit.ly/stata-random}. 
+There are many more seeds possible but this is a large enough set for most purposes. +In Stata, \texttt{set seed [seed]} will set the generator to that state. +You should use exactly one seed per randomization process: +what is important is that each of these seeds is truly random. +You will see in the code below that we include the source and timestamp for verification. +Any process that includes a random component is a random process, +including sampling, randomization, power calculation, and algorithms like bootstrapping. +Other commands may induce randomness or alter the seed without you realizing it, +so carefully confirm exactly how your code runs before finalizing it.\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Randomization_in_Stata}} +To confirm that a randomization has worked well before finalizing its results, save the outputs of the process in a temporary location, -re-run the file, and use \texttt{cf \_all using [dataset]} targeting the saved file. -If there are any differences, the process has not reproduced, -and \texttt{cf} will return an error, as shown here. +re-run the code, and use \texttt{cf} or \texttt{datasignature} to ensure nothing has changed. %----------------------------------------------------------------------------------------------- From aee01ff00746a1f51527c5be14b5d3740e29801f Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 8 Nov 2019 18:39:38 +0530 Subject: [PATCH 114/854] Sampling --- chapters/sampling-randomization-power.tex | 62 ++++++++++++++--------- 1 file changed, 38 insertions(+), 24 deletions(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index f8ba7d1dd..06dc19703 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -123,34 +123,48 @@ \subsection{Reproducibility in random Stata processes} %----------------------------------------------------------------------------------------------- -\section{Simple sampling and randomization} +\section{Sampling and randomization} \subsection{Sampling} \textbf{Sampling} is the process of randomly selecting units of observation -\index{sampling} -from a master list for survey data collection.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Sampling_\%26_Power_Calculations}} -This list may be called a \textbf{sampling universe}, a \textbf{listing frame}, or something similar. -We refer to it as a \textbf{master data set}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Master_Data_Set}} -because it is the authoritative source -for the existence and fixed characteristics of each of the units that may be surveyed.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Unit_of_Observation}} -If data collected in the field contradicts the master data, -the master data always dominates -(unless the field data is so inconsistent that a master update is necessary). -Most importantly, the master data set indicates -how many individuals are eligible for sampling and surveying, -and therefore contains statistical information -about the likelihood that each will be chosen. - -The code below demonstrates how to take -a uniform-probability random sample -from a population using the \texttt{sample} command. -More advanced sampling techniques, -such as clustering and stratification, -are in practice identical in implementation -to the randomization section that follows -- -instead of creating a \texttt{treatment} variable, -create a \texttt{sample} variable. 
+from a master list for data collection.\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Sampling_\%26_Power_Calculations}} + \index{sampling} +That master list may be called a \textbf{sampling universe}, a \textbf{listing frame}, or something similar. +We refer to it as a \textbf{master data set}\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Master_Data_Set}} +because it is the authoritative source for the existence and fixed characteristics +of each of the units that may be surveyed.\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Unit_of_Observation}} +The master data set indicates how many individuals are eligible for data collection, +and therefore contains statistical information about the likelihood that each will be chosen. + +The simplest form of random sampling is \textbf{uniform-probability random sampling}. +This means that every observation in the master data set +has an equal probability of being included in the sample. +The most explicit method of implementing this process +is to assign random numbers to all your potential observations, +order them by the number they are assigned, +and mark as `sampled' those with the lowest numbers, to the desired proportion. +(In general, we will talk about sampling proportions rather than numbers of observations. +Sampling specific numbers of observations is complicated and should be avoided, +because it will make the probability of selection very hard to calculate.) +There are a number of shortcuts to doing this process, +but they all use this method as the starting point, +so you should become familiar with exactly how this method works. + +Almost all of the relevant considerations for sampling come from two sources: +deciding what population, if any, a sample is meant to represent (including subgroups); +and deciding that different individuals should have different probabilities +of being included in the sample. +These should be determined in advance by the study design, +since otherwise the sampling process will not be clear, +and the interpretation of measurements is directly linked to who is included in them. +Often, data collection can be designed to keep complications to a minimum, +so long as it are carefully thought through from this perspective. +Ex post changes to the study scope using a sample drawn for a different purpose +usually involve tedious calculations of probabilities and should be avoided. \section{Randomization} From 841eebdcb9809c27577a08d068d50cc86a3dac23 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 8 Nov 2019 20:15:27 +0530 Subject: [PATCH 115/854] Randomization --- chapters/sampling-randomization-power.tex | 58 +++++++++++++---------- 1 file changed, 34 insertions(+), 24 deletions(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 06dc19703..ebcf49d7f 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -169,31 +169,41 @@ \subsection{Sampling} \section{Randomization} \textbf{Randomization} is the process of assigning units to some kind of treatment program. -Many of the Stata techniques shown here can also be used for sampling, -by understanding ``being included in the sample'' as a treatment in itself. 
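A minimal sketch of the explicit random-number method described in the Sampling subsection above might look like this; the ID variable \texttt{hhid}, the seed, and the 20 percent sampling proportion are illustrative assumptions rather than part of the original text.

    * Uniform-probability sampling by the explicit random-number method (sketch)
    isid hhid, sort                     // assumed unique ID in the master data set
    set seed 510347                     // placeholder seed
    gen  random = runiform()            // assign a random number to every unit
    sort random                         // order units by that number
    gen  sample = (_n <= 0.20 * _N)     // mark the lowest 20% of draws as sampled
    isid hhid, sort                     // restore a stable order for later use

The same mechanics produce a \texttt{treatment} variable instead of a \texttt{sample} variable when the task is random assignment rather than random selection.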
-Randomization is used to assign treatment programs in development research
-because it guarantees that, \textit{on average},
+Most of the Stata commands shown for sampling can be directly transferred to randomization,
+since randomization is also a process of splitting a sample into groups.
+Where sampling determines whether a particular individual
+will be observed at all in the course of data collection,
+randomization determines what state each individual will be observed in.
+Randomizing a treatment guarantees that, \textit{on average},
 the treatment will not be correlated with anything it did not cause.\cite{duflo2007using}
-However, as you have just seen,
-any random process induces noise: so does randomization.
-You may get unlucky and create important correlations by chance --
-in fact, you can almost always identify some treatment assignment that
-creates the appearance of statistical relationships that are not really there.
-This section will show you how to assess and control this \textbf{randomization noise}.
-
-To do that, we create a randomization \textbf{program}, which
-\index{programming}
-allows us to re-run the randomization method many times
-and assess the amount of randomization noise correctly.\sidenote{\url{https://data.princeton.edu/stata/programming}}
-Storing the randomization code as a program allows us to access the whole code block
-with a single line of code, so we can tinker with the randomization process
-separately from its application to the data.
-Programming takes a few lines of code that may be new to you,
-but getting used to this structure is very useful.
-A simple randomization program is shown below.
-This code randomizes observations into two groups by combining
-\texttt{xtile} and \texttt{recode},
-which can be extended to any proportions for any number of arms.
+Causal inference from randomization therefore depends on a specific counterfactual:
+that the units who received the treatment program might not have done so.
+Therefore, controlling the exact probability that each individual receives treatment
+is the most important part of a randomization process,
+and must be carefully worked out in more complex designs.
+
+Just like sampling, the simplest form of randomization is a uniform-probability process.\sidenote{
+	\url{https://dimewiki.worldbank.org/wiki/Randomization_in_Stata}}
+Sampling typically has only two possible outcomes: observed and unobserved.
+Randomization, by contrast, often involves multiple possible results,
+each of which represents a different treatment to be delivered;
+in some cases, multiple treatment assignments are intended to overlap in the same sample.
+Complexity can therefore grow very quickly in randomization
+and it is doubly important to fully understand the conceptual process
+that is described in the experimental design,
+and fill in any gaps in the process before implementing it in Stata.
+
+Some types of experimental designs necessitate that randomization be done live in the field.
+It is possible to do this using survey software or live events.
+These methods typically do not leave a record of the randomization,
+so particularly when the experiment is electronic,
+it is best to execute the randomization in advance and preload the results.
+Even when randomization absolutely cannot be done in advance, it is still useful
+to build a corresponding model of the randomization process in Stata
+so that you can conduct statistical analysis later,
+including checking for irregularities in the field assignment.
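To make the uniform-probability assignment described above concrete, a minimal sketch for three equally sized arms follows; the ID variable, seed, and number of arms are placeholders, and the \texttt{xtile}-based split is one simple way to divide the ordered random draws.

    * Uniform-probability randomization into three equal arms (sketch)
    isid hhid, sort                   // fix the sort order on an assumed unique ID
    set seed 654321                   // placeholder seed
    gen   random = runiform()
    xtile group  = random, nq(3)      // split the random draws into three equal groups
    gen   treatment = group - 1       // 0 = control, 1 = first arm, 2 = second arm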
+Understanding that process will also improve the ability of the team +to ensure that the field randomization process is appropriately designed and executed. %----------------------------------------------------------------------------------------------- From 2db7e145e15e92af6578ea759b16cd548e6f9321 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 8 Nov 2019 20:46:46 +0530 Subject: [PATCH 116/854] Clustering --- chapters/sampling-randomization-power.tex | 59 ++++++++++++++++------- 1 file changed, 41 insertions(+), 18 deletions(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index ebcf49d7f..3665eac78 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -209,28 +209,51 @@ \section{Randomization} \section{Clustering and stratification} -To control randomization noise, we often use techniques -\index{clustering}\index{stratification} -that reduce the likelihood of a ``bad draw''.\cite{athey2017econometrics} -These techniques can be used in any random process, -including sampling; their implementation is nearly identical in code. +For a variety of experimental and theoretical reasons, +the actual sampling and randomization processes we need to perform +are rarely as straightforward as a uniform-probability draw. +We may only be able to implement treatment on a certain group of units +(such as a school, a firm, or a market) +or we may want to ensure that minority groups appear +in either our sample or in specific treatment groups. +The most common methods used in real studies are \textbf{clustering} and \textbf{stratification}. +They allow us to control the randomization process with high precision, +which is often necessary for appropriate inference, +particularly when samples or subgroups are small.\cite{athey2017econometrics} +These techniques can be used in any random process; +their implementation is nearly identical in both sampling and randomization. \subsection{Clustering} -Many studies collect data at a different level of observation than the randomization unit.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Unit_of_Observation}} +Many studies collect data at a different level of observation than the randomization unit.\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Unit_of_Observation}} For example, a policy may only be able to affect an entire village, -but you are interested in household behavior. -This type of structure is called \textbf{clustering},\sidenote{\url{https://dimewiki.worldbank.org/wiki/Multi-stage_(Cluster)_Sampling}} -because the units are assigned to treatment in clusters. -Because the treatments are assigned in clusters, however, -there are in reality fewer randomized groups than the number of units in the data. -Therefore, standard errors for clustered designs must also be clustered, -at the level at which the treatment was assigned.\sidenote{\url{https://blogs.worldbank.org/impactevaluations/when-should-you-cluster-standard-errors-new-wisdom-econometrics-oracle}} - -Clustered randomization must typically be implemented manually; -it typically relies on subsetting the data intelligently -to the desired assignment levels. -We demonstrate here. +but the study is interested in household behavior. +This type of structure is called \textbf{clustering},\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Multi-stage_(Cluster)_Sampling}} +and the groups in which units are assigned to treatment are called clusters. 
+The same principle extends to sampling: +it may be infeasible to decide whether to test individual children +within a single classroom, for example. + +Clustering is procedurally straightforward in Stata, +although it typically needs to be performed manually. +To cluster sampling or randomization, +\texttt{preserve} the data, keep one observation from each cluster +using a command like \texttt{bys [cluster] : keep if _n == 1}. +Then sort the data and set the seed, and generate the random assignment you need. +Save the assignment in a separate dataset or a \texttt{tempfile}, +then \texttt{restore} and \texttt{merge} the assignment back on to the original dataset. + +When sampling or randomization is conducted using clusters, +the clustering variable should be clearly identified +since it will need to be used in subsequent statistical analysis. +Namely, standard errors for these types of designs must be clustered +at the level at which the randomization was clustered.\sidenote{ + \url{https://blogs.worldbank.org/impactevaluations/when-should-you-cluster-standard-errors-new-wisdom-econometrics-oracle}} +This accounts for the design covariance within the cluster -- +the information that if one individual was observed or treated there, +the other members of the clustering group were as well. \subsection{Stratification} From 493880f7c6ec5805d590fd88d03193870795bf08 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 8 Nov 2019 21:17:00 +0530 Subject: [PATCH 117/854] Stratification --- chapters/sampling-randomization-power.tex | 65 ++++++++++++++--------- 1 file changed, 41 insertions(+), 24 deletions(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 3665eac78..c6d4e9429 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -220,6 +220,7 @@ \section{Clustering and stratification} They allow us to control the randomization process with high precision, which is often necessary for appropriate inference, particularly when samples or subgroups are small.\cite{athey2017econometrics} +(By contrast, re-randomizing or resampling are never appropriate for this.) These techniques can be used in any random process; their implementation is nearly identical in both sampling and randomization. @@ -257,31 +258,47 @@ \subsection{Clustering} \subsection{Stratification} - -We mean this in a specific way: we want to exclude -randomizations with certain correlations, -or we want to improve the \textbf{balance} -of the average randomization draw.\cite{bruhn2009pursuit} -The most common is \textbf{stratification},\sidenote{\url{https://dimewiki.worldbank.org/wiki/Stratified_Random_Sample}} -which splits the sampling frame into ``similar'' groups -- \textbf{strata} -- -and randomizes \textit{within} each of these groups. -It is important to preserve the overall likelihood for each unit to be included, -otherwise statistical corrections become necessary. -For a simple stratified randomization design, -it is necessary to include strata \textbf{fixed effects}, -or an indicator variable for each strata, in any final regression. 
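A minimal sketch of the clustered-assignment recipe described in the Clustering subsection above is shown here; the cluster identifier \texttt{village\_id}, the seed, and the outcome name in the final comment are illustrative assumptions.

    * Cluster-level assignment merged back onto the full dataset (sketch)
    preserve
        bys village_id : keep if _n == 1      // one observation per cluster
        isid village_id, sort
        set seed 215389                       // placeholder seed
        gen random    = runiform()
        sort random
        gen treatment = (_n <= _N/2)          // assign half of the clusters to treatment
        keep village_id treatment
        tempfile clusters
        save `clusters'
    restore
    merge m:1 village_id using `clusters', nogen
    * Standard errors should later be clustered at the assignment level, for example:
    * regress outcome treatment, vce(cluster village_id)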
+\texttt{Stratification} is a study design component +that breaks the full set of observations into a number of subgroups +before performing randomization within each subgroup.\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Stratified_Random_Sample}} +This has the effect of ensuring that members of each subgroup +are included in all groups of the randomization process, +since it is possible that a global randomization +would put all the members of a subgroup into just one of the outcomes. +In this context, the subgroups are called \textbf{strata}. + +Manually implementing stratified randomization in Stata is prone to error. +In particular, it is difficult to precisely account +for the interaction of strata sizes with multiple treatment arms. +Even for a very simple design, the method of randomly ordering the observations +will often create very skewed assignments. +This is especially true when a given stratum contains a small number of clusters, +and when there are a large number of treatment arms, +since the strata will rarely be exactly divisible by the number of arms.\cite{carril2017dealing} +The user-written \texttt{randtreat} command properly implements stratification. +However, the options and outputs (including messages) from the command should be carefully reviewed +so that you understand exactly what has been implemented. +Notably, it is extremely hard to target precise numbers of observations +in stratified designs, because exact allocations are rarely round fractions +and the process of assigning the leftover ``misfit'' observations +imposes an additional layer of randomization above the specified division. + +Whenever stratification is used for randomization, +the analysis of differences within the strata (especially treatment effects) +requires a control in the form of an indicator variable for all strata (fixed effects). This accounts for the fact that randomizations were conducted within the strata, -by comparing each unit to the others within its own strata. - -However, manually implementing stratified randomization -is prone to error: in particular, it is difficult to precisely account -for the interaction of group sizes and multiple treatment arms, -particularly when a given strata can contain a small number of clusters, -and when there are a large number of treatment arms.\cite{carril2017dealing} -The user-written \texttt{randtreat} command -properly implements stratification, -with navigable options for handling common pitfalls. -We demonstrate the use of this command here. +comparing units to the others within its own strata by correcting for the local mean. +Stratification is typically used for sampling +in order to ensure that individuals with various types will be observed; +no adjustments are necessary as long as the sampling proportion is constant across all strata. +One common pitfall is to vary the sampling or randomization \textit{probability} +across different strata (such as ``sample/treat all female heads of household''). +If this is done, you must calculate and record the exact probability +of inclusion for every unit, and re-weight observations accordingly. +The exact formula depends on the analysis being performed, +but is usually related to the inverse of the likelihood of inclusion. 
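As a sketch of how the user-written \texttt{randtreat} command might be applied to a stratified design like the one described above: the strata variables, number of arms, and misfit rule here are illustrative assumptions, and the option names should be confirmed against the command's help file before use.

    * Stratified assignment to one control and two treatment arms (sketch)
    * Assumes randtreat is installed: ssc install randtreat
    isid hhid, sort
    set seed 979346
    randtreat, generate(treatment) multiple(3) strata(district gender) misfits(global)
    tab treatment district, missing       // review the allocation and any misfit handling
    * Analysis should control for strata fixed effects, for example:
    * regress outcome i.treatment i.district i.gender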
+ %----------------------------------------------------------------------------------------------- From 0f23c3ee1fc5d9c31b933c1b83514ebd9db4fcc1 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Sat, 9 Nov 2019 11:36:34 +0530 Subject: [PATCH 118/854] Power calculation and randomization inference intro --- chapters/sampling-randomization-power.tex | 171 ++++++++++++---------- 1 file changed, 91 insertions(+), 80 deletions(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index c6d4e9429..dd238567a 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -295,7 +295,8 @@ \subsection{Stratification} One common pitfall is to vary the sampling or randomization \textit{probability} across different strata (such as ``sample/treat all female heads of household''). If this is done, you must calculate and record the exact probability -of inclusion for every unit, and re-weight observations accordingly. +of inclusion for every unit, and re-weight observations accordingly.\sidenote{ + \url{http://blogs.worldbank.org/impactevaluations/tools-of-the-trade-when-to-use-those-sample-weights}} The exact formula depends on the analysis being performed, but is usually related to the inverse of the likelihood of inclusion. @@ -304,86 +305,37 @@ \subsection{Stratification} \section{Power calculation and randomization inference} -The fundamental contribution of sampling to the power of a research design is this: -if you randomly sample a set number of observations from a set frame, -there are a large -- but fixed -- number of sample sets which you may draw.\sidenote{\url{https://davegiles.blogspot.com/2019/04/what-is-permutation-test.html}} -From any large group, you can find some possible sample sets -that have higher-than-average values of some measure; -similarly, you can find some sets that have lower-than-average values. -The variation of these values across the range of all possible sample sets is what creates -\textbf{sampling noise}, the uncertainty in statistical estimates caused by sampling. -\index{sampling noise} +Both sampling and randomization are noisy processes: +they are random, after all, so it is impossible to predict the result in advance. +By design, we know that the exact choice of sample or treatment +will be uncorrelated with our key outcomes, +but this lack of correlation is only true ``in expectation'' -- +that is, across a large number of randomizations. +In any \textit{particular} randomization, +the correlation between the sampling or randomization and the outcome variable +is guaranteed to be \textit{nonzero}: +this is called the \textbf{in-sample} or \textbf{finite-sample correlation}. + +Since we know that the true correlation +(over the ``population'' of potential samples or randomizations) +is zero, we think of the observed correlation as an \textbf{error}. +In sampling, we call this the \textbf{sampling error}, +and it is defined as the difference between the true population parameter +and the observed mean due to chance selection of units.\sidenote{ + \url{https://economistjourney.blogspot.com/2018/06/what-is-sampling-noise.html}} +In randomization, we call this the \textbf{randomization noise}, +and define it as the difference between the true treatment effect +and the estimated effect due to placing units in groups. 
+The intuition for both measures is that from any group,
+you can find some possible subsets that have higher-than-average values of some measure;
+similarly, you can find some that have lower-than-average values.
+Your sample or randomization will inevitably fall in one of these categories,
+and we need to assess the likelihood and magnitude of this occurrence.\sidenote{
+	\url{https://davegiles.blogspot.com/2019/04/what-is-permutation-test.html}}
+Power calculation and randomization inference are the two key tools for doing so.
+(Going forward, this section will use ``randomization'' to refer to the whole process
+of sampling and randomization: the relevant study design will often include both.)
 
-\subsection{Sampling error and randomization noise}
-
-Portions of this noise can be reduced through design choices
-such as clustering and stratification.
-In general, all sampling requires \textbf{inverse probability weights}.
-These are conceptually simple in that the weights for each individual must be precisely the inverse of the probability
-with which that individual is included in the sample, but may be practically difficult to calculate for complex methods.
-When the sampling probability is uniform, all the weights are equal to one.
-Sampling can be structured such that subgroups are guaranteed to appear in a sample:
-that is, you can pick ``half the level one facilities and half the level two facilities'' instead of
-``half of all facilities''. The key here is that, \textit{for each facility},
-the probability of being chosen remains the same -- 0.5.
-By contrast, a sampling design that chooses unbalanced proportions of subgroups
-has changed the probability that a given individual is included in the sample,
-and needs to be reweighted in case you want to calculate overall average statistics.\sidenote{\url{http://blogs.worldbank.org/impactevaluations/tools-of-the-trade-when-to-use-those-sample-weights}}
-
-The sampling noise in the process that we choose
-determines the size of the confidence intervals
-for any estimates generated from that sample.\sidenote{\url{https://economistjourney.blogspot.com/2018/06/what-is-sampling-noise.html}}
-In general, for any underlying distribution,
-the Central Limit Theorem implies that
-the distribution of variation across the possible samples is exactly normal.
-Therefore, we can use what are called \textbf{asymptotic standard errors}
-to express how far away from the true population parameters our estimates are likely to be.
-These standard errors can be calculated with only two datapoints:
-the sample size and the standard deviation of the value in the chosen sample.
-The code below illustrates the fact that sampling noise
-has a distribution in the sense that some actual executions of the sample
-give estimation results far from the true value,
-and others give results close to it.
-
-The output of the code is a distribution of means in sub-populations of the overall data.
-This distribution is centered around the true population mean,
-but its dispersion depends on the exact structure of the population.
-We use an estimate of the population variation taken from the sample
-to assess how far away from that true mean any given sample draw is:
-essentially, we estimate the properties of the distribution you see now.
-With that estimate, we can quantify the uncertainty in estimates due to sampling noise, -calculate precisely how far away from the true mean -our sample-based estimate is likely to be, -and report that as the standard error of our point estimates. -The interpretation of, say, a 95\% \textbf{confidence interval} -\index{confidence interval} -in this context is that, conditional on our sampling strategy, -we would anticipate that 95\% of future samples from the same distribution -would lead to parameter estimates in the indicated range. -This approach says nothing about the truth or falsehood of any hypothesis. - -With this program created and executed, -the next part of the code, shown below, -can set up for reproducibility. -Then it will call the randomization program by name, -which executes the exact randomization process we programmed -to the data currently loaded in memory. -Having pre-programmed the exact randomization does two things: -it lets us write this next code chunk much more simply, -and it allows us to reuse that precise randomization as needed. -Specifically, the user-written \texttt{ritest} command\sidenote{\url{http://hesss.org/ritest.pdf}} -\index{randomization inference} -allows us to execute a given randomization program repeatedly, -visualize how the randomization might have gone differently, -and calculate alternative p-values against null hypotheses. -These \textbf{randomization inference}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Randomization\_Inference}} significance levels may be very different -than those given by asymptotic confidence intervals, -particularly in small samples (up to several hundred clusters). - -After generating the ``true'' treatment assignment, -\texttt{ritest} illustrates the distribution of correlations -that randomization can spuriously produce -between \texttt{price} and \texttt{treatment}. \subsection{Power calculations} @@ -486,6 +438,65 @@ \subsection{Power calculations} simulation ensures you will have understood the key questions well enough to report standard measures of power once your design is decided. +\subsection{Sampling error and randomization noise} + + +The sampling noise in the process that we choose +determines the size of the confidence intervals +for any estimates generated from that sample. +In general, for any underlying distribution, +the Central Limit Theorem implies that +the distribution of variation across the possible samples is exactly normal. +Therefore, we can use what are called \textbf{asymptotic standard errors} +to express how far away from the true population parameters our estimates are likely to be. +These standard errors can be calculated with only two datapoints: +the sample size and the standard deviation of the value in the chosen sample. +The code below illustrates the fact that sampling noise +has a distribution in the sense that some actual executions of the sample +give estimation results far from the true value, +and others give results close to it. + +The output of the code is a distribution of means in sub-populations of the overall data. +This distribution is centered around the true population mean, +but its dispersion depends on the exact structure of the population. +We use an estimate of the population variation taken from the sample +to assess how far away from that true mean any given sample draw is: +essentially, we estimate the properties of the distribution you see now. 
+With that estimate, we can quantify the uncertainty in estimates due to sampling noise, +calculate precisely how far away from the true mean +our sample-based estimate is likely to be, +and report that as the standard error of our point estimates. +The interpretation of, say, a 95\% \textbf{confidence interval} +\index{confidence interval} +in this context is that, conditional on our sampling strategy, +we would anticipate that 95\% of future samples from the same distribution +would lead to parameter estimates in the indicated range. +This approach says nothing about the truth or falsehood of any hypothesis. + +With this program created and executed, +the next part of the code, shown below, +can set up for reproducibility. +Then it will call the randomization program by name, +which executes the exact randomization process we programmed +to the data currently loaded in memory. +Having pre-programmed the exact randomization does two things: +it lets us write this next code chunk much more simply, +and it allows us to reuse that precise randomization as needed. +Specifically, the user-written \texttt{ritest} command\sidenote{\url{http://hesss.org/ritest.pdf}} +\index{randomization inference} +allows us to execute a given randomization program repeatedly, +visualize how the randomization might have gone differently, +and calculate alternative p-values against null hypotheses. +These \textbf{randomization inference}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Randomization\_Inference}} significance levels may be very different +than those given by asymptotic confidence intervals, +particularly in small samples (up to several hundred clusters). + +After generating the ``true'' treatment assignment, +\texttt{ritest} illustrates the distribution of correlations +that randomization can spuriously produce +between \texttt{price} and \texttt{treatment}. + + \subsection{Randomization inference} % code From 0c2fb1da7a9f70836fd1d3ef81e9b1e260afbd26 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Sat, 9 Nov 2019 12:09:45 +0530 Subject: [PATCH 119/854] Power calculation --- chapters/sampling-randomization-power.tex | 120 ++++++++-------------- 1 file changed, 40 insertions(+), 80 deletions(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index dd238567a..6f71f01fd 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -336,87 +336,50 @@ \section{Power calculation and randomization inference} (Going forward, this section will use ``randomization'' to refer to the whole process of sampling and randomization: the relevant study design will often include both.) - \subsection{Power calculations} -When we have decided on a practical sampling and randomization design, -we next assess its \textbf{power}.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Power_Calculations_in_Stata}} -\index{power} -Statistical power can be described in a few ways, -each of which has different uses.\sidenote{\url{http://www.stat.columbia.edu/~gelman/stuff_for_blog/chap20.pdf}} -The purpose of power calculations is not to -demonstrate that a study is ``strong'', -but rather to identify where the strengths and weaknesses -of your design are located, so that readers -can correctly assess the evidentiary value of -any results (or null results) in the analysis. 
-This should be done before going to the field, -across the entire range of research questions -your study might try to answer, -so you know the relative tradeoffs you will face -by changing your sampling and randomization schemes -and can select your final strategy appropriately. - -The classic definition of power is -``the likelihood that your design detects a significant treatment effect, -given that there is a non-zero true effect in reality''. -Here we will look at two useful practical applications -of that definition and show what quantitative results can be obtained. -We suggest doing all power calculations by simulation; -you are very unlikely to be able to determine analytically -the power of your study unless you have a very simple design. -Stata has some commands that can calculate power for +Power calculations report the likelihood that your experimental design +will be able to detect the treatment effects you are interested in. +This measure of \textbf{power} can be described in various different ways, +each of which has different practical uses.\sidenote{ + \url{http://www.stat.columbia.edu/~gelman/stuff_for_blog/chap20.pdf}} +The purpose of power calculations is to identify where the strengths and weaknesses +of your design are located, so you know the relative tradeoffs you will face +by changing your randomization schemes so can select a final design appropriately. +They also allow realistic interpretations of evidence: +results low-power studies can be very interesting, +but they have a correspondingly higher likelihood +of reporting false positive results. + +The classic definition of power +is the likelihood that a design detects a significant treatment effect, +given that there is a non-zero true effect in reality. +This definition is useful retrospectively, +but it can also be re-interpreted to help in experimental design. +There are two common and useful practical applications +of that definition that give actionable, quantitative results. +The \textbf{minimum detectable effect (MDE)}\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Minimum_Detectable_Effect}} +is the smallest true effect that a given study design can detect. +This is useful as a check on whether a study is worthwhile. +If, in your field, a ``large'' effect is just a few percentage points +or a fraction of a standard deviation, +then it is nonsensical to run a study whose MDE is much larger than that. +Conversely, the \textbf{minimum sample size} pre-specifies expected effects +and tells you how large a study would need to be to detect that effect. + +Stata has some commands that can calculate power analytically for very simple designs -- \texttt{power} and \texttt{clustersampsi} -- but they will not answer most of the practical questions -that complex experimental designs require. - -To determine the \textbf{minimum detectable effect}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Minimum_Detectable_Effect}} -\index{minimum detectable effect} --- the smallest true effect that your design can detect -- -conduct a simulation for your actual design. -The structure below uses fake data, -but you should use real data whenever it is available, +that complex experimental designs require.\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Power_Calculations_in_Stata}} +We suggest doing more advanced power calculations by simulation, +since the interactions of experimental design, +sampling and randomization, +clustering, stratification, and treatment arms +quickly becomes very complex. 
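For the very simple analytical cases that the built-in commands can handle, a one-line check is often still worthwhile before building a simulation; the effect size and assumptions below are purely illustrative.

    * Analytical minimum sample size for a simple two-group comparison (illustrative)
    power twomeans 0 0.2, sd(1) power(0.8) alpha(0.05)    // detect a 0.2 SD effect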
+Furthermore, you should use real data whenever it is available, or you will have to make assumptions about the distribution of outcomes. -If you are willing to make even more assumptions, -you can use one of the built-in functions.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Power_Calculations_in_Stata}} - -Here, we use an outer loop to vary the size of the assumed treatment effect, -which is later used to simulate outcomes in a ``true'' -data-generating process (DGP). -The data generating process is written similarly to a -regression model, but it is a separate step. -A data generating process is the ``truth'' -that the regression model is trying to estimate. -If our regression results are close to the DGP, -then the regression is ``good'' in the sense we care about. -For each of 100 runs indexed by \texttt{i}, -we ask the question: If this DGP were true, -would our design have detected it in this draw? -We run our planned regression model including all controls and store the result, -along with an indicator of the effect size we assumed. - -When we have done this 100 times for each effect size we are interested in, -we have built a large matrix of regression results. -That can be loaded into data and manipulated directly, -where each observation represents one possible randomization result. -We flag all the runs where the p-value is significant, -then visualize the proportion of significant results -from each assumed treatment effect size. -Knowing the design's sensitivity to a variety of effect sizes -lets us calibrate whether the experiment we propose -is realistic given the constraints of the amount of data we can collect. - -Another way to think about the power of a design -is to figure out how many observations you need to include -to test various hypotheses -- the \textbf{minimum sample size}. -This is an important practical consideration -when you are negotiating funding or submitting proposals, -as it may also determine the number of treatment arms -and types of hypotheses you can test. -The basic structure of the simulation is the same. -Here, we use the outer loop to vary the sample size, -and report significance across those groups -instead of across variation in the size of the effect. Using the concepts of minimum detectable effect and minimum sample size in tandem can help answer a key question @@ -438,8 +401,7 @@ \subsection{Power calculations} simulation ensures you will have understood the key questions well enough to report standard measures of power once your design is decided. -\subsection{Sampling error and randomization noise} - +\subsection{Randomization inference} The sampling noise in the process that we choose determines the size of the confidence intervals @@ -497,8 +459,6 @@ \subsection{Sampling error and randomization noise} between \texttt{price} and \texttt{treatment}. 
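A skeleton for the simulation-based approach recommended above is sketched here. It assumes a prepared sampling frame saved as \texttt{frame.dta} with a baseline outcome \texttt{y0}; the file name, variable names, candidate effect sizes, and number of repetitions are all placeholder assumptions.

    * Simulation-based power calculation (skeleton under assumed placeholders)
    clear all
    set seed 122291                                  // placeholder seed
    tempname memhold
    tempfile sims
    postfile `memhold' effect b p using `sims', replace
    forvalues i = 1/500 {
        foreach effect in 0.05 0.10 0.15 {           // candidate true effects, in units of y0
            quietly {
                use "frame.dta", clear
                gen random    = runiform()
                sort random
                gen treatment = (_n <= _N/2)         // re-draw the random assignment
                gen y         = y0 + `effect'*treatment
                regress y treatment
                post `memhold' (`effect') (_b[treatment]) ///
                    (2*ttail(e(df_r), abs(_b[treatment]/_se[treatment])))
            }
        }
    }
    postclose `memhold'
    use `sims', clear
    gen significant = (p < 0.05)
    collapse (mean) power = significant, by(effect)  // share of significant draws = power
    list

Reading the resulting table across effect sizes shows directly how sensitive the proposed design is, which is the information the minimum detectable effect and minimum sample size summarize.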
-\subsection{Randomization inference} - % code \codeexample{replicability.do}{./code/replicability.do} From 93c6773c8025f1c7a71c9ef0a3862a93652c69ef Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Sat, 9 Nov 2019 12:30:11 +0530 Subject: [PATCH 120/854] Randomization inference --- chapters/sampling-randomization-power.tex | 95 ++++++++++------------- 1 file changed, 42 insertions(+), 53 deletions(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 6f71f01fd..ac48d2986 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -333,8 +333,6 @@ \section{Power calculation and randomization inference} and we need to assess the likelihood and magnitude of this occurence.\sidenote{ \url{https://davegiles.blogspot.com/2019/04/what-is-permutation-test.html}} Power calculation and randomization inference are the two key tools to doing so. -(Going forward, this section will use ``randomization'' to refer to the whole process -of sampling and randomization: the relevant study design will often include both.) \subsection{Power calculations} @@ -403,61 +401,52 @@ \subsection{Power calculations} \subsection{Randomization inference} -The sampling noise in the process that we choose -determines the size of the confidence intervals -for any estimates generated from that sample. -In general, for any underlying distribution, -the Central Limit Theorem implies that -the distribution of variation across the possible samples is exactly normal. -Therefore, we can use what are called \textbf{asymptotic standard errors} -to express how far away from the true population parameters our estimates are likely to be. -These standard errors can be calculated with only two datapoints: -the sample size and the standard deviation of the value in the chosen sample. -The code below illustrates the fact that sampling noise -has a distribution in the sense that some actual executions of the sample -give estimation results far from the true value, -and others give results close to it. - -The output of the code is a distribution of means in sub-populations of the overall data. -This distribution is centered around the true population mean, -but its dispersion depends on the exact structure of the population. -We use an estimate of the population variation taken from the sample -to assess how far away from that true mean any given sample draw is: -essentially, we estimate the properties of the distribution you see now. -With that estimate, we can quantify the uncertainty in estimates due to sampling noise, -calculate precisely how far away from the true mean -our sample-based estimate is likely to be, -and report that as the standard error of our point estimates. -The interpretation of, say, a 95\% \textbf{confidence interval} -\index{confidence interval} -in this context is that, conditional on our sampling strategy, -we would anticipate that 95\% of future samples from the same distribution -would lead to parameter estimates in the indicated range. -This approach says nothing about the truth or falsehood of any hypothesis. - -With this program created and executed, -the next part of the code, shown below, -can set up for reproducibility. -Then it will call the randomization program by name, -which executes the exact randomization process we programmed -to the data currently loaded in memory. 
-Having pre-programmed the exact randomization does two things: -it lets us write this next code chunk much more simply, -and it allows us to reuse that precise randomization as needed. -Specifically, the user-written \texttt{ritest} command\sidenote{\url{http://hesss.org/ritest.pdf}} -\index{randomization inference} -allows us to execute a given randomization program repeatedly, +Randomization inference is used to analyze the likelihood +that the randomization process, by chance, +would have created a false treatment effect as large as the one you observed. +Randomization inference is a generalization of placebo tests, +because it considers what the estimated results would have been +from a randomization that did not in fact happen in reality. +Randomization inference is particularly important +in quasi-experimental designs and in small samples, +because these conditions usually lead to the situation +where the number of possible \textit{randomizations} is itself small. +In those cases, we cannot rely on the usual conclusiong +(a consequence of the Central Limit Theorem) +that the variance of the treatment effect estimate is normal, +and we therefore cannot use the ``asymptotic'' standard errors from Stata. + +Instead, we directly simulate a large variety of possible alternative randomizations. +Specifically, the user-written \texttt{ritest} command\sidenote{ + \url{http://hesss.org/ritest.pdf}} +allows us to execute a given randomization repeatedly, visualize how the randomization might have gone differently, -and calculate alternative p-values against null hypotheses. -These \textbf{randomization inference}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Randomization\_Inference}} significance levels may be very different +and calculate empirical p-values for the effect size in our sample. +After analyzing the actual treatment assignment, +\texttt{ritest} illustrates the distribution of false correlations +that this randomization approach can produce by chance +between outcomes and treatments. +The randomization-inference p-value is the number of times +that a false effect was larger than the one you measured; +interpretable as the probability that a program with no effect +would have given you a result like the one actually observed. +These randomization inference\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Randomization\_Inference}} +significance levels may be very different than those given by asymptotic confidence intervals, particularly in small samples (up to several hundred clusters). -After generating the ``true'' treatment assignment, -\texttt{ritest} illustrates the distribution of correlations -that randomization can spuriously produce -between \texttt{price} and \texttt{treatment}. - +Randomization inference can therefore be used proactively during experimental design. +As long as there is some outcome data usable at this stage, +you can use the same procedure to examine the potential treatment effects +that your exact design is likely to produce. +The range of these effect, again, may be very different +from those predicted by standard approaches to power calculation, +and randomization inference futher allows visual inspection of results. +If there is significant heaping at particular result levels, +or results seem to depend dramatically on the placement of a small number of individuals, +randomization inference will flag those issues before the experiment is fielded +and allow adjustments to the design to be made. 
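A minimal sketch of this workflow with \texttt{ritest} follows; the outcome and treatment variable names, the seed, and the number of repetitions are illustrative, and clustered or stratified designs should pass the corresponding \texttt{cluster()} and \texttt{strata()} options so that the permutations respect the original design.

    * Randomization inference on a completed assignment (sketch)
    * Assumes ritest is installed: ssc install ritest
    ritest treatment _b[treatment], reps(1000) seed(791203): ///
        regress outcome treatment
    * The reported p-value is the share of permuted assignments that produce
    * an estimated effect at least as large as the one actually observed.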
% code From 245add39f5c2155a6dad2375d9e219cedacaae3a Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Sat, 9 Nov 2019 12:54:43 +0530 Subject: [PATCH 121/854] Sampling/randomization intro --- chapters/sampling-randomization-power.tex | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index ac48d2986..92963719e 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -125,6 +125,26 @@ \subsection{Reproducibility in random Stata processes} \section{Sampling and randomization} +The sampling and randomization processes that we choose +play an important role in determining the size of the confidence intervals +for any estimates generated from that sample, +and therefore our ability to draw conclusions with confidence. +If you randomly sample or assign a set number of observations from a set frame, +there are a large -- but fixed -- number of permutations which you may draw. +In reality, you have to work with exactly one of them, +so we put a lot of effort into making sure that one is a good one, +by reducing the probability that we observe nonexistent, or ``spurious'', results. +In large studies, we can use what are called \textbf{asymptotic standard errors} +to express how far away from the true population parameters our estimates are likely to be. +These standard errors can be calculated with only two datapoints: +the sample size and the standard deviation of the value in the chosen sample. +They are also typically the best case scenario for the population given the data structure. +In small studies, such as those we often see in development, +we have to be much more careful, particularly about practical considerations +such as determining the representative population +and fitting any constraints on study and sample design. +This section introduces universal basic principles for sampling and randomization. + \subsection{Sampling} \textbf{Sampling} is the process of randomly selecting units of observation From bdb7487afd73d7978edc8873e97d82135bfbeda9 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Sat, 9 Nov 2019 12:57:45 +0530 Subject: [PATCH 122/854] Randomization (#239) --- chapters/sampling-randomization-power.tex | 3 +++ 1 file changed, 3 insertions(+) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 92963719e..e5d7d7fdd 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -129,6 +129,9 @@ \section{Sampling and randomization} play an important role in determining the size of the confidence intervals for any estimates generated from that sample, and therefore our ability to draw conclusions with confidence. +(Note that random sampling and random assignment serve different purposes: +random sampling ensures that you have unbiased population estimates, +and random assignment ensures that you have unbiased treatment estimates.) If you randomly sample or assign a set number of observations from a set frame, there are a large -- but fixed -- number of permutations which you may draw. 
In reality, you have to work with exactly one of them, From 3bc400ddd0ff0475962cbdafc4abf909c358d2ca Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Sat, 9 Nov 2019 13:44:52 +0530 Subject: [PATCH 123/854] Rewrite chapter introduction (#232) --- chapters/sampling-randomization-power.tex | 46 ++++++++++++++++------- 1 file changed, 32 insertions(+), 14 deletions(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index e5d7d7fdd..907d8006d 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -1,30 +1,48 @@ %----------------------------------------------------------------------------------------------- \begin{fullwidth} -Sampling, randomization, and power calculations are the core elements of experimental design. -\textbf{Sampling} and \textbf{randomization} determine -which units are observed and in which states. -Each of these processes introduces statistical noise +Sampling and randomization are two core elements of study design. +In experimental methods, sampling and randomization directly determine +the set of individuals who are going to be observed +and what their status will be for the purpose of effect estimation. +Since we only get one chance to implement a given experiment, +we need to have a detailed understanding of how these processes work +and how to implment them properly. +This allows us to ensure the field reality corresponds well to our experimental design. +In quasi-experimental methods, +sampling determines what populations the study +will be able to make meaningful inferences about, +and randomization analyses simulate counterfactual possibilities +if the events being studied had happened differently. +These needs are particularly important in the intial phases of development studies -- +typically conducted well before any actual fieldwork occurs -- +and often have implications for planning and budgeting. + +Power calculations and randomization inference methods +give us the tools to critically and quantitatively assess different +sampling and randomization designs in light of our theories of impact +and to make optimal choices when planning studies. +All random processes introduce statistical noise or uncertainty into the final estimates of effect sizes. Sampling noise produces some probability of -selection of units to measure that will produce significantly wrong estimates, and +selection of units to measure that will produce incorrect estimates, and randomization noise produces some probability of placement of units into treatment arms that does the same. -Power calculation is the method by which these probabilities of error are meaningfully assessed. -Good experimental design has high \textbf{power} -- a low likelihood that these noise parameters -will meaningfully affect estimates of treatment effects. +Power calculation and randomization inference +are the main methods by which these probabilities of error are assessed. +Good experimental design has high \textbf{power} -- a low likelihood that this noise +will substantially affect estimates of treatment effects. Not all studies are capable of achieving traditionally high power: -the possible sampling or treatment assignments may simply be fundamentally too noisy. +sufficiently precise sampling or treatment assignments may not be available. This may be especially true for novel or small-scale studies -- things that have never been tried before may be hard to fund or execute at scale. 
What is important is that every study includes reasonable estimates of its power, -so that the evidentiary value of its results can be honestly assessed. -Demonstrating that sampling and randomization were taken seriously into consideration +so that the evidentiary value of its results can be assessed. +Demonstrating that sampling and randomization were taken into consideration before going to field lends credibility to any research study. -Using these tools to design the most highly-powered experiments possible -is a responsible and ethical use of donor and client resources, -and maximizes the likelihood that reported effect sizes are accurate. +Using these tools to design the best experiments possible +maximizes the likelihood that reported estimates are accurate. \end{fullwidth} %----------------------------------------------------------------------------------------------- From 73d7f1e2bc8b442ad23e35723eb386ff3879fd93 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Sat, 9 Nov 2019 13:47:25 +0530 Subject: [PATCH 124/854] Clarify seed source (#230) --- chapters/sampling-randomization-power.tex | 1 + 1 file changed, 1 insertion(+) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 907d8006d..dec06c3d6 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -125,6 +125,7 @@ \subsection{Reproducibility in random Stata processes} \textbf{Seeding} means manually setting the start-point of the randomization algorithm. You can draw a six-digit seed randomly by visiting \url{http://bit.ly/stata-random}. +(This link is a shortcut to a specific request on \url{https://www.random.org}.) There are many more seeds possible but this is a large enough set for most purposes. In Stata, \texttt{set seed [seed]} will set the generator to that state. You should use exactly one seed per randomization process: From c48ac1fdf71484a9e8d5f02b5acb4966a4a85cbc Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Sat, 9 Nov 2019 13:50:03 +0530 Subject: [PATCH 125/854] Clarification (#135) --- chapters/sampling-randomization-power.tex | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index dec06c3d6..8c37df678 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -24,10 +24,10 @@ and to make optimal choices when planning studies. All random processes introduce statistical noise or uncertainty into the final estimates of effect sizes. -Sampling noise produces some probability of -selection of units to measure that will produce incorrect estimates, and -randomization noise produces some probability of -placement of units into treatment arms that does the same. +Choosing one sample from all the possibilities produces some probability of +choosing a group of units that are not, in fact representative. +Choosing one final randomization assignment similarly produces some probability of +creating groups that are not good counterfactuals for each other. Power calculation and randomization inference are the main methods by which these probabilities of error are assessed. 
Good experimental design has high \textbf{power} -- a low likelihood that this noise From 037d75aaa8e5526224eb92d6a9d88ca9889d6429 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Sat, 9 Nov 2019 14:16:27 +0530 Subject: [PATCH 126/854] Typos and cleanup --- chapters/sampling-randomization-power.tex | 59 ++++++++++++----------- 1 file changed, 31 insertions(+), 28 deletions(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 8c37df678..06e0bdcc1 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -25,7 +25,7 @@ All random processes introduce statistical noise or uncertainty into the final estimates of effect sizes. Choosing one sample from all the possibilities produces some probability of -choosing a group of units that are not, in fact representative. +choosing a group of units that are not, in fact, representative. Choosing one final randomization assignment similarly produces some probability of creating groups that are not good counterfactuals for each other. Power calculation and randomization inference @@ -66,7 +66,7 @@ \section{Random processes in Stata} Randomization is challenging. It is deeply unintuitive for the human brain. ``True'' randomization is also nearly impossible to achieve for computers, -which are inherently deterministic. There are plenty of sources to read about this.\sidenote{ +which are inherently deterministic.\sidenote{ \url{https://www.random.org/randomness/}} For our purposes, we will focus on what you need to understand in order to produce truly random results for your project using Stata, @@ -96,7 +96,7 @@ \subsection{Reproducibility in random Stata processes} \url{https://www.stata.com/manuals14/rsetseed.pdf}} However, for true reproducible randomization, we need two additional properties: we need to be able to fix the starting point so we can come back to it later; -and we need that starting point to be independently random from our process. +and we need to ensure that the starting point is independently random from our process. In Stata, this is accomplished through three command concepts: \textbf{versioning}, \textbf{sorting}, and \textbf{seeding}. @@ -109,31 +109,33 @@ \subsection{Reproducibility in random Stata processes} (Note that you will \textit{never} be able to transfer a randomization to another software such as R.) The \texttt{ieboilstart} command in \texttt{ietoolkit} provides functionality to support this requirement.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/ieboilstart}} -We recommend, you use \texttt{ieboilstart} at the beginning of your master do-file.\sidenote{ +We recommend you use \texttt{ieboilstart} at the beginning of your master do-file.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Master_Do-files}} However, note that testing your do-files without running them -via the master do-file may produce different resuls: -Stata's \texttt{version} expires after execution just like a \texttt{local}. +via the master do-file may produce different results, +since Stata's \texttt{version} expires after execution just like a \texttt{local}. \textbf{Sorting} means that the actual data that the random process is run on is fixed; because numbers are assigned to each observation in sequence, changing their order will change the result of the process. A corollary is that the underlying data must be unchanged between runs: you must make a fixed final copy of the data when you run a randomization for fieldwork. 
-In Stata, the only way to guarantee a unique sorting order is to use\texttt{isid [id\_variable], sort}. (The \texttt{sort , stable} command is insufficient.)
-You can additional use the \texttt{datasignature} commannd to make sure the data is fixed.
+In Stata, the only way to guarantee a unique sorting order is to use
+\texttt{isid [id\_variable], sort}. (The \texttt{sort, stable} command is insufficient.)
+You can additionally use the \texttt{datasignature} command to make sure the data is unchanged.

\textbf{Seeding} means manually setting the start-point of the randomization algorithm.
You can draw a six-digit seed randomly by visiting \url{http://bit.ly/stata-random}.
(This link is a shortcut to a specific request on \url{https://www.random.org}.)
There are many more seeds possible but this is a large enough set for most purposes.
In Stata, \texttt{set seed [seed]} will set the generator to that state.
-You should use exactly one seed per randomization process:
-what is important is that each of these seeds is truly random.
+You should use exactly one seed per randomization process.
+The most important thing is that each of these seeds is truly random,
+so do not use shortcuts such as the current date or a fixed seed.
You will see in the code below that we include the source and timestamp for verification.
Any process that includes a random component is a random process,
including sampling, randomization, power calculation, and algorithms like bootstrapping.
-Other commands may induce randomness or alter the seed without you realizing it,
+Other commands may induce randomness in the data or alter the seed without you realizing it,
so carefully confirm exactly how your code runs before finalizing it.\sidenote{
\url{https://dimewiki.worldbank.org/wiki/Randomization_in_Stata}}
To confirm that a randomization has worked well before finalizing its results,
@@ -144,17 +146,18 @@ \subsection{Reproducibility in random Stata processes}

\section{Sampling and randomization}

-The sampling and randomization processes that we choose
+The sampling and randomization processes we choose
play an important role in determining the size of the confidence intervals
for any estimates generated from that sample,
-and therefore our ability to draw conclusions with confidence.
-(Note that random sampling and random assignment serve different purposes:
-random sampling ensures that you have unbiased population estimates,
+and therefore our ability to draw conclusions.
+(Note that random sampling and random assignment serve different purposes.
+Random sampling ensures that you have unbiased population estimates,
and random assignment ensures that you have unbiased treatment estimates.)
If you randomly sample or assign a set number of observations from a set frame,
there are a large -- but fixed -- number of permutations which you may draw.
+
In reality, you have to work with exactly one of them,
-so we put a lot of effort into making sure that one is a good one,
+so we put a lot of effort into making sure that one is a good one
by reducing the probability that we observe nonexistent, or ``spurious'', results.
In large studies, we can use what are called \textbf{asymptotic standard errors}
to express how far away from the true population parameters our estimates are likely to be.
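Before moving on to sampling, the versioning, sorting, and seeding requirements described above can be made concrete with a minimal Stata sketch. This is an illustrative sketch only, not part of the original chapter: the file name, ID variable, seed value, and date are placeholders.

\begin{verbatim}
    * Illustrative sketch only: names, seed, and date are hypothetical.
    ieboilstart, version(13.1)     // versioning, via -ietoolkit-
    `r(version)'

    use "baseline_listing.dta", clear

    isid household_id, sort        // sorting: enforce a unique, stable order
    * (datasignature set / confirm can verify the data are unchanged between runs)
    set seed 287608                // seed drawn 2019-12-01 from http://bit.ly/stata-random

    gen random_draw = runiform()   // the random process itself
\end{verbatim}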
@@ -188,7 +191,7 @@ \subsection{Sampling} The most explicit method of implementing this process is to assign random numbers to all your potential observations, order them by the number they are assigned, -and mark as `sampled' those with the lowest numbers, to the desired proportion. +and mark as `sampled' those with the lowest numbers, up to the desired proportion. (In general, we will talk about sampling proportions rather than numbers of observations. Sampling specific numbers of observations is complicated and should be avoided, because it will make the probability of selection very hard to calculate.) @@ -208,10 +211,10 @@ \subsection{Sampling} Ex post changes to the study scope using a sample drawn for a different purpose usually involve tedious calculations of probabilities and should be avoided. -\section{Randomization} +\subsection{Randomization} \textbf{Randomization} is the process of assigning units to some kind of treatment program. -Most of the Stata commands shown for sampling can be directly transferred to randomization, +Most of the Stata commands used for sampling can be directly transferred to randomization, since randomization is also a process of splitting a sample into groups. Where sampling determines whether a particular individual will be observed at all in the course of data collection, @@ -239,7 +242,7 @@ \section{Randomization} It is possible to do this using survey software or live events. These methods typically do not leave a record of the randomization, so particularly when the experiment is electronic, -it is best to execute the randomization in advance and preload the results. +it is best to execute the randomization in advance and preload the results if possible. Even when randomization absolutely cannot be done in advance, it is still useful to build a corresponding model of the randomization process in Stata so that you can conduct statistical analysis later @@ -283,7 +286,7 @@ \subsection{Clustering} although it typically needs to be performed manually. To cluster sampling or randomization, \texttt{preserve} the data, keep one observation from each cluster -using a command like \texttt{bys [cluster] : keep if _n == 1}. +using a command like \texttt{bys [cluster] : keep if \_n == 1}. Then sort the data and set the seed, and generate the random assignment you need. Save the assignment in a separate dataset or a \texttt{tempfile}, then \texttt{restore} and \texttt{merge} the assignment back on to the original dataset. @@ -300,7 +303,7 @@ \subsection{Clustering} \subsection{Stratification} -\texttt{Stratification} is a study design component +\textbf{Stratification} is a study design component that breaks the full set of observations into a number of subgroups before performing randomization within each subgroup.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Stratified_Random_Sample}} @@ -385,7 +388,7 @@ \subsection{Power calculations} \url{http://www.stat.columbia.edu/~gelman/stuff_for_blog/chap20.pdf}} The purpose of power calculations is to identify where the strengths and weaknesses of your design are located, so you know the relative tradeoffs you will face -by changing your randomization schemes so can select a final design appropriately. +by changing your randomization scheme for the final design. 
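(As a brief aside, the sampling and cluster-randomization procedures described earlier in this section can be sketched in Stata as follows; the variable names, the 30 percent proportion, and the seed values are hypothetical, and the sketch is not part of the original chapter.)

\begin{verbatim}
    * Illustrative sketch only: names, proportion, and seeds are hypothetical.

    * Simple random sampling of a fixed proportion
    isid household_id, sort
    set seed 556292
    gen random_draw = runiform()
    sort random_draw
    gen sampled = (_n <= 0.30 * _N)      // lowest draws are marked as sampled

    * Cluster-level randomization: assign treatment by village
    preserve
        bys village_id : keep if _n == 1
        isid village_id, sort
        set seed 918067                  // a separate, independently drawn seed
        gen cluster_draw = runiform()
        sort cluster_draw
        gen treatment = (_n <= _N / 2)
        keep village_id treatment
        tempfile assignment
        save `assignment'
    restore
    merge m:1 village_id using `assignment', nogen
\end{verbatim}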
They also allow realistic interpretations of evidence: results low-power studies can be very interesting, but they have a correspondingly higher likelihood @@ -453,7 +456,7 @@ \subsection{Randomization inference} in quasi-experimental designs and in small samples, because these conditions usually lead to the situation where the number of possible \textit{randomizations} is itself small. -In those cases, we cannot rely on the usual conclusiong +In those cases, we cannot rely on the usual assertion (a consequence of the Central Limit Theorem) that the variance of the treatment effect estimate is normal, and we therefore cannot use the ``asymptotic'' standard errors from Stata. @@ -469,8 +472,8 @@ \subsection{Randomization inference} that this randomization approach can produce by chance between outcomes and treatments. The randomization-inference p-value is the number of times -that a false effect was larger than the one you measured; -interpretable as the probability that a program with no effect +that a false effect was larger than the one you measured, +and it is interpretable as the probability that a program with no effect would have given you a result like the one actually observed. These randomization inference\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Randomization\_Inference}} @@ -482,11 +485,11 @@ \subsection{Randomization inference} As long as there is some outcome data usable at this stage, you can use the same procedure to examine the potential treatment effects that your exact design is likely to produce. -The range of these effect, again, may be very different +The range of these effects, again, may be very different from those predicted by standard approaches to power calculation, and randomization inference futher allows visual inspection of results. If there is significant heaping at particular result levels, -or results seem to depend dramatically on the placement of a small number of individuals, +or if results seem to depend dramatically on the placement of a small number of individuals, randomization inference will flag those issues before the experiment is fielded and allow adjustments to the design to be made. From 7bd16f95747c2cc1b894ef49d01c0f1e94290e7f Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 10 Dec 2019 16:08:04 -0800 Subject: [PATCH 127/854] Dynamic documents --- chapters/publication.tex | 90 +++++++++++++++++++++++++++++++++++++--- 1 file changed, 85 insertions(+), 5 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index a9496440c..7f729f772 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -34,10 +34,86 @@ %------------------------------------------------ -\section{Collaborating on academic writing} +\section{Collaborating on technical writing} -The gold standard for academic writing is \LaTeX.\sidenote{\url{https://www.maths.tcd.ie/~dwilkins/LaTeXPrimer/GSWLaTeX.pdf}} -\index{\LaTeX} +It is increasingly rare that a single author will prepare an entire manuscript alone. +More often than not, documents will pass back and forth between several writers +before they are prepared for publication, +so it is essential to use technology and workflows that avoid conflicts. +Just as with the preparation of analytical outputs, +this means adopting tools practices that enable tasks +such as version control and simultaneous contribution. 
+Furthermore, it means preparing documents that are \textbf{dynamic} --
+meaning that the analytical outputs that constitute them
+can be updated in the final output with a single process,
+rather than copy-and-pasted or otherwise handled individually.
+Thinking of the writing process in this way
+is intended to improve organization and reduce error,
+such that there is no risk of materials being compiled
+with out-of-date results, or of completed work being lost or redundant.
+
+\subsection{Dynamic documents}
+
+Dynamic documents are a broad class of tools that enable such a workflow.
+The term ``dynamic'' can refer to any document-creation technology
+that allows the creation of explicit references to raw output files.
+This means that, whenever outputs are updated,
+the next iteration of the document will automatically include
+all changes made to all outputs without any additional intervention from the writer.
+This means that updates will never be accidentally excluded,
+and it further means that updating results will never become more difficult
+as the number of inputs grows,
+because they are all managed by a single integrated process.
+
+You will note that this is not possible in tools like Microsoft Office.
+In Word, for example, you have to copy and paste each object individually
+whenever there are materials that have to be updated.
+This means that neither of the features above is available:
+fully updating the document becomes more and more time-consuming
+as the number of inputs increases,
+and it therefore becomes more and more likely
+that a mistake will be made or something will be missed.
+Furthermore, it is very hard to simultaneously edit or track changes
+in a Microsoft Word document.
+It is usually the case that a file needs to be passed back and forth
+and the order of contributions strictly controlled
+so that time-consuming resolutions of differences can be avoided.
+Therefore this is a broadly unsuitable way to prepare technical documents.
+
+There are a number of tools that can be used for dynamic documents.
+They fall into two broad groups --
+the first which compiles a document as part of code execution,
+and the second which operates a separate document compiler.
+In the first group are tools such as R's RMarkdown and Stata's \texttt{dyndoc}.
+These tools ``knit'' or ``weave'' text and code together,
+and are programmed to insert code outputs in pre-specified locations.
+Documents called ``notebooks'' (such as Jupyter) work similarly,
+as they also use the underlying analytical software to create the document.
+These types of dynamic documents are usually appropriate for short or informal materials
+because they tend to offer limited editability outside the base software
+and often have limited abilities to incorporate precision formatting.
+
+On the other hand, some dynamic document tools do not require
+operation of any underlying software, but simply require
+that the writer have access to the updated outputs.
+One very simple example is Dropbox Paper, a free online writing tool
+that allows linkages to files in Dropbox,
+which are then automatically updated anytime the file is replaced.
+Like the first class of tools, Dropbox Paper has very limited formatting options,
+but it is appropriate for work with collaborators who are not using statistical software.
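(As a brief aside, a minimal example of the first group of tools is sketched below using Stata's \texttt{dyndoc}; the file name, dataset, and variable are hypothetical, and the sketch is not part of the original patch.)

\begin{verbatim}
<!-- report.txt: a hypothetical dynamic-document source file -->

Enrollment summary
------------------

<<dd_do>>
use "enrollment.dta", clear
quietly summarize enrolled
<</dd_do>>

The mean enrollment rate in this sample is <<dd_display: %4.2f r(mean)>>.
\end{verbatim}

Running \texttt{dyndoc report.txt, replace} regenerates the output document with the latest numbers, so results are never pasted in by hand.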
+However, the most widely utilized software +for dynamically managing both text and results is \LaTeX.\sidenote{ + \url{https://www.maths.tcd.ie/~dwilkins/LaTeXPrimer/GSWLaTeX.pdf}} + \index{\LaTeX} +(\LaTeX also operates behind-the-scenes in many other tools.) +While this tool has a significant learning curve, +its enormous flexibility in terms of operation, collaboration, +and output formatting and styling +makes it the primary choice for most large technical outputs today. + +\subsection{Technical writing with \LaTeX} + +The gold standard for academic writing is \LaTeX. \LaTeX\ allows automatically-organized sections like titles and bibliographies, imports tables and figures in a dynamic fashion, and can be version controlled using Git. @@ -55,7 +131,7 @@ \section{Collaborating on academic writing} for someone new to \LaTeX\ to be able to ``just write'' is often the web-based Overleaf suite.\sidenote{\url{https://www.overleaf.com}} Overleaf offers a \textbf{rich text editor} -that behaves pretty similarly to familiar tools like Word. +that behaves pretty similarly to familiar tools like Word. With minimal workflow adjustments, you can to show coauthors how to write and edit in Overleaf, so long as you make sure you are always available to troubleshoot @@ -120,7 +196,11 @@ \section{Collaborating on academic writing} %------------------------------------------------ -\section{Publishing data and code for replication} +\section{Preparing a complete replication package} + +\section{Publishing data for replication} + +\section{Publishing code for replication} Data and code should always be released with any publication.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Publishing_Data}} \index{data publication} From d52518b2514581ecf710db8aea9cf675850018b8 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 10 Dec 2019 16:30:41 -0800 Subject: [PATCH 128/854] Get started on LaTeX --- chapters/publication.tex | 40 +++++++++++++++++++++++++++++----------- 1 file changed, 29 insertions(+), 11 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 7f729f772..52f97976c 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -102,10 +102,10 @@ \subsection{Dynamic documents} Like the first class of tools, Dropbox Paper has very limited formatting options, but it is appropriate for work with collaborators who are not using statistical software. However, the most widely utilized software -for dynamically managing both text and results is \LaTeX.\sidenote{ +for dynamically managing both text and results is \LaTeX\.\sidenote{ \url{https://www.maths.tcd.ie/~dwilkins/LaTeXPrimer/GSWLaTeX.pdf}} \index{\LaTeX} -(\LaTeX also operates behind-the-scenes in many other tools.) +(\LaTeX\ also operates behind-the-scenes in many other tools.) While this tool has a significant learning curve, its enormous flexibility in terms of operation, collaboration, and output formatting and styling @@ -113,19 +113,37 @@ \subsection{Dynamic documents} \subsection{Technical writing with \LaTeX} -The gold standard for academic writing is \LaTeX. -\LaTeX\ allows automatically-organized sections like titles and bibliographies, -imports tables and figures in a dynamic fashion, -and can be version controlled using Git. +\LaTeX\ is billed as a ``document preparation system''. +What this means is worth unpacking. 
+In \LaTeX\, instead of writing in a ``what-you-see-is-what-you-get'' mode +as you do in Word or the equivalent, +you write plain text interlaced with specific instructions for formatting +(similar in concept to HTML). +The \LaTeX\ system includes commands for simple markup +like font styles, paragraph formatting, section headers and the like. +But it also includes special controls for including tables and figures, +footnotes and endnotes, complex mathematics, and automated bibliography preparation. +It also allows publishers to apply global styles and templates +to already-written material, allowing them to reformat entire documents in house styles +with only a few keystrokes. +In sum, \LaTeX\ enables automatically-organized documents, +manages tables and figures dynamically, +and (because it is written in plain text) can be version-controlled using Git. +This is why it has become the dominant ``document preparation system'' in technical writing. + Unfortunately, \LaTeX\ can be a challenge to set up and use at first, -particularly for people who are unfamiliar with plaintext, code, or file management. +particularly if you are new to working with plain text code and file management. \LaTeX\ requires that all formatting be done in its special code language, and it is not particularly informative when you do something wrong. This can be off-putting very quickly for people -who simply want to get to writing, like lead authors. -Therefore, if we want to take advantage of the features of \LaTeX, -without getting stuck in the weeds of it, -we will need to adopt a few tools and tricks to make it effective. +who simply want to get to writing, like senior researchers. +While integrated editing and compiling tools like TeXStudio and Atom +offer the most flexibility to work with \LaTeX\ on your computer, +they can require a lot of troubleshooting at a basic level at first, +and non-technical staff may not be willing or able to acquire the required knowledge. +Therefore, to take advantage of the features of \LaTeX, +while making it easy and accessible to the entire writing team, +we need to abstract away from the technical details where possible. The first is choice of software. The easiest way for someone new to \LaTeX\ to be able to ``just write'' From d36de6a3bfa7158b7abcc01bd6f911c878c10f0d Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 10 Dec 2019 16:43:52 -0800 Subject: [PATCH 129/854] Some Overleaf material --- chapters/publication.tex | 64 ++++++++++++++++++++++------------------ 1 file changed, 36 insertions(+), 28 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 52f97976c..d4a82195a 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -137,40 +137,48 @@ \subsection{Technical writing with \LaTeX} and it is not particularly informative when you do something wrong. This can be off-putting very quickly for people who simply want to get to writing, like senior researchers. -While integrated editing and compiling tools like TeXStudio and Atom +While integrated editing and compiling tools like TeXStudio\sidenote{ + \url{https://www.texstudio.org}} +and \texttt{atom-latex}\sidenote{ + \url{https://atom.io/packages/atom-latex}} offer the most flexibility to work with \LaTeX\ on your computer, -they can require a lot of troubleshooting at a basic level at first, +such as advanced integration with Git, +the entire team needs to be comfortable +with \LaTeX\ before adopting one of these tools. 
+They can require a lot of troubleshooting at a basic level at first, and non-technical staff may not be willing or able to acquire the required knowledge. Therefore, to take advantage of the features of \LaTeX, while making it easy and accessible to the entire writing team, we need to abstract away from the technical details where possible. -The first is choice of software. The easiest way -for someone new to \LaTeX\ to be able to ``just write'' -is often the web-based Overleaf suite.\sidenote{\url{https://www.overleaf.com}} -Overleaf offers a \textbf{rich text editor} -that behaves pretty similarly to familiar tools like Word. -With minimal workflow adjustments, you can -to show coauthors how to write and edit in Overleaf, -so long as you make sure you are always available to troubleshoot -\LaTeX\ crashes and errors. It also offers a convenient selection of templates -so it is easy to start up a project -and replicate a lot of the underlying setup code. -One of the most common issues you will face while using Overleaf will be special characters, namely -\texttt{\&}, \texttt{\%}, and \texttt{\_}, -which need to be \textbf{escaped} (instructed to interpret literally) -by writing a backslash (\texttt{\textbackslash}) before them, -such as \texttt{40\textbackslash\%} for the percent sign to function. -Another issue is that you need to upload input files -(such as figures and tables) manually. -This can create conflicts when these inputs are still being updated -- -namely, the main document not having the latest results. -One solution is to move to Overleaf only once there will not be substantive changes to results. - -Other popular desktop-based tools for writing \LaTeX are TeXstudio\sidenote{\url{https://www.texstudio.org}} and atom-latex\sidenote{\url{https://atom.io/packages/atom-latex}}. -They allow more advanced integration with Git, -among other advantages, but the entire team needs to be comfortable -with \LaTeX\ before adopting one of these tools. +The easiest way for someone new to \LaTeX\ to be able to ``just write'' +is often the web-based Overleaf suite.\sidenote{ + \url{https://www.overleaf.com}} +Overleaf offers a text editor that behaves pretty similarly to familiar tools like Word, +and it is free-to-use for a broad variety of basic applications. +It allows simultaneous online editing and invitations similarly to Google Docs, +handles most of the basic under-the-hood technical requirements. +Overleaf also offers a convenient selection of templates +so it is easy to start up a project and see results right away. +On the downside, there is a small amount of up-front learning required, +continous access to the Internet is necessary, +and updating figures and tables requires a bulk file upload that is tough to automate. + +One of the most common issues you will face using Overleaf will be special characters +which, because of code functions, need to be handled differently than in Word. +Most critically, the ampersand (\texttt{\&}), percent (\texttt{\%}), and underscore (\texttt{\_}) +need to be ``escaped'' (interpreted as text and not code) in order to render. +This is done by by writing a backslash (\texttt{\textbackslash}) before them, +such as writing \texttt{40\textbackslash\%} for the percent sign to appear in text. +Despite this, we believe that with minimal learning and workflow adjustments, +Overleaf is often the easiest way to allow coauthors to write and edit in \LaTeX\, +so long as you make sure you are available to troubleshoot minor issues like these. 
+ + + + + + One of the important tools available in \LaTeX\ is the \textbf{BibTeX bibliography manager}.\sidenote{\url{http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.365.3194&rep=rep1&type=pdf}} This tool stores unformatted references From 0b71ba96c053697ae2e232eeec20620ea5d59bb4 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 10 Dec 2019 16:54:11 -0800 Subject: [PATCH 130/854] Finish LaTeX, separate Overleaf --- chapters/publication.tex | 90 +++++++++++++++++++++++----------------- 1 file changed, 52 insertions(+), 38 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index d4a82195a..73027296b 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -151,6 +151,53 @@ \subsection{Technical writing with \LaTeX} while making it easy and accessible to the entire writing team, we need to abstract away from the technical details where possible. +One of the most important tools available in \LaTeX\ is the BibTeX bibliography manager.\sidenote{ + \url{http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.365.3194&rep=rep1&type=pdf}} +BibTeX keeps all the references you might use in an auxiliary file, +then references them as plain text in the document using a \LaTeX\ command. +The same principles that apply to figures and tables are therefore applied here: +You can make changes to the references in one place (the \texttt{.bib} file), +and then everywhere they are used they are updated correctly with one process. +Specifically, \LaTeX\ inserts references in text using the \texttt{\textbackslash cite\{\}} command. +Once this is written, \LaTeX\ automatically pulls all the citations into text +and creates a complete bibliography based on the citations you use when you compile the document. +The system allows you to specify exactly how references should be displayed in text +(such as superscripts, inline references, etc.) +as well as how the bibliography should be styled and in what order +(such as Chicago, MLA, Harvard, or other common styles). +To obtain the references for the \texttt{.bib} file, +you can copy the specification directly from Google Scholar +by clicking ``BibTeX'' at the bottom of the Cite window. +When pasted into the \texttt{.bib} file they look like the following: + +\codeexample{sample.bib}{./code/sample.bib} + +\noindent BibTeX citations are then used as follows: + +\codeexample{citation.tex}{./code/citation.tex} + +With these tools, you can ensure that references are handled +in a format you can manage and control.\cite{flom2005latex} +Finally, \LaTeX\ has one more useful trick: +and use \textbf{\texttt{pandoc}},\sidenote{\url{http://pandoc.org/}} +you can translate the raw document into Word +(or a number of other formats) +by running the following code from the command line: + +\codeexample{pandoc.sh}{./code/pandoc.sh} + +\noindent The last portion after \texttt{--csl=} specifies the bibliography style. +You can download a CSL (Citation Styles Library) file\sidenote{ + \url{https://github.com/citation-style-language/styles}} +for nearly any journal and have it applied automatically in this process. +Therefore, even in the case where you are requested to provide +\texttt{.docx} versions of materials to others, or tracked-changes versions, +you can create them effortlessly, +and use external tools like Word's compare feature +to generate integrated tracked versions when needed. 
+ +\subsection{Getting started with \LaTeX via Overleaf} + The easiest way for someone new to \LaTeX\ to be able to ``just write'' is often the web-based Overleaf suite.\sidenote{ \url{https://www.overleaf.com}} @@ -164,6 +211,11 @@ \subsection{Technical writing with \LaTeX} continous access to the Internet is necessary, and updating figures and tables requires a bulk file upload that is tough to automate. +The purpose of this setup, just like with other synced folders, +is to avoid there ever being more than one master copy of the document. +This means that people can edit simultaneously without fear of conflicts, +and it is never necessary to manually resolve differences in the document. + One of the most common issues you will face using Overleaf will be special characters which, because of code functions, need to be handled differently than in Word. Most critically, the ampersand (\texttt{\&}), percent (\texttt{\%}), and underscore (\texttt{\_}) @@ -180,45 +232,7 @@ \subsection{Technical writing with \LaTeX} -One of the important tools available in \LaTeX\ is the \textbf{BibTeX bibliography manager}.\sidenote{\url{http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.365.3194&rep=rep1&type=pdf}} -This tool stores unformatted references -in an accompanying \texttt{.bib} file, -and \LaTeX\ then inserts them in text -using the \texttt{\textbackslash cite\{\}} command. -With this structure, \LaTeX\ automatically pulls -all the citations into text. \LaTeX\ allows you to specify -how they should be displayed in text -(ie, as superscripts, inline references, etc.) -and how the bibliography should be styled and in what order. -A reference in the \texttt{.bib} file -can be copied directly from Google Scholar -by clicking "BibTeX" at the bottom of the Cite window. -When pasted into the \texttt{.bib} file they look like the following: - -\codeexample{sample.bib}{./code/sample.bib} - -\noindent BibTeX citations are then used as follows: - -\codeexample{citation.tex}{./code/citation.tex} - -With these tools, you can ensure that co-authors are writing -in a format you can manage and control.\cite{flom2005latex} -The purpose of this setup, just like with other synced folders, -is to avoid there ever being more than one master copy of the document. -This means that people can edit simultaneously without fear of conflicts, -and it is never necessary to manually resolve differences in the document. -Finally, \LaTeX\ has one more useful trick: -if you download a journal styler from the \textbf{Citation Styles Library}\sidenote{ -\url{https://github.com/citation-style-language/styles}} -and use \textbf{\texttt{pandoc}},\sidenote{\url{http://pandoc.org/}} -you can translate the raw document into Word by running the following code from the command line: - -\codeexample{pandoc.sh}{./code/pandoc.sh} -Therefore, even in the case where you are requested to provide -\texttt{.docx} versions of materials to others, or tracked-changes versions, -you can create them effortlessly, -and use Word's compare feature to generate a single integrated tracked version. 
%------------------------------------------------ From 55045326ae07d91e2bd9e3f43477c7925edc2b8c Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 10 Dec 2019 17:02:44 -0800 Subject: [PATCH 131/854] Small fixes --- chapters/publication.tex | 4 ++-- code/citation.tex | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 73027296b..c1792211e 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -193,10 +193,10 @@ \subsection{Technical writing with \LaTeX} Therefore, even in the case where you are requested to provide \texttt{.docx} versions of materials to others, or tracked-changes versions, you can create them effortlessly, -and use external tools like Word's compare feature +and use external tools like Word's compare feature to generate integrated tracked versions when needed. -\subsection{Getting started with \LaTeX via Overleaf} +\subsection{Getting started with \LaTeX\ via Overleaf} The easiest way for someone new to \LaTeX\ to be able to ``just write'' is often the web-based Overleaf suite.\sidenote{ diff --git a/code/citation.tex b/code/citation.tex index 7041217be..d9a4681a3 100644 --- a/code/citation.tex +++ b/code/citation.tex @@ -1,2 +1,2 @@ -With these tools, you can ensure that co-authors are writing +With these tools, you can ensure that references are handled in a format you can manage and control.\cite{flom2005latex} From 7914a02c9ff7f18b633288b6325dee854019db5e Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 10 Dec 2019 17:05:15 -0800 Subject: [PATCH 132/854] Small fixes --- chapters/publication.tex | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index c1792211e..cd4d35cbe 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -179,7 +179,8 @@ \subsection{Technical writing with \LaTeX} With these tools, you can ensure that references are handled in a format you can manage and control.\cite{flom2005latex} Finally, \LaTeX\ has one more useful trick: -and use \textbf{\texttt{pandoc}},\sidenote{\url{http://pandoc.org/}} +using \textbf{\texttt{pandoc}},\sidenote{ + \url{http://pandoc.org/}} you can translate the raw document into Word (or a number of other formats) by running the following code from the command line: From b2d4a55f49b342efa20592f35df22030a8b5139d Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 11 Dec 2019 13:40:29 -0800 Subject: [PATCH 133/854] Overleaf draft --- chapters/publication.tex | 49 +++++++++++++++++++++------------------- 1 file changed, 26 insertions(+), 23 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index cd4d35cbe..1e8b76b3c 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -102,7 +102,7 @@ \subsection{Dynamic documents} Like the first class of tools, Dropbox Paper has very limited formatting options, but it is appropriate for work with collaborators who are not using statistical software. However, the most widely utilized software -for dynamically managing both text and results is \LaTeX\.\sidenote{ +for dynamically managing both text and results is \LaTeX\ (pronounced ``lah-tek'').\sidenote{ \url{https://www.maths.tcd.ie/~dwilkins/LaTeXPrimer/GSWLaTeX.pdf}} \index{\LaTeX} (\LaTeX\ also operates behind-the-scenes in many other tools.) @@ -117,7 +117,7 @@ \subsection{Technical writing with \LaTeX} What this means is worth unpacking. 
In \LaTeX\, instead of writing in a ``what-you-see-is-what-you-get'' mode as you do in Word or the equivalent, -you write plain text interlaced with specific instructions for formatting +you write plain text interlaced with coded instructions for formatting (similar in concept to HTML). The \LaTeX\ system includes commands for simple markup like font styles, paragraph formatting, section headers and the like. @@ -199,25 +199,36 @@ \subsection{Technical writing with \LaTeX} \subsection{Getting started with \LaTeX\ via Overleaf} -The easiest way for someone new to \LaTeX\ to be able to ``just write'' -is often the web-based Overleaf suite.\sidenote{ +\LaTeX\ is a challenging tool to get started using, +but the control it offers over the writing process is invaluable. +In order to make it as easy as possible for your team +to use \LaTeX\ without all members having to invest in new skills, +we suggest using the web-based Overleaf implementation as your first foray into \LaTeX\ writing.\sidenote{ \url{https://www.overleaf.com}} -Overleaf offers a text editor that behaves pretty similarly to familiar tools like Word, -and it is free-to-use for a broad variety of basic applications. -It allows simultaneous online editing and invitations similarly to Google Docs, -handles most of the basic under-the-hood technical requirements. +While the Overleaf site has a subscription feature that offers some useful extensions, +its free-to-use version offers basic tools that are sufficient +for a broad variety of basic applications, +up to and including writing a complete academic paper with coauthors. + +Overleaf's implementation of \LaTeX\ is suggested here for several reasons. +Since it is completely hosted online, +it avoids the inevitable troubleshooting of setting up a \LaTeX\ installation +on various personal computers run by the different members of your team. +It also automatically maintains a single master copy of the document +so that different writers do not create conflicted or out-of-sync copies, +and allows inviting collaborators to edit in a fashion similar to Google Docs. +Overleaf also offers a basic version history tool that avoids having to use separate software. +Most importantly, it provides a `` rich text'' editor +that behaves pretty similarly to familiar tools like Word, +so that people can write into the document without worrying too much +about the underlying \LaTeX\ coding. Overleaf also offers a convenient selection of templates so it is easy to start up a project and see results right away. + On the downside, there is a small amount of up-front learning required, continous access to the Internet is necessary, and updating figures and tables requires a bulk file upload that is tough to automate. - -The purpose of this setup, just like with other synced folders, -is to avoid there ever being more than one master copy of the document. -This means that people can edit simultaneously without fear of conflicts, -and it is never necessary to manually resolve differences in the document. - -One of the most common issues you will face using Overleaf will be special characters +One of the most common issues you will face using Overleaf's `` rich text'' editor will be special characters which, because of code functions, need to be handled differently than in Word. Most critically, the ampersand (\texttt{\&}), percent (\texttt{\%}), and underscore (\texttt{\_}) need to be ``escaped'' (interpreted as text and not code) in order to render. 
@@ -227,14 +238,6 @@ \subsection{Getting started with \LaTeX\ via Overleaf} Overleaf is often the easiest way to allow coauthors to write and edit in \LaTeX\, so long as you make sure you are available to troubleshoot minor issues like these. - - - - - - - - %------------------------------------------------ \section{Preparing a complete replication package} From 86e262af3b6a91c7f22f6f659f0f059e2a8ede28 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 11 Dec 2019 13:42:54 -0800 Subject: [PATCH 134/854] Small fix --- chapters/publication.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 1e8b76b3c..51f1b3a80 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -187,7 +187,7 @@ \subsection{Technical writing with \LaTeX} \codeexample{pandoc.sh}{./code/pandoc.sh} -\noindent The last portion after \texttt{--csl=} specifies the bibliography style. +\noindent The last portion after \texttt{csl=} specifies the bibliography style. You can download a CSL (Citation Styles Library) file\sidenote{ \url{https://github.com/citation-style-language/styles}} for nearly any journal and have it applied automatically in this process. From 24302f4b59974e8c896f29734b158ceafe6841e8 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 11 Dec 2019 14:19:00 -0800 Subject: [PATCH 135/854] Code and data introduction --- chapters/publication.tex | 145 ++++++++++++++++++++++++--------------- 1 file changed, 88 insertions(+), 57 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 51f1b3a80..7bc0c0778 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -242,74 +242,105 @@ \subsection{Getting started with \LaTeX\ via Overleaf} \section{Preparing a complete replication package} -\section{Publishing data for replication} - -\section{Publishing code for replication} - -Data and code should always be released with any publication.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Publishing_Data}} -\index{data publication} -Many journals and funders have strict open data policies, -and providing these materials to other researchers -allows them to evaluate the credibility of your work -as well as to re-use your materials for further research.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Exporting_Analysis}} -If you have followed the steps in this book carefully, -you will find that this is a very easy requirement to fulfill.\sidenote{ -\url{https://www.bitss.org/2016/05/23/out-of-the-file-drawer-tips-on-prepping-data-for-publication/}} -You will already have a publishable (either de-identified or constructed) -version of your research dataset. -You will already have your analysis code well-ordered, -and you won't have any junk lying around for cleanup. -You will have written your code so that others can read it, -and you will have documentation for all your work, -as well as a well-structured directory containing the code and data. +While we have focused so far on the preparation of written materials for publication, +is is increasingly important for you to consider how you will publish +the data and code you used for your research as well. 
+Increasingly, major journals are requiring that publications +provide direct links to both the code and data used to create the results, +and some even require that they are able to reproduce the results themselves +before they will approve a paper for publication.\sidenote{ + \url{https://www.aeaweb.org/journals/policies/data-code/}} +If your materials has been well-structured throughout the analytical process, +this will only require a small amount of extra work; +if not, paring it down to the ``replication package'' may take some time. +A complete replication package should accomplish several core functions. +It must provide the exact data and code that is used for a paper, +all necessary de-identified data for the analysis, +and all code necessary for the analysis. +The code should exactly reproduce the raw outputs you have used for the paper, +and should include no extraneous documentation or PII data you would not share publicly. If you are at this stage, -all you need to do is find a place to publish your work! -GitHub provides one of the easiest solutions here, -since it is completely free for static, public projects -and it is straightforward to simply upload a fixed directory -and obtain a permanent URL for it. -The Open Science Framework also provides a good resource, -as does ResearchGate (which can also assign a permanent -digital object identifier link for your work). +all you need to do is find a place to publish your materials. +This is slightly easier said than done, +as there are a few variables to take into consideration +and no global consensus on the best solution. +The technologies available are likely to change dramatically +over the next few years; +the specific solutions we mention here highlight some current approaches +as well as their strengths and weaknesses. +GitHub provides one solution. +Making your GitHub repository public +is completely free for finalized projects. +The site can hold any file types, +provide a structured download of your whole project, +and allow others to look at alternate versions or histories easily. +It is straightforward to simply upload a fixed directory to GitHub +apply a sharing license, and obtain a URL for the whole package. + +However, GitHub is not ideal for other reasons. +It is not built to hold data in an efficient way +or to manage licenses or citations for datasets. +It does not provide a true archive service -- +you can change or remove the contents at any time. +A repository such as the Harvard Dataverse\sidenote{ + \url{https://dataverse.harvard.edu}} +addresses these issues, but is a poor place to store code. +The Open Science Framework\sidenote{ + \url{https://osf.io}} +provides a balanced implementation +that holds both code and data (as well as simple version histories), +as does ResearchGate\sidenote{ + \url{https://https://www.researchgate.net}} +(both of which can also assign a permanent digital object identifier link for your work). Any of these locations is acceptable -- the main requirement is that the system can handle the structured directory that you are submitting, and that it can provide a stable, structured URL for your project. - -You should release a structured directory that allows a user -to immediately run your code after changing the project directory. 
-The folders should include: -all necessary de-identified data for the analysis -(including only the data needed for analysis); -the data documentation for the analysis code -(describing the source and construction of variables); -the ready-to-run code necessary for the analysis; and -the raw outputs you have used for the paper. -Using \texttt{iefolder} from our \texttt{ietoolkit} can help standardize this in Stata. -In either the \texttt{/dofiles/} folder or in the root directory, -include a master script (\texttt{.do} or \texttt{.r} for example). -The master script should allow the reviewer to change -one line of code setting the directory path. -Then, running the master script should run the entire project -and re-create all the raw outputs exactly as supplied. -Check that all your code will run completely on a new computer. -This means installing any required user-written commands in the master script -(for example, in Stata using \texttt{ssc install} or \texttt{net install} -and in R include code for installing packages, -including installing the appropriate version of the package if necessary). -Make sure settings like \texttt{version}, \texttt{matsize}, and \texttt{varabbrev} are set.\sidenote{In Stata, \texttt{ietoolkit}'s \texttt{ieboilstart} command will do this for you.} -All outputs should clearly correspond by name to an exhibit in the paper, and vice versa. - -Finally, you may also want to release an author's copy or preprint. +You can even combine more than one tool if you prefer, +as long as they clearly point to each other. https://codeocean.com +Emerging technologies such as CodeOcean\sidenote{ + \url{https://codeocean.com}} +offer to store both code and data, +and also provide an online workspace in which others +can execute and modify your code +without having to download your tools and match your local environment +when packages and other underlying softwares may have changed since publication. + +In addition to the code and data, +you may also want to release an author's copy or preprint +of the article itself along with these raw materials. Check with your publisher before doing so; not all journals will accept material that has been released. Therefore you may need to wait until acceptance is confirmed. -This can be done on a number of pre-print websites, -many of which are topically-specific. -You can also use GitHub and link the file directly +This can be done on a number of preprint websites, +many of which are topic-specific.\sidenote{ + \url{https://en.wikipedia.org/wiki/ArXiv}} +You can also use GitHub and link to the PDF file directly on your personal website or whatever medium you are sharing the preprint through. Do not use Dropbox or Google Drive for this purpose: many organizations do not allow access to these tools, and that includes blocking staff from accessing your material. + +\section{Publishing data for replication} + +\section{Publishing code for replication} + +In either a scripts folder or in the root directory, +include a master script (dofile or Rscript for example). +The master script should allow the reviewer +to change one line of code setting the directory path. +Then, running the master script should run the entire project +and re-create all the raw outputs exactly as supplied. +Indicate the filename and line to change. 
+Check that all your code will run completely on a new computer: +Install any required user-written commands in the master script +(for example, in Stata using \texttt{ssc install} or \texttt{net install} +and in R include code for installing packages, +including selecting a specific version of the package if necessary). +Make sure system settings like \texttt{version}, \texttt{matsize}, and texttt{varabbrev} are set. +All outputs should clearly correspond by name to an exhibit in the paper, and vice versa. +(Supplying a compiling \LaTeX\ document can support this.) +The submission package should include these outputs in the location they are produced, and +code and outputs which are not used should be removed. From c21e73841567c9eee413bc995b5b12d9435a0b03 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 11 Dec 2019 14:39:48 -0800 Subject: [PATCH 136/854] Start data section --- chapters/publication.tex | 41 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 41 insertions(+) diff --git a/chapters/publication.tex b/chapters/publication.tex index 7bc0c0778..1d1a64d0a 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -325,8 +325,49 @@ \section{Preparing a complete replication package} \section{Publishing data for replication} +Privacy + +Make sure you have a clear understanding of the rights associated with the data release +and communicate them to any future users of the data. +You must provide a license with any data release.\sidenote{ + \url{https://iatistandard.org/en/guidance/preparing-organisation/organisation-data-publication/how-to-license-your-data/}} +This document need not be extremely detailed, +but it should clearly communicate to the reader what they are allowed to do with your data and +how credit should be given and to whom in further work that uses it. +Keep in mind that you may or may not own your data, +depending on how it was collected, +and the best time to resolve any questions about these rights +is at the time that data collection and/or transfer agreements are signed. +Even if you cannot release data immediately or publicly, +there are often options to catalog the data without open publication. +These may take the form of metadata catalogs or embargoed releases. +Such setups allow you to hold an archival version of your data +which your publication can reference, +as well as provide information about the contents of the datasets +and how future users might request permission to access them +(even if you are not the person who can grant that permission). +They can also provide for timed future releases of datasets +once the need for exclusive access has ended. + +Data publication should release the dataset in a widely recognized format. +While software-specific datasets are acceptable accompaniments to the code +(since those precise materials are probably necessary), +you should also consider releasing generic datasets +such as CSV files with accompanying codebooks, +since these will be re-adaptable by any researcher. +Additionally, when possible, you should also release +the data collection instrument or survey used to gather the information +so that readers can understand which data components are +collected directly in the field and which are derived. +Wherever possible, you should also release the code +that constructs any derived measures, +particularly where definitions may vary, +so that others can learn from your work and adapt it as they like. 
+ \section{Publishing code for replication} + + In either a scripts folder or in the root directory, include a master script (dofile or Rscript for example). The master script should allow the reviewer From 8157278bffd6b22c8453fe833f461c10dc6a2630 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 13 Dec 2019 13:53:40 -0800 Subject: [PATCH 137/854] Data privacy --- chapters/publication.tex | 23 ++++++++++++++++++++--- 1 file changed, 20 insertions(+), 3 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 1d1a64d0a..75e2f0ce8 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -325,7 +325,21 @@ \section{Preparing a complete replication package} \section{Publishing data for replication} -Privacy +Enabling permanent access to the data used in your study +is an important contribution you can make along with the publication of results. +It allows other researchers to validate the mechanical construction of your results, +to investigate what other results might be obtained from the same population, +and test alternative approaches to other questions. +Therefore you should make clear in your study +where and how data are stored and how it might be accessed. +You do not have to publish data yourself, +although in many cases you will have the right to release +at least some subset of your analytical dataset. +You should only directly publish data which is fully de-identified +and, to the extent required to ensure reasonable privacy, +potential identifying characteristics are futher masked or removed. +In all other cases, you should contact an appropriate data catalog +to determine what privacy and licensing options are available. Make sure you have a clear understanding of the rights associated with the data release and communicate them to any future users of the data. @@ -337,9 +351,9 @@ \section{Publishing data for replication} Keep in mind that you may or may not own your data, depending on how it was collected, and the best time to resolve any questions about these rights -is at the time that data collection and/or transfer agreements are signed. +is at the time that data collection or transfer agreements are signed. Even if you cannot release data immediately or publicly, -there are often options to catalog the data without open publication. +there are often options to catalog or archive the data without open publication. These may take the form of metadata catalogs or embargoed releases. Such setups allow you to hold an archival version of your data which your publication can reference, @@ -359,6 +373,9 @@ \section{Publishing data for replication} the data collection instrument or survey used to gather the information so that readers can understand which data components are collected directly in the field and which are derived. +You should provide a clean version of the data +which corresponds exactly to the original database or instrument +as well as the constructed or derived dataset used for analysis. 
Wherever possible, you should also release the code that constructs any derived measures, particularly where definitions may vary, From d142d9470c5ddf97875e93016c0f124ddd9ae128 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 13 Dec 2019 14:04:34 -0800 Subject: [PATCH 138/854] Start code publication --- chapters/publication.tex | 23 ++++++++++++++++++++--- 1 file changed, 20 insertions(+), 3 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 75e2f0ce8..1e7b841fa 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -383,8 +383,17 @@ \section{Publishing data for replication} \section{Publishing code for replication} +Publishing code for replication has fewer legal and privacy constraints. +In most cases code will not contain identifying information; +check carefully that it does not. +Pubishing code also requires assigning a license to it; +in a majority of cases code publishers like GitHub +offer extremely permissive licensing options by default. +(If you do not provide a license, nobody can use your code!) +Make sure the code functions identically on a fresh install of your chosen software. +A new user should have no problem getting the code to execute perfectly. In either a scripts folder or in the root directory, include a master script (dofile or Rscript for example). The master script should allow the reviewer @@ -397,8 +406,16 @@ \section{Publishing code for replication} (for example, in Stata using \texttt{ssc install} or \texttt{net install} and in R include code for installing packages, including selecting a specific version of the package if necessary). +In many cases you can even directly provide the underlying code +for any user-installed packages that are needed to ensure forward-compatibility. Make sure system settings like \texttt{version}, \texttt{matsize}, and texttt{varabbrev} are set. -All outputs should clearly correspond by name to an exhibit in the paper, and vice versa. + +Finally, make sure that the code and its inputs and outputs are clearly identified. +A new user should, for example, be able to easily identify and remove +any files created by the code so that they can be recreated quickly. +They should also be able to quickly map all the outputs of the code +to the locations where they are placed in the associated published material, +such as ensuring that the raw components of figures or tables are clearly identified. +For example, outputs should clearly correspond by name to an exhibit in the paper, and vice versa. (Supplying a compiling \LaTeX\ document can support this.) -The submission package should include these outputs in the location they are produced, and -code and outputs which are not used should be removed. +Code and outputs which are not used should be removed. 
From 5aa73147520780e3ac1fe6249e48937e25210905 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 13 Dec 2019 14:11:24 -0800 Subject: [PATCH 139/854] Typos --- chapters/publication.tex | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 1e7b841fa..35b34f02b 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -250,7 +250,7 @@ \section{Preparing a complete replication package} and some even require that they are able to reproduce the results themselves before they will approve a paper for publication.\sidenote{ \url{https://www.aeaweb.org/journals/policies/data-code/}} -If your materials has been well-structured throughout the analytical process, +If your material has been well-structured throughout the analytical process, this will only require a small amount of extra work; if not, paring it down to the ``replication package'' may take some time. A complete replication package should accomplish several core functions. @@ -298,7 +298,7 @@ \section{Preparing a complete replication package} the structured directory that you are submitting, and that it can provide a stable, structured URL for your project. You can even combine more than one tool if you prefer, -as long as they clearly point to each other. https://codeocean.com +as long as they clearly point to each other. Emerging technologies such as CodeOcean\sidenote{ \url{https://codeocean.com}} offer to store both code and data, @@ -323,7 +323,7 @@ \section{Preparing a complete replication package} many organizations do not allow access to these tools, and that includes blocking staff from accessing your material. -\section{Publishing data for replication} +\subsection{Publishing data for replication} Enabling permanent access to the data used in your study is an important contribution you can make along with the publication of results. @@ -381,7 +381,7 @@ \section{Publishing data for replication} particularly where definitions may vary, so that others can learn from your work and adapt it as they like. -\section{Publishing code for replication} +\subsection{Publishing code for replication} Publishing code for replication has fewer legal and privacy constraints. In most cases code will not contain identifying information; From 03fbf0a9490eb6164f6124e51340d2386350a3e9 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 13 Dec 2019 14:18:35 -0800 Subject: [PATCH 140/854] Publishing code --- chapters/publication.tex | 15 +++++++++++++-- 1 file changed, 13 insertions(+), 2 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 35b34f02b..8b0cc8998 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -383,10 +383,21 @@ \subsection{Publishing data for replication} \subsection{Publishing code for replication} -Publishing code for replication has fewer legal and privacy constraints. +Before publishing your code, you should edit it for content and clarity +just as if it were written material. +The purpose of releasing code is to allow others to understand +exactly what you have done in order to obtain your results, +as well as to apply similar methods in future projects. +Therefore it should both be functional and readable. +Code is often not written this way when it is first prepared, +so it is important for you to review the content and organization +so that a new reader can figure out what and how your code should do. 
+Therefore, whereas your data should already be very clean at this stage, +your code is much less likely to be so, and this is where you need to make +time investments prior to releasing your replication package. +By contrast, replication code usually has few legal and privacy constraints. In most cases code will not contain identifying information; check carefully that it does not. - Pubishing code also requires assigning a license to it; in a majority of cases code publishers like GitHub offer extremely permissive licensing options by default. From 4e74eb7975affe962cc9ae207020fedb016b7af8 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 13 Dec 2019 14:39:23 -0800 Subject: [PATCH 141/854] Introduction --- chapters/publication.tex | 47 +++++++++++++++++++--------------------- 1 file changed, 22 insertions(+), 25 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 8b0cc8998..2bdbbd915 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -1,35 +1,32 @@ %------------------------------------------------ \begin{fullwidth} -Increasingly, research assistants are relied on to manage some or all -of the publication process. This can include -managing the choice of software, -coordinating referencing and bibliography, -tracking changes across various authors and versions, -and preparing final reports or papers for release or submission. -Modern software tools can make a lot of these processes easier. -Unfortunately there is some learning curve, -particularly for lead authors who have been publishing for a long time. -This chapter suggests some tools and processes -that can make writing and publishing in a team significantly easier. -It will provide resources -to judge how best to adapt your team to the tools you agree upon, -since all teams vary in composition and technical experience. - +Publishing academic research today extends well beyond writing up a Word document alone. +There are often various contributors making specialized inputs to a single output, +a large number of iterations, versions, and rervisions, +and a wide variety of raw materials and results to be published together. Ideally, your team will spend as little time as possible fussing with the technical requirements of publication. -It is in nobody's interest for a skilled and busy research assistant +It is in nobody's interest for a skilled and busy researcher to spend days re-numbering references (and it can take days) if a small amount of up-front effort could automate the task. -However, experienced academics will likely have a workflow -with which they are already comfortable, -and since they have worked with many others in the past, -that workflow is likely to be the least-common-denominator: -Microsoft Word with tracked changes. -This chapter will show you how you can avoid at least some -of the pain of Microsoft Word, -while still providing materials in the format -that co-authors prefer and journals request. +In this section we suggest several methods -- +collectively refered to as ``dynamic documents'' -- +for managing the process of collaboration on any technical product. + +For most research projects, completing a written piece is not the end of the task. +In almost all cases, you will be required to release a replication package, +which contains the code and materials needed to create the results. 
+These represent an intellectual contribution in their own right, +because they enable others to learn from your process +and better understand the results you have obtained. +Holding code and data to the same standards a written work +is a new discpline for many researchers, +and here we provide some basic guidelines and basic responsibilities for both +that will help you to prepare a functioning and informative replication package. +In all cases, we note that technology is rapidly evolving +and that the specific tools noted here may not remain cutting-edge, +but the core principles of materials publication and transparency will endure. \end{fullwidth} %------------------------------------------------ From 14e3c204449abc6c5cd91981a07c91ce4af77b05 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 13 Dec 2019 15:24:15 -0800 Subject: [PATCH 142/854] Fix #287 --- chapters/publication.tex | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 2bdbbd915..bb008e6b6 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -130,7 +130,8 @@ \subsection{Technical writing with \LaTeX} Unfortunately, \LaTeX\ can be a challenge to set up and use at first, particularly if you are new to working with plain text code and file management. -\LaTeX\ requires that all formatting be done in its special code language, +It is also unfortunately weak with spelling and grammar checking. +This is because \LaTeX\ requires that all formatting be done in its special code language, and it is not particularly informative when you do something wrong. This can be off-putting very quickly for people who simply want to get to writing, like senior researchers. From 37abec7aa546e2c1e123f7b34cdc98ccf1536fc8 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 13 Dec 2019 15:24:57 -0800 Subject: [PATCH 143/854] Fix #286 --- chapters/publication.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index bb008e6b6..36b217dd1 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -141,7 +141,7 @@ \subsection{Technical writing with \LaTeX} \url{https://atom.io/packages/atom-latex}} offer the most flexibility to work with \LaTeX\ on your computer, such as advanced integration with Git, -the entire team needs to be comfortable +the entire group of writers needs to be comfortable with \LaTeX\ before adopting one of these tools. They can require a lot of troubleshooting at a basic level at first, and non-technical staff may not be willing or able to acquire the required knowledge. From 6a98d329d3b63e44da255441d9d0d864cea0a339 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 13 Dec 2019 15:55:30 -0800 Subject: [PATCH 144/854] Subsection replication package --- chapters/publication.tex | 2 ++ 1 file changed, 2 insertions(+) diff --git a/chapters/publication.tex b/chapters/publication.tex index 36b217dd1..605e7990f 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -258,6 +258,8 @@ \section{Preparing a complete replication package} The code should exactly reproduce the raw outputs you have used for the paper, and should include no extraneous documentation or PII data you would not share publicly. +\subsection{Releasing a replication package} + If you are at this stage, all you need to do is find a place to publish your materials. 
This is slightly easier said than done, From 44322be39855cc20c8fe1c294351bf52360d7631 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 13 Dec 2019 15:56:08 -0800 Subject: [PATCH 145/854] Move to end --- chapters/publication.tex | 130 +++++++++++++++++++-------------------- 1 file changed, 65 insertions(+), 65 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 605e7990f..1f0373357 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -258,71 +258,6 @@ \section{Preparing a complete replication package} The code should exactly reproduce the raw outputs you have used for the paper, and should include no extraneous documentation or PII data you would not share publicly. -\subsection{Releasing a replication package} - -If you are at this stage, -all you need to do is find a place to publish your materials. -This is slightly easier said than done, -as there are a few variables to take into consideration -and no global consensus on the best solution. -The technologies available are likely to change dramatically -over the next few years; -the specific solutions we mention here highlight some current approaches -as well as their strengths and weaknesses. -GitHub provides one solution. -Making your GitHub repository public -is completely free for finalized projects. -The site can hold any file types, -provide a structured download of your whole project, -and allow others to look at alternate versions or histories easily. -It is straightforward to simply upload a fixed directory to GitHub -apply a sharing license, and obtain a URL for the whole package. - -However, GitHub is not ideal for other reasons. -It is not built to hold data in an efficient way -or to manage licenses or citations for datasets. -It does not provide a true archive service -- -you can change or remove the contents at any time. -A repository such as the Harvard Dataverse\sidenote{ - \url{https://dataverse.harvard.edu}} -addresses these issues, but is a poor place to store code. -The Open Science Framework\sidenote{ - \url{https://osf.io}} -provides a balanced implementation -that holds both code and data (as well as simple version histories), -as does ResearchGate\sidenote{ - \url{https://https://www.researchgate.net}} -(both of which can also assign a permanent digital object identifier link for your work). -Any of these locations is acceptable -- -the main requirement is that the system can handle -the structured directory that you are submitting, -and that it can provide a stable, structured URL for your project. -You can even combine more than one tool if you prefer, -as long as they clearly point to each other. -Emerging technologies such as CodeOcean\sidenote{ - \url{https://codeocean.com}} -offer to store both code and data, -and also provide an online workspace in which others -can execute and modify your code -without having to download your tools and match your local environment -when packages and other underlying softwares may have changed since publication. - -In addition to the code and data, -you may also want to release an author's copy or preprint -of the article itself along with these raw materials. -Check with your publisher before doing so; -not all journals will accept material that has been released. -Therefore you may need to wait until acceptance is confirmed. 
-This can be done on a number of preprint websites, -many of which are topic-specific.\sidenote{ - \url{https://en.wikipedia.org/wiki/ArXiv}} -You can also use GitHub and link to the PDF file directly -on your personal website or whatever medium you are -sharing the preprint through. -Do not use Dropbox or Google Drive for this purpose: -many organizations do not allow access to these tools, -and that includes blocking staff from accessing your material. - \subsection{Publishing data for replication} Enabling permanent access to the data used in your study @@ -430,3 +365,68 @@ \subsection{Publishing code for replication} For example, outputs should clearly correspond by name to an exhibit in the paper, and vice versa. (Supplying a compiling \LaTeX\ document can support this.) Code and outputs which are not used should be removed. + +\subsection{Releasing a replication package} + +If you are at this stage, +all you need to do is find a place to publish your materials. +This is slightly easier said than done, +as there are a few variables to take into consideration +and no global consensus on the best solution. +The technologies available are likely to change dramatically +over the next few years; +the specific solutions we mention here highlight some current approaches +as well as their strengths and weaknesses. +GitHub provides one solution. +Making your GitHub repository public +is completely free for finalized projects. +The site can hold any file types, +provide a structured download of your whole project, +and allow others to look at alternate versions or histories easily. +It is straightforward to simply upload a fixed directory to GitHub +apply a sharing license, and obtain a URL for the whole package. + +However, GitHub is not ideal for other reasons. +It is not built to hold data in an efficient way +or to manage licenses or citations for datasets. +It does not provide a true archive service -- +you can change or remove the contents at any time. +A repository such as the Harvard Dataverse\sidenote{ + \url{https://dataverse.harvard.edu}} +addresses these issues, but is a poor place to store code. +The Open Science Framework\sidenote{ + \url{https://osf.io}} +provides a balanced implementation +that holds both code and data (as well as simple version histories), +as does ResearchGate\sidenote{ + \url{https://https://www.researchgate.net}} +(both of which can also assign a permanent digital object identifier link for your work). +Any of these locations is acceptable -- +the main requirement is that the system can handle +the structured directory that you are submitting, +and that it can provide a stable, structured URL for your project. +You can even combine more than one tool if you prefer, +as long as they clearly point to each other. +Emerging technologies such as CodeOcean\sidenote{ + \url{https://codeocean.com}} +offer to store both code and data, +and also provide an online workspace in which others +can execute and modify your code +without having to download your tools and match your local environment +when packages and other underlying softwares may have changed since publication. + +In addition to the code and data, +you may also want to release an author's copy or preprint +of the article itself along with these raw materials. +Check with your publisher before doing so; +not all journals will accept material that has been released. +Therefore you may need to wait until acceptance is confirmed. 
+This can be done on a number of preprint websites, +many of which are topic-specific.\sidenote{ + \url{https://en.wikipedia.org/wiki/ArXiv}} +You can also use GitHub and link to the PDF file directly +on your personal website or whatever medium you are +sharing the preprint through. +Do not use Dropbox or Google Drive for this purpose: +many organizations do not allow access to these tools, +and that includes blocking staff from accessing your material. From fd5ee6e6be9d1de841f34284478a80401d2c42d6 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 13 Dec 2019 16:38:33 -0800 Subject: [PATCH 146/854] Appendix introduction and comments --- appendix/stata-guide.tex | 81 +++++++++++++++++++++++++++------------- code/stata-comments.do | 5 ++- 2 files changed, 59 insertions(+), 27 deletions(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index 48645affa..d4bd04260 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -2,24 +2,36 @@ \begin{fullwidth} -Most academic programs that prepare students for a career in the type of work discussed in this book -spend a disproportionately small amount of time teaching their students coding skills, in relation to the share of -their professional time they will spend writing code their first years after graduating. Recent -Masters' program graduates that have joined our team tended to have very good knowledge in the theory of our -trade, but tended to require a lot of training in its practical skills. To us, it is like hiring architects -that can sketch, describe, and discuss the concepts and requirements of a new building very well, but do -not have the technical skill set to actually contribute to a blueprint using professional standards that can be used -and understood by other professionals during construction. The reasons for this are probably a topic -for another book, but in today's data-driven world, people working in quantitative economics research -must be proficient programmers, and that includes more than being able to compute the correct numbers. - -This appendix first has a short section with instructions on how to access and use the code shared in -this book. The second section contains a the current DIME Analytics style guide for Stata code. +Most academic programs that prepare students for a career +in the type of work discussed in this book +spend a disproportionately small amount of time teaching their students coding skills +in relation to the share of their professional time they will spend writing code +their first years after graduating. +Recent Masters' program graduates that have joined our team +tended to have very good knowledge in the theory of our +trade, but tended to require a lot of training in its practical skills. +To us, it is like hiring architects that can sketch, describe, and discuss +the concepts and requirements of a new building very well, +but who do not have the technical skillset +to actually contribute to a blueprint using professional standards +that can be used and understood by other professionals during construction. +The reasons for this are probably a topic for another book, +but in today's data-driven world, +people working in quantitative economics research must be proficient programmers, +and that includes more than being able to compute the correct numbers. + +This appendix begins with a short section with instructions +on how to access and use the code examples shared in this book. 
+The second section contains a the current DIME Analytics style guide for Stata code. +No matter your technical proficiency in writing Stata code, +we believe these resources can help any person write more understandable code. Widely accepted and used style guides are common in most programming languages, and we think that using such a style guide greatly improves the quality -of research projects coded in Stata. We hope that this guide can help to increase the emphasis in the Stata community on using, -improving, sharing and standardizing code style. Style guides are the most important tool in how -you, like an architect, draw a blueprint that can be understood and used by everyone in your trade. +of research projects coded in Stata. +We hope that this guide can help to increase the emphasis in the Stata community +on using, improving, sharing and standardizing Stata code style. +Style guides are the most important tool in how you, like an architect, +can draw a blueprint that can be understood and used by everyone in your trade. \end{fullwidth} @@ -102,25 +114,44 @@ \subsection{Why we use a Stata style guide} \section{The DIME Analytics Stata style guide} -While this section is called a \textit{Stata} Style Guide, many of these practices are agnostic to which -programming language you are using: best practices often relate to concepts that are common across many -languages. If you are coding in a different language, then you might still use many of the guidelines -listed in this section, but you should use your judgment when doing so. +While this section is called a \textit{Stata} Style Guide, +many of these practices are agnostic to which programming language you are using: +best practices often relate to concepts that are common across many languages. +If you are coding in a different language, +then you might still use many of the guidelines listed in this section, +but you should use your judgment when doing so. All style rules introduced in this section are the way we suggest to code, but the most important thing is that the way you style your code is \textit{consistent}. This guide allows our team to have a consistent code style. \subsection{Commenting code} -Comments do not change the output of code, but without them, your code will not be accessible to your colleagues. +Comments do not change the output of code, but without them, +your code will not be accessible to your colleagues. It will also take you a much longer time to edit code you wrote in the past if you did not comment it well. -So, comment a lot: do not only write \textit{what} your code is doing but also \textit{why} you wrote it like that. +So, comment a lot: do not only write \textit{what} your code is doing +but also \textit{why} you wrote it like the way you did. +As a corollary, try to write simpler code that needs less explanation, +even if you could use an elegant and complex method in less space, +unless the advanced method is a widely accepted one. There are three types of comments in Stata and they have different purposes: \begin{enumerate} - \item \texttt{/* */} indicates narrative, multi-line comments at the beginning of files or sections. - \item \texttt{*} indicates a change in task or a code sub-section and should be multi-line only if necessary. - \item \texttt{//} is used for inline clarification a single line of code. + \item \texttt{/*} + + \texttt{COMMENT} + + \texttt{*/} + + is used to insert narrative, multi-line comments at the beginning of files or sections. 
+ \item \texttt{* COMMENT} + + \texttt{* POSSIBLY MORE COMMENT} + + indicates a change in task or a code sub-section and should be multi-line only if necessary. + \item \texttt{// COMMENT} + + is used for inline clarification after a single line of code. \end{enumerate} \codeexample{stata-comments.do}{./code/stata-comments.do} diff --git a/code/stata-comments.do b/code/stata-comments.do index e73c67115..81fa7a1cb 100644 --- a/code/stata-comments.do +++ b/code/stata-comments.do @@ -4,8 +4,9 @@ section of it */ -* Standardize settings (This comment is used to document a task -* covering at maximum a few lines of code) +* Standardize settings, explicitly set the version, and +* clear all previous information from memory +* (This comment is used to document a task covering at maximum a few lines of code) ieboilstart, version(13.1) `r(version)' From 82eb9ee3a955655e3d7523f1c618b17084ddbc3b Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 13 Dec 2019 16:50:09 -0800 Subject: [PATCH 147/854] Loops and varabbrev --- appendix/stata-guide.tex | 8 ++++++++ code/stata-loops.do | 31 +++++++++++++++++++------------ 2 files changed, 27 insertions(+), 12 deletions(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index d4bd04260..e5124ed7a 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -191,6 +191,14 @@ \subsection{Abbreviating commands} \end{tabular} \end{center} +\subsection{Abbreviating variables} + +Never abbreviate variable names. Instead, write them out completely. +Your code may change if a variable is later introduced +that has a name exactly as in the abbreviation. +\texttt{ieboilstart} executes the command \texttt{set varabbrev off} by default, +and will therefore break any code using variable abbreviations. + \subsection{Writing loops} In Stata examples and other code languages, it is common that the name of the local generated by \texttt{foreach} or \texttt{forvalues} diff --git a/code/stata-loops.do b/code/stata-loops.do index 7863ce107..7091a3a6b 100644 --- a/code/stata-loops.do +++ b/code/stata-loops.do @@ -1,17 +1,24 @@ -* This is BAD - foreach i in potato cassava maize { - } +BAD: -* These are GOOD - foreach crop in potato cassava maize { - } +* Loop over crops +foreach i in potato cassava maize { + do something to `i' +} - * or +GOOD: - local crops potato cassava maize - * Loop over crops - foreach crop of local crops { +* Loop over crops +foreach crop in potato cassava maize { + do something to `crop' +} + +GOOD: + +* Loop over crops +local crops potato cassava maize + foreach crop of local crops { * Loop over plot number forvalues plot_num = 1/10 { - } - } + do something to `crop' in `plot_num' + } // End plot loop + } // End crop loop From 88ee55a4c8c8e49387189d9a299fc534be56c1d4 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 13 Dec 2019 16:51:07 -0800 Subject: [PATCH 148/854] Comments notes --- code/stata-comments.do | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/code/stata-comments.do b/code/stata-comments.do index 81fa7a1cb..bc9a5d374 100644 --- a/code/stata-comments.do +++ b/code/stata-comments.do @@ -1,14 +1,20 @@ +TYPE 1: + /* This is a do-file with examples of comments in Stata. 
This type of comment is used to document all of the do-file or a large section of it */ +TYPE 2: + * Standardize settings, explicitly set the version, and * clear all previous information from memory * (This comment is used to document a task covering at maximum a few lines of code) ieboilstart, version(13.1) `r(version)' +TYPE 3: + * Open the dataset sysuse auto.dta // Built in dataset (This comment is used to document a single line) From 84f747c7f77ac217bb284493e5c8c954997b80a6 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 13 Dec 2019 16:59:25 -0800 Subject: [PATCH 149/854] Whitespace --- appendix/stata-guide.tex | 33 +++++++++++++++------------- code/stata-whitespace-columns.do | 24 ++++++++++---------- code/stata-whitespace-indentation.do | 6 +++-- 3 files changed, 35 insertions(+), 28 deletions(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index e5124ed7a..eb4e81e5c 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -216,27 +216,30 @@ \subsection{Writing loops} \subsection{Using whitespace} -In Stata, one space or many spaces does not make a difference, and this can be used to make the code -much more readable. In the example below the exact same code is written twice, but in the good example whitespace -is used to signal to the reader that the central object of this segment of code is the variable -\texttt{employed}. Organizing the code like this makes the code much quicker to read, and small typos +In Stata, one space or many spaces does not make a difference to code execution, +and this can be used to make the code much more readable. +We are all very well trained in using whitespace in software like PowerPoint and Excel: +we would never present a PowerPoint presentation where the text does not align +or submit an Excel table with unstructured rows and columns, and the same principles apply to coding. +In the example below the exact same code is written twice, +but in the better example whitespace is used to signal to the reader +that the central object of this segment of code is the variable \texttt{employed}. +Organizing the code like this makes the code much quicker to read, and small typos stand out much more, making them easier to spot. -We are all very well trained in using whitespace in software like PowerPoint and Excel: we would never -present a PowerPoint presentation where the text does not align or submit an Excel table with unstructured -rows and columns, and the same principles apply to coding. - \codeexample{stata-whitespace-columns.do}{./code/stata-whitespace-columns.do} -Indentation is another type of whitespace that makes code more readable. Any segment of code that is -repeated in a loop or conditional on an if-statement should have indentation of 4 spaces relative +Indentation is another type of whitespace that makes code more readable. +Any segment of code that is repeated in a loop or conditional on an +\texttt{if}-statement should have indentation of 4 spaces relative to both the loop or conditional statement as well as the closing curly brace. Similarly, continuing lines of code should be indented under the initial command. -If a segment is in a loop inside a loop, then it should be indented another 4 spaces, making it 8 spaces more -indented than the main code. In some code editors this indentation can be achieved by using the tab button on -your keyboard. 
However, the type of tab used in the Stata do-file editor does not always display the same across -platforms, such as when publishing the code on GitHub. Therefore we recommend that indentation be 4 manual spaces -instead of a tab. +If a segment is in a loop inside a loop, then it should be indented another 4 spaces, +making it 8 spaces more indented than the main code. +In some code editors this indentation can be achieved by using the tab button. +However, the type of tab used in the Stata do-file editor does not always display the same across platforms, +such as when publishing the code on GitHub. +Therefore we recommend that indentation be 4 manual spaces instead of a tab. \codeexample{stata-whitespace-indentation.do}{./code/stata-whitespace-indentation.do} diff --git a/code/stata-whitespace-columns.do b/code/stata-whitespace-columns.do index 81a09eb2d..adcb8eec3 100644 --- a/code/stata-whitespace-columns.do +++ b/code/stata-whitespace-columns.do @@ -1,15 +1,17 @@ -* This is BAD +ACCEPTABLE: + * Create dummy for being employed generate employed = 1 - replace employed = 0 if (_merge == 2) - label variable employed "Person exists in employment data" - label define yesno 1 "Yes" 0 "No" - label value employed yesno + replace employed = 0 if (_merge == 2) + lab var employed "Person exists in employment data" + lab def yesno 1 "Yes" 0 "No" + lab val employed yesno + +BETTER: -* This is GOOD * Create dummy for being employed - generate employed = 1 - replace employed = 0 if (_merge == 2) - label variable employed "Person exists in employment data" - label define yesno 1 "Yes" 0 "No" - label value employed yesno + generate employed = 1 + replace employed = 0 if (_merge == 2) + lab var employed "Person exists in employment data" + lab def yesno 1 "Yes" 0 "No" + lab val employed yesno diff --git a/code/stata-whitespace-indentation.do b/code/stata-whitespace-indentation.do index 5c72f688a..3e6ebb406 100644 --- a/code/stata-whitespace-indentation.do +++ b/code/stata-whitespace-indentation.do @@ -1,4 +1,5 @@ -* This is GOOD +GOOD: + * Loop over crops foreach crop in potato cassava maize { * Loop over plot number @@ -16,7 +17,8 @@ gen use_sample = 1 } -* This is BAD +BAD: + * Loop over crops foreach crop in potato cassava maize { * Loop over plot number From 6939f243cdb4e3561b446daa58ab16d07e2a583c Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 13 Dec 2019 17:07:34 -0800 Subject: [PATCH 150/854] Conditional expressions --- appendix/stata-guide.tex | 12 ++++++++---- code/stata-conditional-expressions1.do | 6 ++++-- code/stata-conditional-expressions2.do | 6 +++--- 3 files changed, 15 insertions(+), 9 deletions(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index eb4e81e5c..ca84b9706 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -247,13 +247,17 @@ \subsection{Writing conditional expressions} All conditional (true/false) expressions should be within at least one set of parentheses. The negation of logical expressions should use bang (\texttt{!}) and not tilde (\texttt{\~}). +Always use explicit truth checks (\texttt{if `value'==1}) rather than implicits (\texttt{if `value'}). +Always use the \texttt{missing(`var')} function instead of arguments like (\texttt{if `var'<=.}), +and always consider whether missing values will affect the evaluation conditional expressions. 
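To illustrate why missing values deserve explicit attention, consider a made-up numeric variable income: Stata treats missing values as larger than any number, so an implicit greater-than condition silently classifies missing observations as ``high''.

* BAD: observations with missing income are coded as high_income == 1,
* because missing values are treated as larger than any number
    generate high_income = (income > 1000)

* GOOD: explicitly exclude missing values from the comparison
    generate high_income = (income > 1000) if !missing(income)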
\codeexample{stata-conditional-expressions1.do}{./code/stata-conditional-expressions1.do} -You should also always use \texttt{if-else} statements when applicable even if you can express the same -thing with two separate \texttt{if} statements. When using \texttt{if-else} statements you are -communicating to anyone reading your code that the two cases are mutually exclusive in an \texttt{if-else} statement -which makes your code more readable. It is also less error-prone and easier to update. +Use \texttt{if-else} statements when applicable +even if you can express the same thing with two separate \texttt{if} statements. +When using \texttt{if-else} statements you are communicating to anyone reading your code +that the two cases are mutually exclusive which makes your code more readable. +It is also less error-prone and easier to update if you want to change the condition. \codeexample{stata-conditional-expressions2.do}{./code/stata-conditional-expressions2.do} diff --git a/code/stata-conditional-expressions1.do b/code/stata-conditional-expressions1.do index 37bd117b2..ed0b75f59 100644 --- a/code/stata-conditional-expressions1.do +++ b/code/stata-conditional-expressions1.do @@ -1,7 +1,9 @@ -* These examples are GOOD +GOOD: + replace gender_string = "Female" if (gender == 1) replace gender_string = "Male" if ((gender != 1) & !missing(gender)) -* These examples are BAD +BAD: + replace gender_string = "Female" if gender == 1 replace gender_string = "Male" if (gender ~= 1) diff --git a/code/stata-conditional-expressions2.do b/code/stata-conditional-expressions2.do index 8262ea6e3..bd47d48c3 100644 --- a/code/stata-conditional-expressions2.do +++ b/code/stata-conditional-expressions2.do @@ -1,12 +1,12 @@ - local sampleSize = _N // Get the number of observations in dataset +GOOD: -* This example is GOOD if (`sampleSize' <= 100) { } else { } -* This example is BAD +BAD: + if (`sampleSize' <= 100) { } if (`sampleSize' > 100) { From 2634d1312ce8ca7efdd2099fc777732e49d7773d Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Sat, 14 Dec 2019 17:13:01 -0800 Subject: [PATCH 151/854] Macros --- appendix/stata-guide.tex | 28 +++++++++++++++++++++------- code/stata-macros.do | 27 ++++++++++++--------------- 2 files changed, 33 insertions(+), 22 deletions(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index ca84b9706..f304c60aa 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -263,17 +263,31 @@ \subsection{Writing conditional expressions} \subsection{Using macros} -Stata has several types of \textbf{macros} where numbers or text can be stored temporarily, but the two most common -macros are \textbf{local} and \textbf{global}. Locals should always be the default type and globals should only -be used when the information stored is used in a different do-file. Globals are error-prone since they are -active as long as Stata is open, which creates a risk that a global from one project is incorrectly used in -another, so only use globals where they are necessary. Our recommendation is that globals should only be defined in the \textbf{master do-file}. -All globals should be referenced using both the the dollar sign and the curly brackets around their name; +Stata has several types of \textbf{macros} where numbers or text can be stored temporarily, +but the two most common macros are \textbf{local} and \textbf{global}. +All macros should be defined using the \texttt{=} operator. +Never abbreviate the commands for \textbf{local} and \textbf{global}. 
+Locals should always be the default type and globals should only +be used when the information stored is used in a different do-file. +Globals are error-prone since they are active as long as Stata is open, +which creates a risk that a global from one project is incorrectly used in another, +so only use globals where they are necessary. +Our recommendation is that globals should only be defined in the \textbf{master do-file}. +All globals should be referenced using both the the dollar sign and curly brackets around their name (\texttt{\$\{\}}); otherwise, they can cause readability issues when the endpoint of the macro name is unclear. +You should use descriptive names for all macros (up to 32 characters; prefer fewer). +Simple prefixes are useful and encouraged such as \texttt{thisParam}, \texttt{allParams}, +\texttt{theLastParam}, \texttt{allParams}, or \texttt{nParams}. There are several naming conventions you can use for macros with long or multi-word names. Which one you use is not as important as whether you and your team are consistent in how you name then. -You can use all lower case (\texttt{mymacro}), underscores (\texttt{my\_macro}), or ``camel case'' (\texttt{myMacro}), as long as you are consistent. +You can use all lower case (\texttt{mymacro}), underscores (\texttt{my\_macro}), +or ``camel case'' (\texttt{myMacro}), as long as you are consistent. +Nested locals are also possible for a variety of reasons when looping. +Finally, if you need a macro to hold a literal macro name, +it can be done using the backslash escape character; +this causes the stored macro to be evaluated +at the usage of the macro rather than at its creation. \codeexample{stata-macros.do}{./code/stata-macros.do} diff --git a/code/stata-macros.do b/code/stata-macros.do index e232b0e40..e421e75ae 100644 --- a/code/stata-macros.do +++ b/code/stata-macros.do @@ -1,20 +1,17 @@ -* Define a local and a global using the same name convention - local myLocal "A string local" - global myGlobal "A string global" +GOOD: + + global myGlobal = "A string global" + local myLocal1 = length("${myGlobal}") + local myLocal2 = "\${myGlobal}" -* Reference the local and the global macros - display "`myLocal'" display "${myGlobal}" + global myGlobal = "A different string" -* Escape character. 
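One everyday use of locals, sketched here with invented variable names, is to give a descriptive name to a list of control variables that is defined once and reused in every specification, so that updating the control set only requires editing one line.

* Define the control set once, with a descriptive name
    local controlVars = "age education hhsize"

* Reuse the same local in every specification
    regress outcome treatment `controlVars'
    regress outcome treatment `controlVars' i.district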
If backslashes are used just before a local -* or a global then two backslashes must be used - local myFolderLocal "Documents" - local myFolderGlobal "Documents" + forvalues i = 1/2 { + display "`myLocal`i''" + } -* These are BAD - display "C:\Users\username\`myFolderLocal'" - display "C:\Users\username\${myFolderGlobal}" +BAD: -* These are GOOD - display "C:\Users\username\\`myFolderLocal'" - display "C:\Users\username\\${myFolderGlobal}" + global myglobal "A string global" + local my_Local = length($myGlobal) From 16b712a5914842b326da58bc03e4c8be680168e3 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Sat, 14 Dec 2019 17:24:50 -0800 Subject: [PATCH 152/854] Filepaths --- appendix/stata-guide.tex | 44 ++++++++++++++++++++++++---------------- code/stata-filepaths.do | 15 +++++++------- 2 files changed, 34 insertions(+), 25 deletions(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index f304c60aa..dcacb3520 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -293,25 +293,35 @@ \subsection{Using macros} \subsection{Writing file paths} -All file paths should be absolute and dynamic, should always be enclosed in double quotes, and should -always use forward slashes for folder hierarchies (\texttt{/}), since Mac and Linux computers cannot read file paths with backslashes. -File paths should also always include the file extension (\texttt{.dta}, \texttt{.do}, \texttt{.csv}, etc.), -since to omit the extension causes ambiguity if another file with the same name is created (even if there is a default). - -\textbf{Absolute file paths} means that all file paths must start at the root folder of the computer, for -example \texttt{C:/} on a PC or \texttt{/Users/} on a Mac. This makes sure that -you always get the correct file in the correct folder. \textbf{We never use \texttt{cd}.} -We have seen many cases when using \texttt{cd} where -a file has been overwritten in another project folder where \texttt{cd} was currently pointing to. -Relative file paths are common in many other programming languages, but there they are relative to the -location of the file running the code, and then there is no risk that a file is saved in a completely different folder. +All file paths should be absolute and dynamic, +should always be enclosed in double quotes, +and should \textbf{always use forward slashes} for folder hierarchies (\texttt{/}). +File names should be written in lower case with dashes (\texttt{my-file.dta}). +Mac and Linux computers cannot read file paths with backslashes, +and backslashes cannot be removed with find-and-replace. +File paths should also always include the file extension +(\texttt{.dta}, \texttt{.do}, \texttt{.csv}, etc.). +Omitting the extension causes ambiguity +if another file with the same name is created +(even if there is a default file type). + +\textbf{Absolute} means that all file paths start at the root folder of the computer, +often \texttt{C:/} on a PC or \texttt{/Users/} on a Mac. +This makes ensures that you always get the correct file in the correct folder. +\textbf{Do not use \texttt{cd} unless there is a function that \textit{requires} it.} +When using \texttt{cd}, it is easy to overwrite a file in another project folder. +Many Stata functions use \texttt{cd} and therefore the current directory may change without warning. +Relative file paths are common in many other programming languages, +but there they are always relative to the location of the file running the code. Stata does not provide this functionality. 
-\textbf{Dynamic file paths} use globals that are set in a central master do-file to dynamically build your file -paths. This has the same function in practice as setting \texttt{cd}, as all new users should only have to change these -file path globals in one location. But dynamic absolute file paths are a better practice since if the -global names are set uniquely there is no risk that files are saved in the incorrect project folder, and -you can create multiple folder globals instead of just one location as with \texttt{cd}. +\textbf{Dynamic} file paths use global macros for the location of the root folder. +These globals should be set in a central master do-file. +This makes it possible to write file paths that work very similarly to relative paths. +This achieves the functionality that setting \texttt{cd} is often intended to: +executing the code on a new system only requires updating file path globals in one location. +If global names are unique, there is no risk that files are saved in the incorrect project folder. +You can create multiple folder globals as needed and this is encouraged. \codeexample{stata-filepaths.do}{./code/stata-filepaths.do} diff --git a/code/stata-filepaths.do b/code/stata-filepaths.do index 36393dea8..36ab067ac 100644 --- a/code/stata-filepaths.do +++ b/code/stata-filepaths.do @@ -1,15 +1,14 @@ -* Dynamic, absolute file paths +GOOD: - * Dynamic (and absolute) - GOOD - global myDocs "C:/Users/username/Documents" - global myProject "${myDocs}/MyProject" - use "${myProject}/MyDataset.dta" + global myDocs = "C:/Users/username/Documents" + global myProject = "${myDocs}/MyProject" + use "${myProject}/my-dataset.dta" , clear -* Relative and absolute file paths +BAD: - * Relative - BAD + RELATIVE PATHS: cd "C:/Users/username/Documents/MyProject" use MyDataset.dta - * Absolute but not dynamic - BAD + STATIC PATHS: use "C:/Users/username/Documents/MyProject/MyDataset.dta" From f3171a78812fb5f8c920819ab347f5b2629fc1d5 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Sat, 14 Dec 2019 17:34:15 -0800 Subject: [PATCH 153/854] Line breaks --- appendix/stata-guide.tex | 25 +++++++++++++++---------- code/stata-linebreak.do | 15 ++++++++++----- 2 files changed, 25 insertions(+), 15 deletions(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index dcacb3520..833e66195 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -325,19 +325,24 @@ \subsection{Writing file paths} \codeexample{stata-filepaths.do}{./code/stata-filepaths.do} -\subsection{Placing line breaks} +\subsection{Line breaks} Long lines of code are difficult to read if you have to scroll left and right to see the full line of code. -When your line of code is wider than text on a regular paper you should introduce a line break. -A common line breaking length is around 80 characters, and Stata and other code editors -provide a visible ``guide line'' to tell you when you should start a new line using \texttt{///}. -This breaks the line in the code editor while telling Stata that the same line of code continues on the next row in the -code editor. Using the \texttt{\#delimit} command is only intended for advanced programming -and is discouraged for analytical code in an article in Stata's official journal.\cite{cox2005styleguide} -These do not need to be horizontally aligned in code unless they have comments, +When your line of code is wider than text on a regular paper, you should introduce a line break. +A common line breaking length is around 80 characters. 
+Stata and other code editors provide a visible ``guide line''. +Around that length, start a new line using \texttt{///}. +You can and should write comments after \texttt{///} just as with \texttt{//}. +(The \texttt{\#delimit} command is only acceptable for advanced function programming +and is officially discouraged in analytical code.\cite{cox2005styleguide} +Never, for any reason, use \texttt{/* */} to wrap a line.) +The \texttt{///} breaks the line in the code editor, +while telling Stata that the same line of code continues on the next line. +The \texttt{///} breaks do not need to be horizontally aligned in code, +although you may prefer to if they have comments that read better aligned, since indentations should reflect that the command continues to a new line. -Line breaks and indentations can similarly be used to highlight the placement -of the \textbf{option comma} in Stata commands. +Line breaks and indentations may be used to highlight the placement +of the \textbf{option comma} or other functional syntax in Stata commands. \codeexample{stata-linebreak.do}{./code/stata-linebreak.do} diff --git a/code/stata-linebreak.do b/code/stata-linebreak.do index 51d112650..1fd87c20e 100644 --- a/code/stata-linebreak.do +++ b/code/stata-linebreak.do @@ -1,11 +1,12 @@ -* This is GOOD - graph hbar /// - invil if (priv == 1) /// - , over(statename, sort(1) descending) blabel(bar, format(%9.0f)) /// +GOOD: + graph hbar invil /// Proportion in village + if (priv == 1) /// Private facilities only + , over(statename, sort(1) descending) /// Order states by values + blabel(bar, format(%9.0f)) /// Label the bars ylab(0 "0%" 25 "25%" 50 "50%" 75 "75%" 100 "100%") /// ytit("Share of private primary care visits made in own village") -* This is BAD +BAD: #delimit ; graph hbar invil if (priv == 1) @@ -13,3 +14,7 @@ ylab(0 "0%" 25 "25%" 50 "50%" 75 "75%" 100 "100%") ytit("Share of private primary care visits made in own village"); #delimit cr + +UGLY: + graph hbar /* +*/ invil if (priv == 1) From 22b02a3b926266184dda90fc4d31039c9123a544 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Sat, 14 Dec 2019 17:42:46 -0800 Subject: [PATCH 154/854] Boilerplate --- appendix/stata-guide.tex | 25 ++++++++++++++++++------- code/stata-boilerplate.do | 9 ++++++--- 2 files changed, 24 insertions(+), 10 deletions(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index 833e66195..f95128c2e 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -348,13 +348,24 @@ \subsection{Line breaks} \subsection{Using boilerplate code} -Boilerplate code is a type of code that always comes at the top of the code file, and its purpose is to harmonize -the settings across users running the same code to the greatest degree possible. There is no way in Stata to -guarantee that any two installations of Stata will always run code in exactly the same way. In the vast -majority of cases it does, but not always, and boilerplate code can mitigate that risk (although not eliminate -it). We have developed a command that runs many commonly used boilerplate settings that are optimized given -your installation of Stata. It requires two lines of code to execute the \texttt{version} -setting that avoids difference in results due to different versions of Stata. +Boilerplate code is a few lines of code that always comes at the top of the code file, +and its purpose is to harmonize settings across users running the same code to the greatest degree possible. 
There is no way in Stata to guarantee that any two installations of Stata +will always run code in exactly the same way. +In the vast majority of cases it does, but not always, +and boilerplate code can mitigate that risk (although not eliminate it). +We have developed a command that runs many commonly used boilerplate settings +that are optimized given your installation of Stata. +It requires two lines of code to execute the \texttt{version} +setting that avoids differences in results due to different versions of Stata. +Among other things, it turns the \texttt{more} flag off so code never hangs; +it turns \texttt{varabbrev} off so abbrevated variable names are rejected; +and it maximizes the allowed memory usage and matrix size +so that code is not rejected on other machines for violating system limits. +(Again, other software versions, such as Small Stata and outdated versions, +have lower limits and it may not be able to run newer code in them.) +Finally, it clears all stored information in Stata memory, +such as non-installed programs and globals, +so it gets as close as possible to opening Stata fresh. \codeexample{stata-boilerplate.do}{./code/stata-boilerplate.do} diff --git a/code/stata-boilerplate.do b/code/stata-boilerplate.do index 16df65235..52a91eb92 100644 --- a/code/stata-boilerplate.do +++ b/code/stata-boilerplate.do @@ -1,8 +1,11 @@ -* This is GOOD +GOOD: + ieboilstart, version(13.1) `r(version)' -* This is GOOD but less GOOD - set more off +OK: + + set more off , perm + clear all set maxvar 10000 version 13.1 From eb73f509cd13af45cd14a23ac27c51b9df5d7a24 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Sat, 14 Dec 2019 17:46:48 -0800 Subject: [PATCH 155/854] Before saving --- appendix/stata-guide.tex | 26 ++++++++++++++++---------- code/stata-before-saving.do | 26 +++++++++++++++----------- 2 files changed, 31 insertions(+), 21 deletions(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index f95128c2e..24c8ca355 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -371,16 +371,22 @@ \subsection{Using boilerplate code} \subsection{Saving data} -Similarly to boilerplate code, there are good practices that should be followed before saving the data set. -These are sorting and ordering the data set, dropping intermediate variables that are not needed, and -compressing the data set to save disk space and network bandwidth. - -If there is an ID variable or a set of ID variables, then the code should also test that they are uniqueally and -fully identifying the data set.\sidenote{\url{https://dimewiki.worldbank.org/wiki/ID\_Variable\_Properties}} ID -variables are also perfect variables to sort on, and to order leftmost in the data set. - -The command \texttt{compress} makes the data set smaller in terms of memory usage without ever losing any -information. It optimizes the storage types for all variables and therefore makes it smaller on your computer +There are good practices that should be followed before saving any data set. +These are to \texttt{sort} and \texttt{order} the data set, +dropping intermediate variables that are not needed, +and compressing the data set to save disk space and network bandwidth. + +If there is a unique ID variable or a set of ID variables, +the code should test that they are uniqueally and +fully identifying the data set.\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/ID\_Variable\_Properties}} +ID variables are also perfect variables to sort on, +and to \texttt{order} first in the data set. 
+ +The command \texttt{compress} makes the data set smaller in terms of memory usage +without ever losing any information. +It optimizes the storage types for all variables +and therefore makes it smaller on your computer and faster to send over a network or the internet. \codeexample{stata-before-saving.do}{./code/stata-before-saving.do} diff --git a/code/stata-before-saving.do b/code/stata-before-saving.do index cf980debc..bdeb65e01 100644 --- a/code/stata-before-saving.do +++ b/code/stata-before-saving.do @@ -1,18 +1,22 @@ -* If the data set has ID variables, create a local and test -* if they are fully and uniquely identifying the observations. -local idvars household_ID household_member year -isid `idvars' +* If the data set has ID variables, test if they uniquely identifying the observations. + + local idvars household_ID household_member year + isid `idvars' * Sort and order on the idvars (or any other variables if there are no ID variables) -sort `idvars' -order * , seq // Place all variables in alphanumeric order (optional but useful) -order `idvars' , first // Make sure the idvars are the leftmost vars when browsing + + sort `idvars' + order * , seq // Place all variables in alphanumeric order (optional but useful) + order `idvars' , first // Make sure the idvars are the leftmost vars when browsing * Drop intermediate variables no longer needed * Optimize disk space -compress -* Save data settings -save "${myProject}/myDataFile.dta" , replace // The folder global is set in master do-file - use "${myProject}/myDataFile.dta" , clear // It is useful to be able to recall the data quickly + compress + +* Save data + + save "${myProject}/myDataFile.dta" , replace // The folder global is set in master do-file + saveold "${myProject}/myDataFile-13.dta" , replace v(13) // For others + use "${myProject}/myDataFile.dta" , clear // It is useful to be able to recall the data quickly From dc73c4ed3eba1789883c62b73da8433a212a0a57 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Sat, 14 Dec 2019 18:02:16 -0800 Subject: [PATCH 156/854] Final notes --- appendix/stata-guide.tex | 24 ++++++++++++++++++++++++ 1 file changed, 24 insertions(+) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index 24c8ca355..7db60c805 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -391,5 +391,29 @@ \subsection{Saving data} \codeexample{stata-before-saving.do}{./code/stata-before-saving.do} +\subsection{Miscellaneous notes} + +Use wildcards in variable names (\texttt{xx\_?\_*\_xx}) sparingly, +as they may change results if the dataset changes. +Write multiple graphs as \texttt{tw (xx)(xx)(xx)}, not \texttt{tw xx||xx||xx}. +Put spaces around each binary operator except \texttt{\^}. +Therefore write \texttt{gen z = x + y} and \texttt{x\^}\texttt{2}. +When order of operations applies, use spacing and parentheses: +\texttt{hours + (minutes/60) + (seconds/3600)}, not \texttt{hours + minutes / 60 + seconds / 3600}. +For long expressions, the operator starts the new line, so: + +\texttt{gen sumvar = x ///} + +\texttt{ + y ///} + +\texttt{ - z ///} + +\texttt{ + a*(b-c)} + +\noindent Make sure your code doesn't print very much to the results window as this is slow. +This can be accomplished by using \texttt{run file.do} rather than \texttt{do file.do}. +Run outputs like \texttt{reg} using the \texttt{qui} prefix. +Never use interactive commands like \texttt{sum} or \texttt{tab} in dofiles, +unless they are with \texttt{qui} for the purpose of getting \texttt{r()}-statistics. 
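For example, a quiet summary can store a sample statistic for later use without printing anything to the results window; the variable names here are purely illustrative.

* Store the control-group mean of the outcome without printing output
    quietly summarize outcome if (treatment == 0)
    local controlMean = r(mean)

* Use the stored value later, for example in a display or a table note
    display "Control-group mean of the outcome: `controlMean'"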
\mainmatter From 282788ceb3d164300f1fda526df7f234f8cd0fcc Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 18 Dec 2019 14:34:18 -0800 Subject: [PATCH 157/854] Edits, updates, indexing --- chapters/handling-data.tex | 182 +++++++++++++++++++++---------------- 1 file changed, 106 insertions(+), 76 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 0781d4e86..7e2c1e2df 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -41,24 +41,33 @@ \section{Protecting confidence in development research} +Across the social sciences, the open science movement +has been fueled by discoveries of low-quality research practices, +data and code that are inaccessible to the public, +analytical errors in major research papers, +and in some cases even outright fraud. +While the development research community has not yet +experienced any major scandals, +it has become clear that there are necessary incremental improvements +in the way that code and data are handled as part of research. +Major publishers and funders, most notably the American Economic Association, +have taken steps to require that these research components +are accurately reported and preserved as outputs in themselves.\sidenote{ + \url{https://www.aeaweb.org/journals/policies/data-code/}} + The empirical revolution in development research +has therefore led to increased public scrutiny of the reliability of research.\cite{rogers_2017} \index{transparency}\index{credibility}\index{reproducibility} -has led to increased public scrutiny of the reliability of research.\cite{rogers_2017} -Three major components make up this scrutiny: \textbf{reproducibility}.\cite{duvendack2017meant}, \textbf{transparency},\cite{christensen2018transparency} and \textbf{credibility},\cite{ioannidis2017power}. -Reproducibility is one key component of transparency. -Transparency is necessary for consumers of research products -to be able to determine the quality of the research process and the value of the evidence. -Without it, all evidence of credibility comes from reputation, -and it's unclear what that reputation is based on, since it's not transparent. - -Development researchers should take these concerns particularly seriously. +Three major components make up this scrutiny: \textbf{reproducibility}\cite{duvendack2017meant}, \textbf{transparency},\cite{christensen2018transparency} and \textbf{credibility}.\cite{ioannidis2017power} +Development researchers should take these concerns seriously. Many development research projects are purpose-built to address specific questions, and often use unique data or small samples. This approach opens the door to working closely with the broader development community to answer specific programmatic questions and general research inquiries. However, almost by definition, primary data that researchers use for such studies has never been reviewed by anyone else, -so it is hard for others to verify that it was collected, handled, and analyzed appropriately. +so it is hard for others to verify that it was collected, handled, and analyzed appropriately.\sidenote{ + \url{https://blogs.worldbank.org/impactevaluations/what-development-economists-talk-about-when-they-talk-about-reproducibility}} Maintaining confidence in research via the components of credibility, transparency, and reproducibility is the most important way that researchers using primary data can avoid serious error, and therefore these are not by-products but core components of research output. 
@@ -66,7 +75,7 @@ \section{Protecting confidence in development research} \subsection{Research reproducibility} Reproducible research means that the actual analytical processes you used are executable by others.\cite{dafoe2014science} -(We use ``reproducibility'' to refer to the code processes in a specific study.\sidenote{ +(We use ``reproducibility'' to refer to the precise analytical code in a specific study.\sidenote{ \url{http://datacolada.org/76}}) All your code files involving data cleaning, construction and analysis should be public, unless they contain identifying information. @@ -78,7 +87,8 @@ \subsection{Research reproducibility} if any or all of these things were to be done slightly differently.\cite{simmons2011false,simonsohn2015specification,wicherts2016degrees} Letting people play around with your data and code is a great way to have new questions asked and answered -based on the valuable work you have already done. +based on the valuable work you have already done.\sidenote{ + \url{https://blogs.worldbank.org/opendata/making-analytics-reusable}} Services like GitHub that log your research process are valuable resources here. \index{GitHub} Such services can show things like modifications made in response to referee comments, @@ -122,11 +132,11 @@ \subsection{Research transparency} Transparent research will expose not only the code, but all the other research processes involved in developing the analytical approach.\sidenote{ \url{http://www.princeton.edu/~mjs3/open_and_reproducible_opr_2017.pdf}} -This means that readers be able to judge for themselves if the research was done well, -and if the decision-making process was sound. +This means that readers be able to judge for themselves if the research was done well +and the decision-making process was sound. If the research is well-structured, and all of the relevant documentation\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Data_Documentation}} is shared, -this makes it as easy as possible for the reader to understand the analysis later. + \url{https://dimewiki.worldbank.org/wiki/Data_Documentation}} +is shared, this makes it easy for the reader to understand the analysis later. Expecting process transparency is also an incentive for researchers to make better decisions, be skeptical and thorough about their assumptions, and, as we hope to convince you, make the process easier for themselves, @@ -154,7 +164,8 @@ \subsection{Research transparency} Many disciplines have a tradition of keeping a ``lab notebook'', and adapting and expanding this process for the development of lab-style working groups in development is a critical step. -This means explicitly noting decisions as they are made, and explaining the process behind them. +This means explicitly noting decisions as they are made, +and explaining the process behind the decision-making. Documentation on data processing and additional hypotheses tested will be expected in the supplemental materials to any publication. Careful documentation will also save the research team a lot of time during a project, @@ -170,16 +181,16 @@ \subsection{Research transparency} (Email is \textit{not} a note-taking service, because communications are rarely well-ordered and easy to delete.) There are various software solutions for building documentation over time. 
-The \textbf{Open Science Framework}\sidenote{\url{https://osf.io/}} provides one such solution, +The \textbf{Open Science Framework}\sidenote{\url{https://osf.io/}} provides one such solution,\index{Open Science Framework} with integrated file storage, version histories, and collaborative wiki pages. \textbf{GitHub}\sidenote{\url{https://github.com}} provides a transparent documentation system,\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Getting_started_with_GitHub}} \index{task management}\index{GitHub} in addition to version histories and wiki pages. -Such services offers multiple different ways to record the decision process leading to changes and additions, +Such services offers multiple different ways +to record the decision process leading to changes and additions, track and register discussions, and manage tasks. -These are flexibles tool that can be adapted to different team and project dynamics, -but GitHub, unfortunately is less effective for file storage. +These are flexible tools that can be adapted to different team and project dynamics. Each project has specific requirements for data, code, and documentation management, and the exact shape of this process can be molded to the team's needs, but it should be agreed upon prior to project launch. @@ -219,11 +230,10 @@ \subsection{Research credibility} for which practices to adopt, such as reporting on whether and how various practices were implemented. -With this ongoing rise of empirical research and increased public scrutiny of scientific evidence, -this is no longer enough to guarantee that findings will hold their credibility. Even if your methods are highly precise, -your evidence is just as good as your data, -and there are plenty of mistakes that can be made between establishing a design and generating final results that would compromise its conclusions. +your evidence is only as good as your data -- +and there are plenty of mistakes that can be made between +establishing a design and generating final results that would compromise its conclusions. That is why transparency is key for research credibility. It allows other researchers, and research consumers, to verify the steps to a conclusion by themselves, @@ -242,19 +252,19 @@ \subsection{Research credibility} \section{Ensuring privacy and security in research data} -Anytime you are collecting primary data in a development research project,\index{primary data} -you are almost certainly handling data that include \textbf{personally-identifying information (PII)}.\sidenote{\textbf{Personally-identifying information:} -any piece or set of information that can be used to identify an individual research subject. +Anytime you are collecting primary data in a development research project, +you are almost certainly handling data that include \textbf{personally-identifying information (PII)}.\sidenote{ + \textbf{Personally-identifying information:} any piece or set of information that can be used to identify an individual research subject. \url{https://dimewiki.worldbank.org/wiki/De-identification\#Personally\_Identifiable\_Information}} - \index{personally-identifying information} + \index{personally-identifying information}\index{primary data} PII data contains information that can, without any transformation, be used to identify individual people, households, villages, or firms that were included in \textbf{data collection}. 
\index{data collection} This includes names, addresses, and geolocations, and extends to personal information - \index{geodata} such as email addresses, phone numbers, and financial information. - \index{de-identification} + \index{geodata}\index{de-identification} It is important to keep in mind these data privacy principles not only for the individual respondent but also the PII data of their household members or other caregivers who are covered under the survey. + \index{privacy} In some contexts this list may be more extensive -- for example, if you are working in an environment that is either small, specific, or has extensive linkable data sources available to others, @@ -273,14 +283,16 @@ \section{Ensuring privacy and security in research data} If you interact with European institutions or persons, you will also become familiar with ``GDPR'',\sidenote{ \url{http://blogs.lshtm.ac.uk/library/2018/01/15/gdpr-for-research-data/}} -a set of regulations governing data ownership and privacy standards. +a set of regulations governing \textbf{data ownership} and privacy standards.\sidenote{ + \textbf{Data ownership:} the set of rights governing who may accesss, alter, use, or share data, regardless of who possesses it.} + \index{data ownership} In all settings, you should have a clear understanding of who owns your data (it may not be you, even if you collect or possess it), the rights of the people whose information is reflected there, and the necessary level of caution and risk involved in storing and transferring this information. Due to the increasing scrutiny on many organizations -from recently advanced standards and rights, +from recently advanced data rights and regulations, these considerations are critically important. Check with your organization if you have any legal questions; in general, you are responsible to avoid taking any action that @@ -289,23 +301,29 @@ \section{Ensuring privacy and security in research data} \subsection{Obtaining ethical approval and consent} For almost all data collection or research activities that involves PII data, -you will be required to complete some form of Institutional Review Board (IRB) process. +you will be required to complete some form of \textbf{Institutional Review Board (IRB)} process.\sidenote{ + \textbf{Institutional Review Board (IRB):} An institution formally responsible for ensuring that research meets ethical standards.} + \index{Institutional Review Board} Most commonly this consists of a formal application for approval of a specific protocol for consent, data collection, and data handling. -The IRB which has authority over your project is not always apparent, -particularly if your institution does not have its own. -It is customary to obtain an approval from the university IRB where one PI is affiliated, -and if work is being done in an international setting approval is often also required -from a local institution subject to local law. - -The primary consideration of IRBs is the protection of the people whose data is being collected. -Many jurisdictions (especially those responsible to EU law) view all personal data +An IRB which has sole authority over your project is not always apparent, +particularly if some institutions do not have their own. +It is customary to obtain an approval from a university IRB +where at least one PI is affiliated, +and if work is being done in an international setting, +approval is often also required +from an appropriate institution subject to local law. 
+
+One primary consideration of IRBs
+is the protection of the people about whom information is being collected
+and whose lives may be affected by the research design.
+Some jurisdictions (especially those responsible to EU law) view all personal data
as being intrinsically owned by the persons who they describe.
This means that those persons have the right to refuse to participate
in data collection before it happens, as it is happening, or after it has already happened.
It also means that they must explicitly and affirmatively consent
-to the collection, storage, and use of their information for all purposes.
-Therefore, the development of these consent processes is of primary importance.
+to the collection, storage, and use of their information for any purpose.
+Therefore, the development of appropriate consent processes is of primary importance.
Ensuring that research participants are aware that their information
will be stored and may be used for various research purposes is critical.
There are special additional protections in place for vulnerable populations,
@@ -315,24 +333,47 @@ \subsection{Obtaining ethical approval and consent}
Make sure you have significant advance timing with your IRB submissions.
You may not begin data collection until approval is in place,
and IRBs may have infrequent meeting schedules
-or require several rounds of review for an application to be completed.
+or require several rounds of review for an application to be approved.
If there are any deviations from an approved plan or expected adjustments,
report these as early as you can so that you can update or revise the protocol.
Particularly at universities, IRBs have the power to retroactively deny
the right to use data which was not collected in accordance with an approved plan.
This is extremely rare, but shows the seriousness of these considerations
-since the institution itself may face governmental penalties if its IRB
+since the institution itself may face legal penalties if its IRB
is unable to enforce them.
As always, as long as you work in good faith,
-you should not have any issues complying with these expectations.
+you should not have any issues complying with these regulations.

\subsection{Transmitting and storing data securely}

-Raw data which contains PII \textit{must} be \textbf{encrypted}\sidenote{
+Secure data storage and transfer are ultimately your personal responsibility.\sidenote{
+  \url{https://dimewiki.worldbank.org/wiki/Data_Security}}
+First, all online and offline accounts
+-- including personal accounts like computer logins and email --
+need to be protected by strong and unique passwords.
+There are several services that create and store these passwords for you,
+and some provide utilities for sharing passwords with others
+inside that secure environment if multiple users share accounts.
+However, password-protection alone is not sufficient,
+because if the underlying data is obtained through a leak the information itself remains usable.
+Raw data which contains PII \textit{must} therefore be \textbf{encrypted}\sidenote{
+  \textbf{Encryption:} Data storage methods which ensure that accessed files are unreadable if unauthorized access is obtained.
  \url{https://dimewiki.worldbank.org/wiki/encryption}}
- \index{encryption}
during data collection, storage, and transfer.
- \index{data transfer}\index{data storage} -This means that, even if the information were to be intercepted or made public, + \index{encryption}\index{data transfer}\index{data storage} +The biggest security gap is often in transmitting survey plans to and from staff in the field, +since staff with technical specialization are usually in an HQ office. +To protect information in transit to field staff, some key steps are: +(a) to ensure that all devices have hard drive encryption and password-protection; +(b) that no PII information is sent over e-mail (use a secure sync drive instead); +and (c) all field staff receive adequate training on the privacy standards applicable to their work. + +Most modern data collection software has features that, +if enabled, make secure transmission straightforward.\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/SurveyCTO_Form_Settings}} +Many also have features that ensure data is encrypted when stored on their servers, +although this usually needs to be actively enabled and administered. +Proper encryption means that, +even if the information were to be intercepted or made public, the files that would be obtained would be useless to the recipient. In security parlance this person is often referred to as an ``intruder'' but it is rare that data breaches are nefarious or even intentional. @@ -346,24 +387,6 @@ \subsection{Transmitting and storing data securely} or planning or submission of survey materials, you must actively protect those materials in transmission and storage. -First, all accounts need to be protected by strong, unique passwords. -There are many services that create and store these passwords for you, -and some even provide utilities for sharing passwords with teams -inside that secure environment. (There are very few other secure ways to do this.) -Most modern data collection software has additional features that, if enabled, make secure transmission straightforward.\sidenote{\url{https://dimewiki.worldbank.org/wiki/SurveyCTO_Form_Settings}} -Many also have features that ensure your data is encrypted when stored on their servers, -although this usually needs to be actively administered. -Note that password-protection alone is not sufficient to count as encryption, -because if the underlying data is obtained through a leak the information itself is usable. -The biggest security gap is often in transmitting survey plans to field teams, -since they usually do not have a highly trained analyst on site. -To protect this information, some key steps are -(a) to ensure that all devices have hard drive encryption and password-protection; -(b) that no information is sent over e-mail (use a secure sync drive instead); -and (c) all field staff receive adequate training on the privacy standards applicable to their work. - -Secure storage and transfer are ultimately your personal responsibility.\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Data_Security}} There are plenty of options available to keep your data safe, at different prices, from enterprise-grade solutions to free software. It may be sufficient to hold identifying information in an encrypted service, @@ -398,8 +421,9 @@ \subsection{De-identifying and anonymizing information} or the CITI Program.\sidenote{ \url{https://about.citiprogram.org/en/series/human-subjects-research-hsr/}} -In general, though, you shouldn't need to handle PII data very often. -and you can take simple steps to minimize risk by minimizing the handling of PII. 
+In general, though, you shouldn't need to handle PII data very often +once the data collection processes are completed. +You can take simple steps to avoid risks by minimizing the handling of PII. First, only collect information that is strictly needed for the research. Second, avoid the proliferation of copies of identified data. There should only be one raw identified dataset copy @@ -409,14 +433,19 @@ \subsection{De-identifying and anonymizing information} and can be avoided by properly linking identifiers to research information such as treatment statuses and weights, then removing identifiers. -Therefore, once data is securely collected and stored, the first thing you will generally do is \textbf{de-identify} it.\sidenote{ +Therefore, once data is securely collected and stored, +the first thing you will generally do is \textbf{de-identify} it.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/De-identification}} \index{de-identification} (We will provide more detail on this in the chapter on data collection.) -This will create a working de-identified copy that can safely be shared among collaborators. -De-identified data should avoid, for example, you being sent back to every household -to alert them that someone dropped all their personal information on a public bus and we don't know who has it. -This simply means creating a copy of the data that contains no personally-identifiable information. +This will create a working de-identified copy +that can safely be shared among collaborators. +De-identified data should avoid, for example, +you being sent back to every household +to alert them that someone dropped all their personal information +on a public bus and we don't know who has it. +This simply means creating a copy of the data +that contains no personally-identifiable information. This data should be an exact copy of the raw data, except it would be okay if it were for some reason publicly released.\cite{matthews2011data} @@ -429,6 +458,7 @@ \subsection{De-identifying and anonymizing information} These include \texttt{PII\_detection}\sidenote{\url{https://github.com/PovertyAction/PII\_detection}} from IPA, \texttt{PII-scan}\sidenote{\url{https://github.com/J-PAL/PII-Scan}} from JPAL, and \texttt{sdcMicro}\sidenote{\url{https://sdcpractice.readthedocs.io/en/latest/sdcMicro.html}} from the World Bank. + \index{anonymization} The \texttt{sdcMicro} tool, in particular, has a feature that allows you to assess the uniqueness of your data observations, and simple measures of the identifiability of records from that. From 77b1eecac9abc5234aecba7749e0bf1e53daa0c0 Mon Sep 17 00:00:00 2001 From: Luiza Date: Mon, 6 Jan 2020 10:59:10 -0500 Subject: [PATCH 158/854] minor review changes --- chapters/handling-data.tex | 37 ++++++++++++++++++------------------- 1 file changed, 18 insertions(+), 19 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 0781d4e86..a1d517469 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -124,7 +124,7 @@ \subsection{Research transparency} \url{http://www.princeton.edu/~mjs3/open_and_reproducible_opr_2017.pdf}} This means that readers be able to judge for themselves if the research was done well, and if the decision-making process was sound. 
-If the research is well-structured, and all of the relevant documentation\sidenote{ +If the research is well-structured, and the relevant documentation\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Data_Documentation}} is shared, this makes it as easy as possible for the reader to understand the analysis later. Expecting process transparency is also an incentive for researchers to make better decisions, @@ -167,7 +167,7 @@ \subsection{Research transparency} not a one-time requirement or retrospective task. New decisions are always being made as the plan begins contact with reality, and there is nothing wrong with sensible adaptation so long as it is recorded and disclosed. -(Email is \textit{not} a note-taking service, because communications are rarely well-ordered and easy to delete.) +(Email is \textit{not} a note-taking service, because communications are rarely well-ordered and can be easily deleted.) There are various software solutions for building documentation over time. The \textbf{Open Science Framework}\sidenote{\url{https://osf.io/}} provides one such solution, @@ -176,10 +176,9 @@ \subsection{Research transparency} \url{https://dimewiki.worldbank.org/wiki/Getting_started_with_GitHub}} \index{task management}\index{GitHub} in addition to version histories and wiki pages. -Such services offers multiple different ways to record the decision process leading to changes and additions, +Such services offer multiple different ways to record the decision process leading to changes and additions, track and register discussions, and manage tasks. -These are flexibles tool that can be adapted to different team and project dynamics, -but GitHub, unfortunately is less effective for file storage. +These are flexible tools that can be adapted to different team and project dynamics (GitHub, unfortunately, is less effective for file storage). Each project has specific requirements for data, code, and documentation management, and the exact shape of this process can be molded to the team's needs, but it should be agreed upon prior to project launch. @@ -209,17 +208,17 @@ \subsection{Research credibility} \index{pre-registration} Garden varieties of research standards from journals, funders, and others feature both ex ante -(or ”regulation”) and ex post (or “verification”) policies. -Ex ante policies requires that the authors bear the burden +(or ``regulation'') and ex post (or ``verification'') policies. +Ex ante policies require that authors bear the burden of ensuring they provide some set of materials before publication and their quality meet some minimum standard. Ex post policies require that authors make certain materials available to the public, but their quality is not a direct condition for publication. -Still others have suggested “guidance” policies that would offer checklists +Still, others have suggested ``guidance'' policies that would offer checklists for which practices to adopt, such as reporting on whether and how various practices were implemented. -With this ongoing rise of empirical research and increased public scrutiny of scientific evidence, +With the ongoing rise of empirical research and increased public scrutiny of scientific evidence, this is no longer enough to guarantee that findings will hold their credibility. Even if your methods are highly precise, your evidence is just as good as your data, @@ -233,7 +232,7 @@ \subsection{Research credibility} and finding the tools and workflows that best match your project and team. 
Every investment you make in documentation and transparency up front protects your project down the line, particularly as these standards continue to tighten. -Since projects span over many years, +Since projects tend to span over many years, the records you will need to have available for publication are only bound to increase by the time you do so. @@ -250,17 +249,17 @@ \section{Ensuring privacy and security in research data} PII data contains information that can, without any transformation, be used to identify individual people, households, villages, or firms that were included in \textbf{data collection}. \index{data collection} -This includes names, addresses, and geolocations, and extends to personal information +It includes names, addresses, and geolocations, and extends to personal information \index{geodata} such as email addresses, phone numbers, and financial information. \index{de-identification} -It is important to keep in mind these data privacy principles not only for the individual respondent but also the PII data of their household members or other caregivers who are covered under the survey. +It is important to keep data privacy principles in mind not only for the individual respondent but also the PII data of their household members or other caregivers who are covered under the survey. In some contexts this list may be more extensive -- for example, if you are working in an environment that is either small, specific, or has extensive linkable data sources available to others, information like someone's age and gender may be sufficient to identify them even though these would not be considered PII in a larger context. -Therefore you will have to use careful judgment in each case +There is no one-size-fits-all solution to determine what is PII, and you will have to use careful judgment in each case to decide which pieces of information fall into this category.\sidenote{ \url{https://sdcpractice.readthedocs.io/en/latest/}} @@ -290,13 +289,13 @@ \subsection{Obtaining ethical approval and consent} For almost all data collection or research activities that involves PII data, you will be required to complete some form of Institutional Review Board (IRB) process. -Most commonly this consists of a formal application for approval of a specific +It most commonly consists of a formal application for approval of a specific protocol for consent, data collection, and data handling. The IRB which has authority over your project is not always apparent, particularly if your institution does not have its own. It is customary to obtain an approval from the university IRB where one PI is affiliated, and if work is being done in an international setting approval is often also required -from a local institution subject to local law. +from an institution subject to local law. The primary consideration of IRBs is the protection of the people whose data is being collected. Many jurisdictions (especially those responsible to EU law) view all personal data @@ -398,7 +397,7 @@ \subsection{De-identifying and anonymizing information} or the CITI Program.\sidenote{ \url{https://about.citiprogram.org/en/series/human-subjects-research-hsr/}} -In general, though, you shouldn't need to handle PII data very often. +In general, though, you shouldn't need to handle PII data very often, and you can take simple steps to minimize risk by minimizing the handling of PII. First, only collect information that is strictly needed for the research. Second, avoid the proliferation of copies of identified data. 
@@ -420,7 +419,7 @@ \subsection{De-identifying and anonymizing information} This data should be an exact copy of the raw data, except it would be okay if it were for some reason publicly released.\cite{matthews2011data} -Note, however, that you can never \textbf{anonymize} data. +Note, however, that it is in practice impossible to \textbf{anonymize} data. There is always some statistical chance that an individual's identity will be re-linked to the data collected about them by using some other set of data that are collectively unique. @@ -434,11 +433,11 @@ \subsection{De-identifying and anonymizing information} and simple measures of the identifiability of records from that. Additional options to protect privacy in data that will become public exist, and you should expect and intend to release your datasets at some point. -One option is to add noise to data, as the US Census has proposed, +One option is to add noise to data, as the US Census Bureau has proposed, as it makes the trade-off between data accuracy and privacy explicit. But there are no established norms for such ``differential privacy'' approaches: most approaches fundamentally rely on judging ``how harmful'' information disclosure would be. -The fact remains that there is always a balance between information release +The fact remains that there is always a balance between information release (and therefore transparency) and privacy protection, and that you should engage with it actively and explicitly. The best thing you can do is make a complete record of the steps that have been taken so that the process can be reviewed, revised, and updated as necessary. From f5778431ed7a61e7df125e6be4d892700278f5c0 Mon Sep 17 00:00:00 2001 From: Luiza Date: Mon, 6 Jan 2020 16:40:53 -0500 Subject: [PATCH 159/854] [Ch 2] minor changes --- chapters/planning-data-work.tex | 62 ++++++++++++++++++--------------- 1 file changed, 33 insertions(+), 29 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 813649f91..08cb5b075 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -3,7 +3,7 @@ \begin{fullwidth} Preparation for collaborative data work begins long before you collect any data, and involves planning both the software tools you will use yourself -as well as the collaboration platforms and processes for your team. +and the collaboration platforms and processes for your team. In order to be prepared to work on the data you receive with a group, you need to plan out the structure of your workflow in advance. This means knowing which data sets and output you need at the end of the process, @@ -103,18 +103,18 @@ \subsection{Setting up your computer} Find your \textbf{home folder}. It is never your desktop. On MacOS, this will be a folder with your username. On Windows, this will be something like ``This PC''. -Ensure you know how to get the \textbf{absolute filepath} for any given file. -Using the absolute filepath, starting from the filesystem root, +Ensure you know how to get the \textbf{absolute file path} for any given file. +Using the absolute file path, starting from the filesystem root, means that the computer will never accidentally load the wrong file. - \index{filepaths} + \index{file paths} On MacOS this will be something like \path{/users/username/dropbox/project/...}, and on Windows, \path{C:/users/username/Github/project/...}. 
-We will write filepaths such as \path{/Dropbox/project-titleDataWorkEncryptedData/}, +We will write file paths such as \path{/Dropbox/project-title/DataWork/EncryptedData/}, assuming the ``Dropbox'' folder lives inside your home folder. -Filepaths will use forward slashes (\texttt{/}) to indicate folders, +File paths will use forward slashes (\texttt{/}) to indicate folders, and typically use only A-Z (the 26 English characters), dashes (\texttt{-}), and underscores (\texttt{\_}) in folder names and filenames. -You should \textit{always} use forward slashes (\texttt{/}) in filepaths in code, +You should \textit{always} use forward slashes (\texttt{/}) in file paths in code, just like an internet address, and no matter how your computer provides them, because the other type will cause your code to break on many systems. Making the structure of your directories a core part of your workflow is very important, @@ -125,7 +125,7 @@ \subsection{Setting up your computer} some kind of \textbf{file sharing} software. \index{file sharing} The exact services you use will depend on your tasks, -but in general, there are three file sharing paradigms that are the most common. +but in general, there are different approaches to file sharing, and the three discussed here are the most common. \textbf{File syncing} is the most familiar method, and is implemented by software like Dropbox and OneDrive. \index{file syncing} @@ -147,7 +147,7 @@ \subsection{Setting up your computer} high-powered computing processes for large and complex data. All three file sharing methods are used for collaborative workflows, and you should review the types of data work -that you are going to be doing, and plan which types of files +that you will be doing, and plan which types of files will live in which types of sharing services. It is important to note that they are, in general, not interoperable: you cannot have version-controlled files inside a syncing service, @@ -164,8 +164,8 @@ \subsection{Documenting decisions and tasks} is using instant communication for management and documentation. Email is, simply put, not a system. It is not a system for anything. Neither is WhatsApp. \index{email} \index{WhatsApp} -These tools are developed for communicating ``now'' and this is what they does well. -These tools are not structured to manage group membership or to present the same information +These tools are developed for communicating ``now'' and this is what they do well. +They are not structured to manage group membership or to present the same information across a group of people, or to remind you when old information becomes relevant. They are not structured to allow people to collaborate over a long time or to review old discussions. It is therefore easy to miss or lose communications from the past when they have relevance in the present. @@ -186,7 +186,7 @@ \subsection{Documenting decisions and tasks} These systems therefore link communication to specific tasks so that the records related to decision making on those tasks is permanently recorded and easy to find in the future when questions about that task come up. -One popular and free implementation of this system is the one found in GitHub project boards. +One popular and free implementation of this system is found in GitHub project boards. Other systems which offer similar features (but are not explicitly Kanban-based) are GitHub Issues and Dropbox Paper, which has a more chronological structure. 
What is important is that your team chooses its system and sticks to it, @@ -353,7 +353,7 @@ \subsection{Organizing files and folder structures} can easily move between projects without having to reorient themselves to how files and folders are organized. -The DIME file structure is not for everyone. +Our suggested file structure is not for everyone. But if you do not already have a standard file structure across projects, it is intended to be an easy template to start from. This system operates by creating a \texttt{DataWork} folder at the project level, @@ -374,10 +374,10 @@ \subsection{Organizing files and folder structures} The \texttt{DataWork} folder may be created either inside an existing project-based folder structure, or it may be created separately. -It should always be created by the leading RA by agreement with the PI. +It's usually created by the leading RA in agreement with the PI. Increasingly, our recommendation is to create the \texttt{DataWork} folder separately from the project management materials, -reserving the ``project folder'' for data collection and other management work. +reserving the ``project folder'' for contracts, Terms of Reference, briefs and other administrative or management work. \index{project folder} This is so the project folder can be maintained in a synced location like Dropbox, while the code folder can be maintained in a version-controlled location like GitHub. @@ -394,7 +394,7 @@ \subsection{Organizing files and folder structures} \index{{\LaTeX}}\index{dynamic documents} Keeping such plaintext files in a version-controlled folder allows you to maintain better control of their history and functionality. -Because of the high degree with which code files depend on file structure, +Because of the high degree of dependence between code files depend and file structure, you will be able to enforce better practices in a separate folder than in the project folder, which will usually be managed by a PI, FC, or field team members. @@ -408,10 +408,12 @@ \subsection{Organizing files and folder structures} or to understand why the significance level of your estimates has changed. Everyone who has ever encountered a file named something like \texttt{final\_report\_v5\_LJK\_KLE\_jun15.docx} can appreciate how useful such a system can be. + + Most syncing services offer some kind of rudimentary version control; -These are usually enough to manage changes to binary files (such as Word and PowerPoint documents) +these are usually enough to manage changes to binary files (such as Word and PowerPoint documents) without needing to rely on dreaded filename-based versioning conventions. -For technical files, however, a more detailed version control system is usually desirable. +For code files, however, a more detailed version control system is usually desirable. We recommend using Git\sidenote{ \textbf{Git:} a multi-user version control system for collaborating on and tracking changes to code as it is written.} for all plaintext files. @@ -464,7 +466,7 @@ \subsection{Documenting and organizing code} Below we discuss a few crucial steps to code organization. They all come from the principle that code is an output by itself, not just a means to an end, -and code should be written thinking of how easy it will be for someone to read it later. +and should be written thinking of how easy it will be for someone to read it later. Code documentation is one of the main factors that contribute to readability. Start by adding a code header to every file. 
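As a sketch of what such a header might look like (the project name, author, and file paths below are placeholders, not taken from any real project):

    /*******************************************************************************
    *  PROJECT:    [project name]                                                  *
    *  PURPOSE:    Clean the household survey data and construct analysis variables*
    *  WRITTEN BY: [author name]                                                   *
    *  REQUIRES:   DataWork/EncryptedData/raw-survey.dta                           *
    *  CREATES:    DataWork/FinalData/constructed-data.dta                         *
    *******************************************************************************/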
@@ -478,17 +480,18 @@ \subsection{Documenting and organizing code} (GitHub offers a lot of different documentation options, for example), the information that is relevant to understand the code should always be written in the code file. -Mixed among the code itself, are two types of comments that should be included. +In the script, alongside the code, are two types of comments that should be included. The first type of comment describes what is being done. This might be easy to understand from the code itself if you know the language well enough and the code is clear, but often it is still a great deal of work to reverse-engineer the code's intent. Writing the task in plain English (or whichever language you communicate with your team on) -will make it easier for everyone to read and understand the code's purpose. +will make it easier for everyone to read and understand the code's purpose +-- and also for you to think about your code as you write it. The second type of comment explains why the code is performing a task in a particular way. As you are writing code, you are making a series of decisions that (hopefully) make perfect sense to you at the time. -These are often highly specialized and may exploit functionality +These are often highly specialized and may exploit a functionality that is not obvious or has not been seen by others before. Even you will probably not remember the exact choices that were made in a couple of weeks. Therefore, you must document your precise processes in your code. @@ -502,7 +505,7 @@ \subsection{Documenting and organizing code} So, for example, if you want to find the line in your code where a variable was created, you can go straight to \texttt{PART 2: Create new variables}, instead of reading line by line through the entire code. -RStudio, for example makes it very easy to create sections, +RStudio, for example, makes it very easy to create sections, and it compiles them into an interactive script index for you. In Stata, you can use comments to create section headers, though they're just there to make the reading easier and don't have functionality. @@ -537,7 +540,7 @@ \subsection{Documenting and organizing code} The master script should mimic the structure of the \texttt{DataWork} folder. This is done through the creation of globals (in Stata) or string scalars (in R). These coding shortcuts can refer to subfolders, -so that those folders can be referenced without repeatedly writing out their absolute filepaths. +so that those folders can be referenced without repeatedly writing out their absolute file paths. Because the \texttt{DataWork} folder is shared by the whole team, its structure is the same in each team member's computer. The only difference between machines should be @@ -556,12 +559,13 @@ \subsection{Documenting and organizing code} Reading it again to organize and comment it as you prepare it to be reviewed will help you identify them. Try to have a code review scheduled frequently, every time you finish writing a piece of code, or complete a small task. -If you wait for a long time to have your code review, and it gets too complex, +If you wait for a long time to have your code reviewed, and it gets too complex, preparation and code review will require more time and work, and that is usually the reason why this step is skipped. 
-Making sure that the code is running properly on other machines, +One other important advantage of code review if that +making sure that the code is running properly on other machines, and that other people can read and understand the code easily, -is also the easiest way to be prepared in advance for a smooth project handover. +is the easiest way to be prepared in advance for a smooth project handover. % ---------------------------------------------------------------------------------------------- \subsection{Output management} @@ -598,7 +602,7 @@ \subsection{Output management} It is common for teams to maintain one analyisis file or folder with ``exploratory analysis'', which are pieces of code that are stored only to be found again in the future, but not cleaned up to be included in any outputs yet. -Once you are happy with a partiular result or output, +Once you are happy with a result or output, it should be named and moved to a dedicated location. It's typically desirable to have the names of outputs and scripts linked, so, for example, \texttt{factor-analysis.do} creates \texttt{factor-analysis-f1.eps} and so on. @@ -622,7 +626,7 @@ \subsection{Output management} that import inputs every time they are compiled. This means you can skip the copying and pasting whenever an output is updated. Because it's written in plaintext, it's also easier to control and document changes using Git. -Creating documents in {\LaTeX} using an integrated writing environment such as TeXstudio +Creating documents in {\LaTeX} using an integrated writing environment such as TeXstudio, TeXmaker or LyX is great for outputs that focus mainly on text, but include small chunks of code and static code outputs. This book, for example, was written in {\LaTeX} and managed on GitHub. From 7c4bd0f8586df55a0ae77ddf812b505c99571ab2 Mon Sep 17 00:00:00 2001 From: Luiza Date: Mon, 6 Jan 2020 18:42:39 -0500 Subject: [PATCH 160/854] [Appendix] Minor changes --- appendix/stata-guide.tex | 153 ++++++++++++++++++++++----------------- 1 file changed, 85 insertions(+), 68 deletions(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index 7db60c805..941471fa8 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -9,27 +9,28 @@ their first years after graduating. Recent Masters' program graduates that have joined our team tended to have very good knowledge in the theory of our -trade, but tended to require a lot of training in its practical skills. -To us, it is like hiring architects that can sketch, describe, and discuss +trade, but also to require a lot of training in its practical skills. +To us, this is like graduating in architecture having learned +how to sketch, describe, and discuss the concepts and requirements of a new building very well, -but who do not have the technical skillset -to actually contribute to a blueprint using professional standards +but without having the technical skill-set +to actually contribute to a blueprint following professional standards that can be used and understood by other professionals during construction. The reasons for this are probably a topic for another book, but in today's data-driven world, people working in quantitative economics research must be proficient programmers, and that includes more than being able to compute the correct numbers. -This appendix begins with a short section with instructions +This appendix begins with a short section containing instructions on how to access and use the code examples shared in this book. 
-The second section contains a the current DIME Analytics style guide for Stata code.
-No matter your technical proficiency in writing Stata code,
-we believe these resources can help any person write more understandable code.
+The second section contains the DIME Analytics style guide for Stata code.
+We believe these resources can help anyone write more understandable code,
+no matter how proficient they are in writing Stata code.
 Widely accepted and used style guides are common in most programming languages,
 and we think that using such a style guide greatly improves the quality
 of research projects coded in Stata.
-We hope that this guide can help to increase the emphasis in the Stata community
-on using, improving, sharing and standardizing Stata code style.
+We hope that this guide can help to increase the emphasis
+given to using, improving, sharing and standardizing code style among the Stata community.
 Style guides are the most important tool in how you,
 like an architect, can draw a blueprint that can be understood
 and used by everyone in your trade.

@@ -59,32 +60,41 @@ \section{Using the code examples in this book}

 \subsection{Understanding Stata code}

-Regardless if you are new to Stata or have used it for decades, you will always run into commands that
-you have not seen before or do not remember what they do. Every time that happens, you should always look
-that command up in the helpfile. For some reason, we often encounter the conception that the helpfiles
-are only for beginners. We could not disagree with that conception more, as the only way to get better at Stata
-is to constantly read helpfiles. So if there is a command that you do not understand in any of our code
-examples, for example \texttt{isid}, then write \texttt{help isid}, and the helpfile for the command \texttt{isid} will open.
-
-We cannot emphasize enough how important we think it is that you get into the habit of reading helpfiles.
-
-Sometimes, you will encounter code employing user-written commands,
-and you will not be able to read those helpfiles until you have installed the commands.
+Regardless of being new to Stata or having used it for decades, you will always run into commands that
+you have not seen before or whose purpose you do not remember.
+Every time that happens, you should always look that command up in the help file.
+For some reason, we often encounter the conception that help files are only for beginners.
+We could not disagree with that conception more,
+as the only way to get better at Stata is to constantly read help files.
+So if there is a command that you do not understand in any of our code examples,
+for example \texttt{isid}, then write \texttt{help isid},
+and the help file for the command \texttt{isid} will open.
+
+We cannot emphasize enough how important we think it is that you get into the habit of reading help files.
+
+Sometimes, you will encounter code that employs user-written commands,
+and you will not be able to read their help files until you have installed the commands.
+Two examples of these in our code are \texttt{randtreat} or \texttt{ieboilstart}. +The most common place to distribute user-written commands for Stata +is the Boston College Statistical Software Components (SSC) archive. +In our code examples, we only use either Stata's built-in commands or commands available from the +SSC archive. +So, if your installation of Stata does not recognize a command in our code, for example \texttt{randtreat}, then type \texttt{ssc install randtreat} in Stata. -Some commands on SSC are distributed in packages, for example \texttt{ieboilstart}, meaning that you will -not be able to install it using \texttt{ssc install ieboilstart}. If you do, Stata will suggest that you -instead use \texttt{findit ieboilstart} which will search SSC (among other places) and see if there is a -package that has a command called \texttt{ieboilstart}. Stata will find \texttt{ieboilstart} in the package -\texttt{ietoolkit}, so then you will type \texttt{ssc install ietoolkit} instead in Stata. - -We understand that this can be confusing the first time you work with this, but this is the best way to set -up your Stata installation to benefit from other people's work that they have made publicly available, and -once used to installing commands like this it will not be confusing at all. +Some commands on SSC are distributed in packages. +This is the case, for example, of \texttt{ieboilstart}. +That means that you will not be able to install it using \texttt{ssc install ieboilstart}. +If you do, Stata will suggest that you instead use \texttt{findit ieboilstart}, +which will search SSC (among other places) and see if there is a +package that contains a command called \texttt{ieboilstart}. +Stata will find \texttt{ieboilstart} in the package \texttt{ietoolkit}, +so to use this command you will type \texttt{ssc install ietoolkit} in Stata instead. + +We understand that it can be confusing to work with packages for first time, +but this is the best way to set up your Stata installation to benefit from other +people's work that has been made publicly available, +and once you get used to installing commands like this it will not be confusing at all. All code with user-written commands, furthermore, is best written when it installs such commands at the beginning of the master do-file, so that the user does not have to search for packages manually. @@ -96,19 +106,23 @@ \subsection{Why we use a Stata style guide} non-official style guides like the JavaScript Standard Style\sidenote{\url{https://standardjs.com/\#the-rules}} for JavaScript or Hadley Wickham's\sidenote{\url{http://adv-r.had.co.nz/Style.html}} style guide for R. -Aesthetics is an important part of style guides, but not the main point. The existence of style guides -improves the quality of the code in that language produced by all programmers in the community. +Aesthetics is an important part of style guides, but not the main point. +The existence of style guides improves the quality of the code in that language that is produced by all programmers in the community. It is through a style guide that unexperienced programmers can learn from more experienced programmers -how certain coding practices are more or less error-prone. Broadly-accepted style guides make it easier to -borrow solutions from each other and from examples online without causing bugs that might only be found too -late. Similarly, globally standardized style guides make it easier to solve each others' +how certain coding practices are more or less error-prone. 
+Broadly-accepted style guides make it easier to borrow solutions from each other and from examples online without causing bugs that might only be found too late. +Similarly, globally standardized style guides make it easier to solve each others' problems and to collaborate or move from project to project, and from team to team. -There is room for personal preference in style guides, but style guides are first and foremost -about quality and standardization -- especially when collaborating on code. We believe that a commonly used Stata style guide -would improve the quality of all code written in Stata, which is why we have begun the one included here. You do not necessarily need to follow our -style guide precisely. We encourage you to write your own style guide if you disagree with us. The best style guide -woud be the one adopted the most widely. What is most important is that you adopt a style guide and follow it consistently across your projects. +There is room for personal preference in style guides, +but style guides are first and foremost about quality and standardization -- +especially when collaborating on code. +We believe that a commonly used Stata style guide would improve the quality of all code written in Stata, +which is why we have begun the one included here. +You do not necessarily need to follow our style guide precisely. +We encourage you to write your own style guide if you disagree with us. +The best style guide would be the one adopted the most widely. +What is important is that you adopt a style guide and follow it consistently across your projects. \newpage @@ -158,15 +172,15 @@ \subsection{Commenting code} \subsection{Abbreviating commands} -Stata commands can often be abbreviated in the code. In the helpfiles you can tell if a command can be -abbreviated, indicated by the part of the name that is underlined in the syntax section at the top. -Only built-in commands can be abbreviated; user-written commands can not. +Stata commands can often be abbreviated in the code. +You can tell if a command can be abbreviated if the help file indicates an abbreviation by underlining part of the name in the syntax section at the top. +Only built-in commands can be abbreviated; user-written commands cannot. Although Stata allows some commands to be abbreviated to one or two characters, this can be confusing -- two-letter abbreviations can rarely be ``pronounced'' in an obvious way that connects them to the functionality of the full command. Therefore, command abbreviations in code should not be shorter than three characters, with the exception of \texttt{tw} for \texttt{twoway} and \texttt{di} for \texttt{display}, -and abbreviations should only be used when widely accepted abbreviation exists. +and abbreviations should only be used when widely a accepted abbreviation exists. We do not abbreviate \texttt{local}, \texttt{global}, \texttt{save}, \texttt{merge}, \texttt{append}, or \texttt{sort}. Here is our non-exhaustive list of widely accepted abbreviations of common Stata commands. @@ -201,31 +215,33 @@ \subsection{Abbreviating variables} \subsection{Writing loops} -In Stata examples and other code languages, it is common that the name of the local generated by \texttt{foreach} or \texttt{forvalues} -is named something as simple as \texttt{i} or \texttt{j}. In Stata, however, +In Stata examples and other code languages, it is common for the name of the local generated by \texttt{foreach} or \texttt{forvalues} +to be something as simple as \texttt{i} or \texttt{j}. 
In Stata, however, loops generally index a real object, and looping commands should name that index descriptively. One-letter indices are acceptable only for general examples; for looping through \textbf{iterations} with \texttt{i}; and for looping across matrices with \texttt{i}, \texttt{j}. Other typical index names are \texttt{obs} or \texttt{var} when looping over observations or variables, respectively. -But since Stata does not have arrays such abstract syntax should not be used in Stata code otherwise. -Instead, index names should describe what the code is looping over, for example household members, crops, or -medicines. This makes code much more readable, particularly in nested loops. +But since Stata does not have arrays, such abstract syntax should not be used in Stata code otherwise. +Instead, index names should describe what the code is looping over -- +for example household members, crops, or medicines. +This makes code much more readable, particularly in nested loops. \codeexample{stata-loops.do}{./code/stata-loops.do} \subsection{Using whitespace} -In Stata, one space or many spaces does not make a difference to code execution, +In Stata, adding one or many spaces does not make a difference to code execution, and this can be used to make the code much more readable. We are all very well trained in using whitespace in software like PowerPoint and Excel: we would never present a PowerPoint presentation where the text does not align -or submit an Excel table with unstructured rows and columns, and the same principles apply to coding. +or submit an Excel table with unstructured rows and columns. +The same principles apply to coding. In the example below the exact same code is written twice, but in the better example whitespace is used to signal to the reader that the central object of this segment of code is the variable \texttt{employed}. -Organizing the code like this makes the code much quicker to read, and small typos -stand out much more, making them easier to spot. +Organizing the code like this makes it much quicker to read, +and small typos stand out more, making them easier to spot. \codeexample{stata-whitespace-columns.do}{./code/stata-whitespace-columns.do} @@ -256,7 +272,7 @@ \subsection{Writing conditional expressions} Use \texttt{if-else} statements when applicable even if you can express the same thing with two separate \texttt{if} statements. When using \texttt{if-else} statements you are communicating to anyone reading your code -that the two cases are mutually exclusive which makes your code more readable. +that the two cases are mutually exclusive, which makes your code more readable. It is also less error-prone and easier to update if you want to change the condition. \codeexample{stata-conditional-expressions2.do}{./code/stata-conditional-expressions2.do} @@ -318,7 +334,7 @@ \subsection{Writing file paths} \textbf{Dynamic} file paths use global macros for the location of the root folder. These globals should be set in a central master do-file. This makes it possible to write file paths that work very similarly to relative paths. -This achieves the functionality that setting \texttt{cd} is often intended to: +It also achieves the functionality that setting \texttt{cd} is often intended to: executing the code on a new system only requires updating file path globals in one location. If global names are unique, there is no risk that files are saved in the incorrect project folder. You can create multiple folder globals as needed and this is encouraged. 
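As a minimal sketch of how such globals might be set in a master do-file (the user names and folder locations below are illustrative placeholders only):

    * Each team member sets only the path to their own copy of the project root folder
    if "`c(username)'" == "jane"  global projectFolder "C:/Users/jane/Dropbox/project-title"
    if "`c(username)'" == "kweku" global projectFolder "/Users/kweku/Dropbox/project-title"

    * All other folder globals are defined relative to the root,
    * so they are identical for every user
    global dataWorkFolder "${projectFolder}/DataWork"
    global rawData        "${dataWorkFolder}/EncryptedData"
    global outputFolder   "${dataWorkFolder}/Output"

    * Project code then refers only to these globals when reading or saving files
    use "${rawData}/baseline-survey.dta" , clear

Written this way, moving the project to a new computer only requires updating the single root global for that user.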
@@ -336,7 +352,7 @@ \subsection{Line breaks}
 (The \texttt{\#delimit} command is only acceptable for advanced function programming
 and is officially discouraged in analytical code.\cite{cox2005styleguide}
 Never, for any reason, use \texttt{/* */} to wrap a line.)
-The \texttt{///} breaks the line in the code editor,
+Using \texttt{///} breaks the line in the code editor,
 while telling Stata that the same line of code continues on the next line.
 The \texttt{///} breaks do not need to be horizontally aligned in code,
 although you may prefer to if they have comments that read better aligned,
@@ -361,11 +377,12 @@ \subsection{Using boilerplate code}
 it turns \texttt{varabbrev} off so abbreviated variable names are rejected;
 and it maximizes the allowed memory usage and matrix size
 so that code is not rejected on other machines for violating system limits.
-(Again, other software versions, such as Small Stata and outdated versions,
-have lower limits and it may not be able to run newer code in them.)
+(For example, Stata/SE and Stata/IC allow for different maximum numbers of variables,
+and the same is true of Stata 14 and Stata 15,
+so code written in one of these versions may not run in another.)
 Finally, it clears all stored information in Stata memory,
 such as non-installed programs and globals,
-so it gets as close as possible to opening Stata fresh.
+getting as close as possible to opening Stata fresh.
 
 \codeexample{stata-boilerplate.do}{./code/stata-boilerplate.do}
 
@@ -402,18 +419,18 @@ \subsection{Miscellaneous notes}
 \texttt{hours + (minutes/60) + (seconds/3600)},
 not \texttt{hours + minutes / 60 + seconds / 3600}.
 For long expressions, the operator starts the new line, so:
-\texttt{gen sumvar = x ///}
+\texttt{gen sumvar = x ///}
 
-\texttt{ + y ///}
+\texttt{ + y ///}
 
-\texttt{ - z ///}
+\texttt{ - z ///}
 
-\texttt{ + a*(b-c)}
+\texttt{ + a*(b-c)}
 
 \noindent Make sure your code doesn't print very much to the results window as this is slow.
 This can be accomplished by using \texttt{run file.do} rather than \texttt{do file.do}.
 Run commands whose output is not needed, such as \texttt{reg}, with the \texttt{qui} prefix.
 Never use interactive commands like \texttt{sum} or \texttt{tab} in dofiles,
-unless they are with \texttt{qui} for the purpose of getting \texttt{r()}-statistics.
+unless they are combined with \texttt{qui} for the purpose of getting \texttt{r()}-statistics.
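To make these last recommendations concrete, the following hedged sketch shows a long expression broken with \texttt{///} and a \texttt{qui} prefix used only to recover \texttt{r()}-statistics; the variables \texttt{x}, \texttt{y}, \texttt{z}, \texttt{a}, \texttt{b}, \texttt{c}, and \texttt{employed} are placeholders.

* A long expression broken with ///, with the operator starting each new line
gen sumvar = x   ///
    + y          ///
    - z          ///
    + a*(b-c)

* -summarize- is only run here to recover r(mean), so it is prefixed with qui
qui summarize employed
local employment_rate = r(mean)
di "Share employed: `employment_rate'"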
\mainmatter From e9ea30d421e3c3e78dceac7d54dabcac29b2f781 Mon Sep 17 00:00:00 2001 From: Luiza Date: Mon, 6 Jan 2020 18:42:54 -0500 Subject: [PATCH 161/854] [Appendix] Changes to spacing --- code/stata-before-saving.do | 4 ++-- code/stata-comments.do | 3 +-- code/stata-linebreak.do | 12 ++++++------ 3 files changed, 9 insertions(+), 10 deletions(-) diff --git a/code/stata-before-saving.do b/code/stata-before-saving.do index bdeb65e01..f770187dd 100644 --- a/code/stata-before-saving.do +++ b/code/stata-before-saving.do @@ -17,6 +17,6 @@ * Save data - save "${myProject}/myDataFile.dta" , replace // The folder global is set in master do-file + save "${myProject}/myDataFile.dta" , replace // The folder global is set in master do-file saveold "${myProject}/myDataFile-13.dta" , replace v(13) // For others - use "${myProject}/myDataFile.dta" , clear // It is useful to be able to recall the data quickly + use "${myProject}/myDataFile.dta" , clear // It is useful to be able to recall the data quickly diff --git a/code/stata-comments.do b/code/stata-comments.do index bc9a5d374..911d48e2a 100644 --- a/code/stata-comments.do +++ b/code/stata-comments.do @@ -8,8 +8,7 @@ TYPE 1: TYPE 2: -* Standardize settings, explicitly set the version, and -* clear all previous information from memory +* Standardize settings, explicitly set version, and clear memory * (This comment is used to document a task covering at maximum a few lines of code) ieboilstart, version(13.1) `r(version)' diff --git a/code/stata-linebreak.do b/code/stata-linebreak.do index 1fd87c20e..a44886af8 100644 --- a/code/stata-linebreak.do +++ b/code/stata-linebreak.do @@ -1,10 +1,10 @@ GOOD: - graph hbar invil /// Proportion in village - if (priv == 1) /// Private facilities only - , over(statename, sort(1) descending) /// Order states by values - blabel(bar, format(%9.0f)) /// Label the bars - ylab(0 "0%" 25 "25%" 50 "50%" 75 "75%" 100 "100%") /// - ytit("Share of private primary care visits made in own village") + graph hbar invil /// Proportion in village + if (priv == 1) /// Private facilities only + , over(statename, sort(1) descending) /// Order states by values + blabel(bar, format(%9.0f)) /// Label the bars + ylab(0 "0%" 25 "25%" 50 "50%" 75 "75%" 100 "100%") /// + ytit("Share of private primary care visits made in own village") BAD: #delimit ; From 4bf1915e84d62082d949396dc91d2e38e3f9c10e Mon Sep 17 00:00:00 2001 From: Luiza Date: Mon, 6 Jan 2020 18:43:09 -0500 Subject: [PATCH 162/854] [Appendix] Commented out actions inside loops and ifs --- code/stata-conditional-expressions2.do | 4 ++++ code/stata-loops.do | 6 +++--- 2 files changed, 7 insertions(+), 3 deletions(-) diff --git a/code/stata-conditional-expressions2.do b/code/stata-conditional-expressions2.do index bd47d48c3..8d6476b40 100644 --- a/code/stata-conditional-expressions2.do +++ b/code/stata-conditional-expressions2.do @@ -1,13 +1,17 @@ GOOD: if (`sampleSize' <= 100) { + * do something } else { + * do something else } BAD: if (`sampleSize' <= 100) { + * do something } if (`sampleSize' > 100) { + * do something else } diff --git a/code/stata-loops.do b/code/stata-loops.do index 7091a3a6b..2550b7805 100644 --- a/code/stata-loops.do +++ b/code/stata-loops.do @@ -2,14 +2,14 @@ BAD: * Loop over crops foreach i in potato cassava maize { - do something to `i' + * do something to `i' } GOOD: * Loop over crops foreach crop in potato cassava maize { - do something to `crop' + * do something to `crop' } GOOD: @@ -19,6 +19,6 @@ local crops potato cassava maize foreach crop of local 
crops { * Loop over plot number forvalues plot_num = 1/10 { - do something to `crop' in `plot_num' + * do something to `crop' in `plot_num' } // End plot loop } // End crop loop From 3308f8f110e6a67bb4e814c2ce5c4622061385fa Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 7 Jan 2020 07:07:20 -0500 Subject: [PATCH 163/854] [Ch 7] Minor changes --- chapters/publication.tex | 61 ++++++++++++++++++++-------------------- 1 file changed, 31 insertions(+), 30 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 1f0373357..563c9ff4e 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -3,7 +3,7 @@ \begin{fullwidth} Publishing academic research today extends well beyond writing up a Word document alone. There are often various contributors making specialized inputs to a single output, -a large number of iterations, versions, and rervisions, +a large number of iterations, versions, and revisions, and a wide variety of raw materials and results to be published together. Ideally, your team will spend as little time as possible fussing with the technical requirements of publication. @@ -20,13 +20,13 @@ These represent an intellectual contribution in their own right, because they enable others to learn from your process and better understand the results you have obtained. -Holding code and data to the same standards a written work -is a new discpline for many researchers, -and here we provide some basic guidelines and basic responsibilities for both -that will help you to prepare a functioning and informative replication package. +Holding code and data to the same standards as written work +is a new discipline for many researchers, +and here we provide some basic guidelines and responsibilities for both +that will help you prepare a functioning and informative replication package. In all cases, we note that technology is rapidly evolving and that the specific tools noted here may not remain cutting-edge, -but the core principles of materials publication and transparency will endure. +but the core principles involved in publication and transparency will endure. \end{fullwidth} %------------------------------------------------ @@ -35,14 +35,14 @@ \section{Collaborating on technical writing} It is increasingly rare that a single author will prepare an entire manuscript alone. More often than not, documents will pass back and forth between several writers -before they are prepared for publication, +before they are ready for publication, so it is essential to use technology and workflows that avoid conflicts. Just as with the preparation of analytical outputs, -this means adopting tools practices that enable tasks +this means adopting tools and practices that enable tasks such as version control and simultaneous contribution. Furthermore, it means preparing documents that are \textbf{dynamic} -- meaning that updates to the analytical outputs that constitute them -can be updated in the final output with a single process, +can be passed on to the final output with a single process, rather than copy-and-pasted or otherwise handled individually. Thinking of the writing process in this way is intended to improve organization and reduce error, @@ -58,13 +58,13 @@ \subsection{Dynamic documents} the next iteration of the document will automatically include all changes made to all outputs without any additional intervention from the writer. 
This means that updates will never be accidentally excluded, -and it further means that updating results will never become more difficult +and it further means that updating results will not become more difficult as the number of inputs grows, because they are all managed by a single integrated process. You will note that this is not possible in tools like Microsoft Office. In Word, for example, you have to copy and paste each object individually -whenever there are materials that have to be updated. +whenever tables, graphs or other inputs have to be updated. This means that both the features above are not available: fully updating the document becomes more and more time-consuming as the number of inputs increases, @@ -79,8 +79,8 @@ \subsection{Dynamic documents} There are a number of tools that can be used for dynamic documents. They fall into two broad groups -- -the first which compiles a document as part of code execution, -and the second which operates a separate document compiler. +the first compiles a document as part of code execution, +and the second operates a separate document compiler. In the first group are tools such as R's RMarkdown and Stata's \texttt{dyndoc}. These tools ``knit'' or ``weave'' text and code together, and are programmed to insert code outputs in pre-specified locations. @@ -97,29 +97,29 @@ \subsection{Dynamic documents} that allows linkages to files in Dropbox, which are then automatically updated anytime the file is replaced. Like the first class of tools, Dropbox Paper has very limited formatting options, -but it is appropriate for work with collaborators who are not using statistical software. +but it is appropriate for working with collaborators who are not using statistical software. However, the most widely utilized software for dynamically managing both text and results is \LaTeX\ (pronounced ``lah-tek'').\sidenote{ \url{https://www.maths.tcd.ie/~dwilkins/LaTeXPrimer/GSWLaTeX.pdf}} \index{\LaTeX} -(\LaTeX\ also operates behind-the-scenes in many other tools.) While this tool has a significant learning curve, its enormous flexibility in terms of operation, collaboration, and output formatting and styling makes it the primary choice for most large technical outputs today. +In fact, \LaTeX\ operates behind-the-scenes in many of the tools listed in the first group. \subsection{Technical writing with \LaTeX} \LaTeX\ is billed as a ``document preparation system''. What this means is worth unpacking. -In \LaTeX\, instead of writing in a ``what-you-see-is-what-you-get'' mode +In {\LaTeX}, instead of writing in a ``what-you-see-is-what-you-get'' mode as you do in Word or the equivalent, you write plain text interlaced with coded instructions for formatting (similar in concept to HTML). The \LaTeX\ system includes commands for simple markup like font styles, paragraph formatting, section headers and the like. But it also includes special controls for including tables and figures, -footnotes and endnotes, complex mathematics, and automated bibliography preparation. +footnotes and endnotes, complex mathematical notation, and automated bibliography preparation. It also allows publishers to apply global styles and templates to already-written material, allowing them to reformat entire documents in house styles with only a few keystrokes. 
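As a concrete illustration of the dynamic-document workflow described above, here is a hedged sketch of the Stata side of the process. The folder global and file names are hypothetical, and \texttt{esttab} is a user-written command from the \texttt{estout} package; the point is simply that every run of the code overwrites the \texttt{.tex} and image files that the \LaTeX\ document pulls in through \texttt{\textbackslash input} and \texttt{\textbackslash includegraphics}, so recompiling the document always reflects the latest results.

* Export a regression table as a .tex fragment (assumes the user-written -estout- package is installed)
reg outcome treatment covariate
esttab using "${myOutput}/regression-table.tex" , replace label booktabs

* Export a figure to the same output folder
graph bar outcome , over(treatment)
graph export "${myOutput}/outcome-figure.png" , replace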
@@ -241,11 +241,11 @@ \subsection{Getting started with \LaTeX\ via Overleaf} \section{Preparing a complete replication package} While we have focused so far on the preparation of written materials for publication, -is is increasingly important for you to consider how you will publish +it is increasingly important for you to consider how you will publish the data and code you used for your research as well. -Increasingly, major journals are requiring that publications +More and more major journals are requiring that publications provide direct links to both the code and data used to create the results, -and some even require that they are able to reproduce the results themselves +and some even require being able to reproduce the results themselves before they will approve a paper for publication.\sidenote{ \url{https://www.aeaweb.org/journals/policies/data-code/}} If your material has been well-structured throughout the analytical process, @@ -272,7 +272,7 @@ \subsection{Publishing data for replication} at least some subset of your analytical dataset. You should only directly publish data which is fully de-identified and, to the extent required to ensure reasonable privacy, -potential identifying characteristics are futher masked or removed. +potentially identifying characteristics are further masked or removed. In all other cases, you should contact an appropriate data catalog to determine what privacy and licensing options are available. @@ -334,27 +334,27 @@ \subsection{Publishing code for replication} In most cases code will not contain identifying information; check carefully that it does not. Pubishing code also requires assigning a license to it; -in a majority of cases code publishers like GitHub +in a majority of cases, code publishers like GitHub offer extremely permissive licensing options by default. (If you do not provide a license, nobody can use your code!) Make sure the code functions identically on a fresh install of your chosen software. A new user should have no problem getting the code to execute perfectly. In either a scripts folder or in the root directory, -include a master script (dofile or Rscript for example). +include a master script (dofile or R script for example). The master script should allow the reviewer -to change one line of code setting the directory path. -Then, running the master script should run the entire project +to change a single line of code: the one setting the directory path. +After that, running the master script should run the entire project and re-create all the raw outputs exactly as supplied. Indicate the filename and line to change. Check that all your code will run completely on a new computer: Install any required user-written commands in the master script (for example, in Stata using \texttt{ssc install} or \texttt{net install} -and in R include code for installing packages, +and in R include code giving users the option to install packages, including selecting a specific version of the package if necessary). In many cases you can even directly provide the underlying code for any user-installed packages that are needed to ensure forward-compatibility. -Make sure system settings like \texttt{version}, \texttt{matsize}, and texttt{varabbrev} are set. +Make sure system settings like \texttt{version}, \texttt{matsize}, and \texttt{varabbrev} are set. Finally, make sure that the code and its inputs and outputs are clearly identified. 
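To tie these requirements together, here is a hedged sketch of such a master script; every folder name, file name, and package listed is a hypothetical example to be adapted to the actual project.

* Master script for a hypothetical replication package
* A reviewer should only need to edit the single line below
global myProject "C:/Users/reviewer/Downloads/replication-package"

* Fix system settings so results do not depend on the reviewer's machine
version 13.1
set varabbrev off
set matsize 10000     // adjust to the maximum allowed by your Stata edition

* Install user-written commands required by the project
ssc install estout , replace

* Run the project from start to finish
do "${myProject}/dofiles/1-cleaning.do"
do "${myProject}/dofiles/2-construction.do"
do "${myProject}/dofiles/3-analysis.do"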
A new user should, for example, be able to easily identify and remove @@ -364,7 +364,8 @@ \subsection{Publishing code for replication} such as ensuring that the raw components of figures or tables are clearly identified. For example, outputs should clearly correspond by name to an exhibit in the paper, and vice versa. (Supplying a compiling \LaTeX\ document can support this.) -Code and outputs which are not used should be removed. +Code and outputs which are not used should be removed -- +if you are using GitHub, consider making them available in a different branch for transparency. \subsection{Releasing a replication package} @@ -380,7 +381,7 @@ \subsection{Releasing a replication package} GitHub provides one solution. Making your GitHub repository public is completely free for finalized projects. -The site can hold any file types, +It can hold any file types, provide a structured download of your whole project, and allow others to look at alternate versions or histories easily. It is straightforward to simply upload a fixed directory to GitHub @@ -415,7 +416,7 @@ \subsection{Releasing a replication package} without having to download your tools and match your local environment when packages and other underlying softwares may have changed since publication. -In addition to the code and data, +In addition to code and data, you may also want to release an author's copy or preprint of the article itself along with these raw materials. Check with your publisher before doing so; From f477d2a10982be19b292ce8f1a3c8f9eb5f5d0bc Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 7 Jan 2020 09:19:07 -0500 Subject: [PATCH 164/854] [Ch 4] Minor changes --- chapters/sampling-randomization-power.tex | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 06e0bdcc1..d4c861e6e 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -104,7 +104,7 @@ \subsection{Reproducibility in random Stata processes} If anything is different, the underlying randomization algorithms may have changed, and it will be impossible to recover the original result. In Stata, the \texttt{version} command ensures that the software algorithm is fixed. -We recommend using \texttt{version 13.1} for back-compatibility; +We recommend using \texttt{version 13.1} for backward compatibility; the algorithm was changed after Stata 14 but its improvements do not matter in practice. (Note that you will \textit{never} be able to transfer a randomization to another software such as R.) The \texttt{ieboilstart} command in \texttt{ietoolkit} provides functionality to support this requirement.\sidenote{ @@ -115,8 +115,8 @@ \subsection{Reproducibility in random Stata processes} via the master do-file may produce different results, since Stata's \texttt{version} expires after execution just like a \texttt{local}. -\textbf{Sorting} means that the actual data that the random process is run on is fixed; -because numbers are assigned to each observation in sequence, +\textbf{Sorting} means that the actual data that the random process is run on is fixed. +Because numbers are assigned to each observation in sequence, changing their order will change the result of the process. A corollary is that the underlying data must be unchanged between runs: you must make a fixed final copy of the data when you run a randomization for fieldwork. 
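A hedged sketch of how these rules come together in code is below; the data file, ID variable, and seed value are placeholders, and the seed should be drawn once, recorded in the code, and never changed afterwards.

* Reproducible randomization: fixed version, fixed data, stable sort, recorded seed
version 13.1                                   // fix the sorting and random-number algorithms
use "${myProject}/master-sample.dta" , clear   // a final, unchanging copy of the master data
isid household_id , sort                       // confirm the ID is unique and sort on it
set seed 287608                                // placeholder seed; document the real one

gen rand_order = runiform()
sort rand_order
gen treatment  = (_n <= _N/2)                  // assign the first half to treatment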
@@ -213,7 +213,7 @@ \subsection{Sampling} \subsection{Randomization} -\textbf{Randomization} is the process of assigning units to some kind of treatment program. +\textbf{Randomization}, in this context, is the process of assigning units into treatment arms. Most of the Stata commands used for sampling can be directly transferred to randomization, since randomization is also a process of splitting a sample into groups. Where sampling determines whether a particular individual @@ -222,7 +222,7 @@ \subsection{Randomization} Randomizing a treatment guarantees that, \textit{on average}, the treatment will not be correlated with anything it did not cause.\cite{duflo2007using} Causal inference from randomization therefore depends on a specific counterfactual: -that the units who recieved the treatment program might not have done so. +that the units who received the treatment program might not have done so. Therefore, controlling the exact probability that each individual receives treatment is the most important part of a randomization process, and must be carefully worked out in more complex designs. @@ -335,7 +335,7 @@ \subsection{Stratification} This accounts for the fact that randomizations were conducted within the strata, comparing units to the others within its own strata by correcting for the local mean. Stratification is typically used for sampling -in order to ensure that individuals with various types will be observed; +in order to ensure that individuals with relevant characteristics will be observed; no adjustments are necessary as long as the sampling proportion is constant across all strata. One common pitfall is to vary the sampling or randomization \textit{probability} across different strata (such as ``sample/treat all female heads of household''). @@ -409,7 +409,7 @@ \subsection{Power calculations} or a fraction of a standard deviation, then it is nonsensical to run a study whose MDE is much larger than that. Conversely, the \textbf{minimum sample size} pre-specifies expected effects -and tells you how large a study would need to be to detect that effect. +and tells you how large a study's sample would need to be to detect that effect. 
Stata has some commands that can calculate power analytically
 for very simple designs -- \texttt{power} and \texttt{clustersampsi} --

From 4ad35ba1dbd141373fa54c1090da1620ceadb829 Mon Sep 17 00:00:00 2001
From: Maria
Date: Fri, 10 Jan 2020 10:57:26 -0500
Subject: [PATCH 165/854] Ch5: rewrite questionnaire design

Updated questionnaire design (first section)
---
 chapters/data-collection.tex | 286 ++++++++++++++++-------------------
 1 file changed, 132 insertions(+), 154 deletions(-)

diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex
index a933aa187..fcf85773e 100644
--- a/chapters/data-collection.tex
+++ b/chapters/data-collection.tex
@@ -1,6 +1,58 @@
 %------------------------------------------------
 
 \begin{fullwidth}
+
+ %PLACEHOLDER FOR NEW INTRO
+
+\end{fullwidth}
+
+%------------------------------------------------
+
+\section{Designing CAPI questionnaires}
+A well-designed questionnaire results from careful planning, consideration of analysis and indicators, close review of existing questionnaires, survey pilots, and research team and stakeholder review. Although most surveys are now collected electronically -- Computer Assisted Personal Interviews (CAPI) --
+\textbf{questionnaire design}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Questionnaire_Design}}
+\index{questionnaire design}
+(content development) and questionnaire programming (functionality development) should be seen as two strictly separate tasks. The research team should agree on all questionnaire content and design a paper version before programming a CAPI version. This facilitates a focus on content during the design process, rather than technical programming details, and ensures teams have a readable, printable paper version of their questionnaire.
+
+An easy-to-read paper version of the questionnaire is particularly critical for training enumerators, so they can get an overview of the survey content and structure before diving into the programming. It is much easier for enumerators to understand all possible response pathways from a paper version than from swiping question by question. Finalizing the questionnaire before programming also avoids version control concerns that arise from concurrent work on paper and electronic survey instruments. In addition, a paper questionnaire is an important piece of documentation for data publication.
+
+The workflow for designing a questionnaire will feel much like writing an essay, or writing pseudocode: begin from broad concepts and slowly flesh out the specifics. It is essential to start with a clear understanding of the
+\textbf{theory of change}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Theory_of_Change}}
+and \textbf{experimental design} for your project. The first step of questionnaire design is to list key outcomes of interest, as well as the main covariates and variables needed for experimental design. The ideal starting point for this is a
+\textbf{pre-analysis plan}.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Pre-Analysis_Plan}}
+
+Use the list of key outcomes to create an outline of questionnaire \textit{modules} (do not number the modules yet; instead use a short prefix so they can be easily reordered). Each module should then be expanded into specific indicators to observe in the field.
+\sidenote{\url{https://dimewiki.worldbank.org/wiki/Literature_Review_for_Questionnaire}} +At this point, it is useful to do a +\textbf{content-focused pilot} \sidenote{url{https://dimewiki.worldbank.org/wiki/Piloting_Survey_Content}} of the questionnaire. +Doing this pilot with a pen-and-paper questionnaire encourages more significant revisions, as there is no need to factor in costs of re-programming, and as a result improves the overall quality of the survey instrument. + +Once the content of the questionnaire is finalized, it should be translated into appropriate language(s). Only then is it time to proceed with the CAPI programming. + + + +\textbf{Extensive tracking} sections -- +in which reasons for \textbf{attrition}, treatment \textbf{contamination}, and +\textbf{loss to follow-up} can be documented -- +\index{attrition}\index{contamination} +are essential data components for completing CONSORT\sidenote[][-3.5cm]{\textbf{CONSORT:} a standardized system for reporting enrollment, intervention allocation, follow-up, and data analysis through the phases of a randomized trial of two groups.} records.\sidenote[][-1.5cm]{Begg, C., Cho, M., Eastwood, S., + Horton, R., Moher, D., Olkin, I., Pitkin, + R., Rennie, D., Schulz, K. F., Simel, D., + et al. (1996). Improving the quality of + reporting of randomized controlled + trials: The CONSORT statement. \textit{JAMA}, + 276(8):637--639} + +From a data perspective, there are a few important points to keep in mind for all quantitative analysis of survey data (regardless of sector). First, questions with pre-coded response options are always preferable to open-ended questions (the content-based pilot is an excellent time to ask open-ended questions, and refine responses for the final version of the questionnaire). Coding responses helps to ensure that the data will be useful for quantitative analysis. + +Two examples help illustrate the point. First, instead of asking ``How do you feel about the proposed policy change?'', use techniques like +\textbf{Likert scales}\sidenote{\textbf{Likert scale:} an ordered selection of choices indicating the respondent's level of agreement or disagreement with a proposed statement.}. Second, if collecting data on medication use or supplies, you could collect: the brand name of the product; the generic name of the product; the coded compound of the product; or the broad category to which each product belongs (antibiotic, etc.). All four may be useful for different reasons, but the latter two are likely to be the most useful for data analysis. The coded compound requires providing a translation dictionary to field staff, but enables automated rapid recoding for analysis with no loss of information. The generic class requires agreement on the broad categories of interest, but allows for much more comprehensible top-line statistics and data quality checks. + +%------------------------------------------------ + +\section{Programming CAPI questionnaires} +\sidenote{\url{https://dimewiki.worldbank.org/wiki/Questionnaire_Programming}} + Most data collection is now done using digital data entry using tools that are specially designed for surveys. These tools, called \textbf{computer-assisted personal interviewing (CAPI)} software, @@ -18,15 +70,10 @@ planning data structure during survey design, developing surveys that are easy to control for quality and security, and having proper file storage ready for sensitive PII data. 
-\end{fullwidth} - -%------------------------------------------------ -\section{Designing CAPI questionnaires} CAPI surveys\sidenote{\url{https://dimewiki.worldbank.org/wiki/Computer-Assisted_Personal_Interviews_(CAPI)}} -are primarily created in Excel, Google Sheets, -or software-specific form builders +are primarily created in a spreadsheet (Excel or Google Sheets),or software-specific form builders making them one of the few research outputs for which little coding is required.\sidenote{\url{https://dimewiki.worldbank.org/wiki/SurveyCTO_Coding_Practices}} The \texttt{ietestform} command,\sidenote{\url{https://dimewiki.worldbank.org/wiki/ietestform}} part of \texttt{iefieldkit}, implements form-checking routines @@ -36,161 +83,19 @@ \section{Designing CAPI questionnaires} both the field team and the data team should collaborate to make sure that the survey suits all needs.\cite{krosnick2018questionnaire} -Generally, this collaboration means building the experimental design -fundamentally into the structure of the survey. -In addition to having prepared a unique anonymous ID variable -using the master data, +In addition to having prepared a unique anonymous ID variable using the master data, that ID should be built into confirmation checks in the survey form. -When ID matching and tracking across rounds is essential, -the survey should be prepared to verify new data +When ID matching and tracking across rounds is essential, the survey should be prepared to verify new data against \textbf{preloaded data} from master records or from other rounds. -\textbf{Extensive tracking} sections -- -in which reasons for \textbf{attrition}, treatment \textbf{contamination}, and -\textbf{loss to follow-up} can be documented -- -\index{attrition}\index{contamination} -are essential data components for completing CONSORT\sidenote[][-3.5cm]{\textbf{CONSORT:} a standardized system for reporting enrollment, intervention allocation, follow-up, and data analysis through the phases of a randomized trial of two groups.} records.\sidenote[][-1.5cm]{Begg, C., Cho, M., Eastwood, S., - Horton, R., Moher, D., Olkin, I., Pitkin, - R., Rennie, D., Schulz, K. F., Simel, D., - et al. (1996). Improving the quality of - reporting of randomized controlled - trials: The CONSORT statement. \textit{JAMA}, - 276(8):637--639} -\textbf{Questionnaire design}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Questionnaire_Design}} -\index{questionnaire design} -is the first task where the data team -and the field team must collaborate on data structure.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Preparing_for_Field_Data_Collection}} -Questionnaire design should always be a separated task from questionnaire programming, meaning -that the team should have designed and agreed on a questionnaire before that questionnaire is -programmed in the CAPI software used. Otherwise teams tend to spend more time discussing technical -programming details, rather than the content in the questionnaire. The field-oriented staff and the -PIs will likely prefer to capture a large amount of detailed \textit{information} -in the field, some of which will serve very poorly as \textit{data}.\sidenote{\url{ -https://iriss.stanford.edu/sites/g/files/sbiybj6196/f/questionnaire\_design\_1.pdf}} -In particular, \textbf{open-ended responses} and questions which will have -many null or missing responses by design will not be very useful -in statistical analyses unless pre-planned. 
-You must work with the field team to determine the appropriate amount -of abstraction inherent in linking concepts to responses.\sidenote{\url{ -https://www.povertyactionlab.org/sites/default/files/documents/Instrument\%20Design\_Diva\_final.pdf}} -For example, it is always possible to ask for open-ended responses to questions, -but it is far more useful to ask for things like \textbf{Likert scale}\sidenote{\textbf{Likert scale:} an ordered selection of choices -indicating the respondent's level of agreement or disagreement -with a proposed statement.} responses -instead of asking, for instance, -``How do you feel about the proposed policy change?'' - -Coded responses are always more useful than open-ended responses, -because they reduce the time necessary for post-processing by -expensive specialized staff. -For example, if collecting data on medication use or supplies, -you could collect: the brand name of the product; -the generic name of the product; -the coded compound of the product; -or the broad category to which each product belongs (antibiotic, etc.). -All four may be useful for different reasons, -but the latter two are likely to be the most useful for the analyst. -The coded compound requires providing a translation dictionary -to field staff, but enables automated rapid recoding for analysis -with no loss of information. -The generic class requires agreement on the broad categories of interest, -but allows for much more comprehensible top-line statistics and data quality checks. - -Broadly, the questionnaire should be designed as follows. -The workflow will feel much like writing an essay: -begin from broad concepts and slowly flesh out the specifics. -The \textbf{theory of change}, \textbf{experimental design}, -and any \textbf{pre-analysis plans} should be discussed -and the structure of required data for those conceptualized first.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Literature_Review_for_Questionnaire}} -Next, the conceptual outcomes of interest, as well as the main covariates, classifications, -and other variables needed for the experimental design should be listed out. -The questionnaire \textit{modules} should be outlined based on this list. -At this stage, modules should not be numbered -- -they should use a short prefix so they can be easily reordered. -Each module should then be expanded into specific indicators to observe in the field. There is not yet a full consensus over how individual questions should be identified: formats like \texttt{hq\_1} are hard to remember and unpleasant to reorder, but formats like \texttt{hq\_asked\_about\_loans} quickly become cumbersome. -Finally, the questionnaire can be \textbf{piloted}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Survey_Pilot}} -in a non-experimental sample. -Revisions are made, and the survey is then translated into the appropriate language and programmed electronically.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Questionnaire_Programming}} - -%------------------------------------------------ -\section{Collecting data securely} +\section{Piloting} +\textbf{piloted}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Survey_Pilot}} -At all points in the data collection and handling process, -access to personally-identifying information (PII) -must always be restricted to the members of the team -who have been granted that permission by the appropriate agreement -(typically an IRB approving data collection or partner agency providing it). 
-Any established data collection platform will always \textbf{encrypt}\sidenote{\textbf{Encryption:} the process of making information unreadable -to anyone without access to a specific deciphering key. \\ \url{https://dimewiki.worldbank.org/wiki/Encryption}} -all data submitted from the field automatically while in transit -(i.e., upload or download), so if you use servers hosted by SurveyCTO -or SurveySolutions this is nothing you need to worry about. -Your data will be encrypted from the time it leaves the device -(in tablet-assisted data collation) or your browser (in web data collection) -until it reaches the server. - -\textbf{Encryption at rest} is the only way to ensure -that PII data remains private when it is stored -on someone else's server on the open internet. -Encryption makes data files completely unusable -without access to a security key specific to that data -- -a higher level of security than password-protection. -The World Bank's and many of our donors' security requirements -for data storage can only be fulfilled by this method. -We recommend keeping your data encrypted whenever PII data is collected -- -therefore, we recommend it for all field data collection. - -Encryption in cloud storage, by contrast, is not enabled by default. -This is because the service will not encrypt user data unless you confirm -you know how to operate the encryption system and understand the consequences if basic protocols are not followed. -Encryption at rest is different from password-protection: -encryption at rest makes the underlying data itself unreadable, -even if accessed, except to users who have a specific private \textbf{keyfile}. -Encryption at rest requires active participation from you, the user, -and you should be fully aware that if your private key is lost, -there is absolutely no way to recover your data. - -To enable data encryption in SurveyCTO, for example, -simply select the encryption option -when you create a new form on a SurveyCTO server. -Other data collection services should work similarly, -but the exact implementation of encryption at rest -will vary slightly from service to service. -At that time, the service will allow you to download -- once -- -the keyfile pair needed to decrypt the data. -You must download and store this in a secure location. -Make sure you store keyfiles with descriptive names to match the survey to which it corresponds. -Any time you access the data - either when viewing it in browser or syncing it to your -computer - the user will be asked to provide this keyfile. It could differ between the software, -but typically you would copy that keyfile from where it is stored (for example LastPass) -to your desktop, point to it, and the rest is automatic. -After each time you use the keyfile, delete it from your desktop, -but not from your password manager. - -Finally, you should ensure that all teams take basic precautions -to ensure the security of data, as most problems are due to human error. -Most importantly, all computers, tablets, and accounts used -\textit{must} have a logon password associated with them. -Ideally, the machine hard drives themselves should also be encrypted. -This policy should also be applied to physical data storage -such as flash drives and hard drives; -similarly, files sent to the field containing PII data -such as the sampling list should at least be password-protected. -This can be done using a zip-file creator. -LastPass can also be used to share passwords securely, -and you cannot share passwords across email. 
-This step significantly mitigates the risk in case there is -a security breach such as loss, theft, hacking, or a virus, -and adds very little hassle to utilization. - - -%------------------------------------------------ - -\section{Overseeing fieldwork and quality assurance} +\section{Data quality assurance} While the team is in the field, the research assistant and field coordinator will be jointly responsible @@ -314,7 +219,76 @@ \section{Overseeing fieldwork and quality assurance} %------------------------------------------------ -\section{Receiving primary data} +\section{Collecting data securely} + +At all points in the data collection and handling process, +access to personally-identifying information (PII) +must always be restricted to the members of the team +who have been granted that permission by the appropriate agreement +(typically an IRB approving data collection or partner agency providing it). +Any established data collection platform will always \textbf{encrypt}\sidenote{\textbf{Encryption:} the process of making information unreadable +to anyone without access to a specific deciphering key. \\ \url{https://dimewiki.worldbank.org/wiki/Encryption}} +all data submitted from the field automatically while in transit +(i.e., upload or download), so if you use servers hosted by SurveyCTO +or SurveySolutions this is nothing you need to worry about. +Your data will be encrypted from the time it leaves the device +(in tablet-assisted data collation) or your browser (in web data collection) +until it reaches the server. + +\textbf{Encryption at rest} is the only way to ensure +that PII data remains private when it is stored +on someone else's server on the open internet. +Encryption makes data files completely unusable +without access to a security key specific to that data -- +a higher level of security than password-protection. +The World Bank's and many of our donors' security requirements +for data storage can only be fulfilled by this method. +We recommend keeping your data encrypted whenever PII data is collected -- +therefore, we recommend it for all field data collection. + +Encryption in cloud storage, by contrast, is not enabled by default. +This is because the service will not encrypt user data unless you confirm +you know how to operate the encryption system and understand the consequences if basic protocols are not followed. +Encryption at rest is different from password-protection: +encryption at rest makes the underlying data itself unreadable, +even if accessed, except to users who have a specific private \textbf{keyfile}. +Encryption at rest requires active participation from you, the user, +and you should be fully aware that if your private key is lost, +there is absolutely no way to recover your data. + +To enable data encryption in SurveyCTO, for example, +simply select the encryption option +when you create a new form on a SurveyCTO server. +Other data collection services should work similarly, +but the exact implementation of encryption at rest +will vary slightly from service to service. +At that time, the service will allow you to download -- once -- +the keyfile pair needed to decrypt the data. +You must download and store this in a secure location. +Make sure you store keyfiles with descriptive names to match the survey to which it corresponds. +Any time you access the data - either when viewing it in browser or syncing it to your +computer - the user will be asked to provide this keyfile. 
It could differ between the software, +but typically you would copy that keyfile from where it is stored (for example LastPass) +to your desktop, point to it, and the rest is automatic. +After each time you use the keyfile, delete it from your desktop, +but not from your password manager. + +Finally, you should ensure that all teams take basic precautions +to ensure the security of data, as most problems are due to human error. +Most importantly, all computers, tablets, and accounts used +\textit{must} have a logon password associated with them. +Ideally, the machine hard drives themselves should also be encrypted. +This policy should also be applied to physical data storage +such as flash drives and hard drives; +similarly, files sent to the field containing PII data +such as the sampling list should at least be password-protected. +This can be done using a zip-file creator. +LastPass can also be used to share passwords securely, +and you cannot share passwords across email. +This step significantly mitigates the risk in case there is +a security breach such as loss, theft, hacking, or a virus, +and adds very little hassle to utilization. + In this section, you finally get your hands on some data! What do we do with it? Data handling is one of the biggest @@ -360,3 +334,7 @@ \section{Receiving primary data} but the decryption and usage of the raw data is a manual process. With the raw data securely stored and backed up, you are ready to move to de-identification, data cleaning, and analysis. +%------------------------------------------------ + + + From e24e559efd1acf5cffef23729624950f7ec6a1f1 Mon Sep 17 00:00:00 2001 From: Maria Date: Fri, 10 Jan 2020 11:27:16 -0500 Subject: [PATCH 166/854] Ch5 rewrite Update to questionnaire design --- chapters/data-collection.tex | 25 ++++++++++--------------- 1 file changed, 10 insertions(+), 15 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index fcf85773e..f09ac4866 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -27,26 +27,21 @@ \section{Designing CAPI questionnaires} \textbf{content-focused pilot} \sidenote{url{https://dimewiki.worldbank.org/wiki/Piloting_Survey_Content}} of the questionnaire. Doing this pilot with a pen-and-paper questionnaire encourages more significant revisions, as there is no need to factor in costs of re-programming, and as a result improves the overall quality of the survey instrument. -Once the content of the questionnaire is finalized, it should be translated into appropriate language(s). Only then is it time to proceed with the CAPI programming. - +\subsection{Data-focused issues in questionnaire design} +From a data perspective, questions with pre-coded response options are always preferable to open-ended questions (the content-based pilot is an excellent time to ask open-ended questions, and refine responses for the final version of the questionnaire). Coding responses helps to ensure that the data will be useful for quantitative analysis. Two examples help illustrate the point. First, instead of asking ``How do you feel about the proposed policy change?'', use techniques like +\textbf{Likert scales}\sidenote{\textbf{Likert scale:} an ordered selection of choices indicating the respondent's level of agreement or disagreement with a proposed statement.}. 
Second, if collecting data on medication use or supplies, you could collect: the brand name of the product; the generic name of the product; the coded compound of the product; or the broad category to which each product belongs (antibiotic, etc.). All four may be useful for different reasons, but the latter two are likely to be the most useful for data analysis. The coded compound requires providing a translation dictionary to field staff, but enables automated rapid recoding for analysis with no loss of information. The generic class requires agreement on the broad categories of interest, but allows for much more comprehensible top-line statistics and data quality checks. -\textbf{Extensive tracking} sections -- -in which reasons for \textbf{attrition}, treatment \textbf{contamination}, and -\textbf{loss to follow-up} can be documented -- +\textbf{Extensive tracking} sections -- in which reasons for \textbf{attrition}, treatment \textbf{contamination}, and \textbf{loss to follow-up} can be documented -- \index{attrition}\index{contamination} -are essential data components for completing CONSORT\sidenote[][-3.5cm]{\textbf{CONSORT:} a standardized system for reporting enrollment, intervention allocation, follow-up, and data analysis through the phases of a randomized trial of two groups.} records.\sidenote[][-1.5cm]{Begg, C., Cho, M., Eastwood, S., - Horton, R., Moher, D., Olkin, I., Pitkin, - R., Rennie, D., Schulz, K. F., Simel, D., - et al. (1996). Improving the quality of - reporting of randomized controlled - trials: The CONSORT statement. \textit{JAMA}, - 276(8):637--639} +are essential data components for completing CONSORT +\sidenote[][-3.5cm]{\textbf{CONSORT:} a standardized system for reporting enrollment, intervention allocation, follow-up, and data analysis through the phases of a randomized trial of two groups.} +records. +\sidenote[][-1.5cm]{Begg, C., Cho, M., Eastwood, S., Horton, R., Moher, D., Olkin, I., Pitkin, R., Rennie, D., Schulz, K. F., Simel, D., et al. (1996). Improving the quality of reporting of randomized controlled trials: The CONSORT statement. \textit{JAMA}, 276(8):637--639} -From a data perspective, there are a few important points to keep in mind for all quantitative analysis of survey data (regardless of sector). First, questions with pre-coded response options are always preferable to open-ended questions (the content-based pilot is an excellent time to ask open-ended questions, and refine responses for the final version of the questionnaire). Coding responses helps to ensure that the data will be useful for quantitative analysis. -Two examples help illustrate the point. First, instead of asking ``How do you feel about the proposed policy change?'', use techniques like -\textbf{Likert scales}\sidenote{\textbf{Likert scale:} an ordered selection of choices indicating the respondent's level of agreement or disagreement with a proposed statement.}. Second, if collecting data on medication use or supplies, you could collect: the brand name of the product; the generic name of the product; the coded compound of the product; or the broad category to which each product belongs (antibiotic, etc.). All four may be useful for different reasons, but the latter two are likely to be the most useful for data analysis. The coded compound requires providing a translation dictionary to field staff, but enables automated rapid recoding for analysis with no loss of information. 
The generic class requires agreement on the broad categories of interest, but allows for much more comprehensible top-line statistics and data quality checks. +Once the content of the questionnaire is finalized, it should be translated into appropriate language(s). Only then is it time to proceed with the CAPI programming. + %------------------------------------------------ From 3e2cb225b7e51f123ef0fcd790ef6f5ca683cc38 Mon Sep 17 00:00:00 2001 From: Maria Date: Fri, 10 Jan 2020 14:51:00 -0500 Subject: [PATCH 167/854] Ch5 re-write Updated questionnaire programming, data quality assurance, and data security sections --- chapters/data-collection.tex | 370 ++++++++++++----------------------- 1 file changed, 123 insertions(+), 247 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index f09ac4866..67cfd4f24 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -3,6 +3,11 @@ \begin{fullwidth} %PLACEHOLDER FOR NEW INTRO + Here we focus on tools and workflows that are primarily conceptual, rather than software-specific. This chapter should provide a motivation for + planning data structure during survey design, + developing surveys that are easy to control for quality and security, + and having proper file storage ready for sensitive PII data. + \end{fullwidth} @@ -12,7 +17,7 @@ \section{Designing CAPI questionnaires} A well-designed questionnaire results from careful planning, consideration of analysis and indicators, close review of existing questionnaires, survey pilots, and research team and stakeholder review.Although most surveys are now collected electronically -- Computer Assisted Personal Interviews (CAPI) -- \textbf{Questionnaire design}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Questionnaire_Design}} \index{questionnaire design} -(content development) and questionnaire programming (functionality development) should be seen as two strictly separate tasks. The research team should agree on all questionnaire content and design a paper version before programming a CAPI version. This facilitates a focus on content during the design process, rather than technical programming details, and ensures teams have a readable, printable paper version of their questionnaire. +(content development) and questionnaire programming (functionality development) should be seen as two strictly separate tasks. By focusing on content first and programming implementation later, the survey design quality is better than when the questionnaire is set up in a way which is technically convenient to program. The research team should agree on all questionnaire content and design a paper version before programming a CAPI version. This facilitates a focus on content during the design process and ensures teams have a readable, printable paper version of their questionnaire. An easy-to-read paper version of the questionnaire is particularly critical for training enumerators, so they can get an overview of the survey content and structure before diving into the programming. It is much easier for enumerators to understand all possible response pathways from a paper version, than from swiping question by question. Finalizing the questionnaire before programming also avoids version control concerns that arise from concurrent work on paper and electronic survey instruments. In addition, a paper questionnaire is an important documentation for data publication. 
@@ -21,24 +26,39 @@ \section{Designing CAPI questionnaires} and \textbf{experimental design} for your project.The first step of questionnaire design is to list key outcomes of interest, as well as the main covariates and variables needed for experimental design. The ideal starting point for this is a \textbf{pre-analysis plan}. \sidenote{\url{https://dimewiki.worldbank.org/wiki/Pre-Analysis_Plan}} -Use the list of key outcomes to create an outline of questionnaire \textit{modules} (do not number the modules yet; instead use a short prefix so they can be easily reordered). Each module should then be expanded into specific indicators to observe in the field. +Use the list of key outcomes to create an outline of questionnaire \textit{modules} (do not number the modules yet; instead use a short prefix so they can be easily reordered). For each module, determine if the module is applicable to the full sample, the appropriate respondent, and whether (or how often), the module should be repeated. A few examples: a module on maternal health only applies to household with a woman who has children, a household income module should be answered by the person responsible for household finances, and a module on agricultural production might be repeated for each crop the household cultivated. + +Each module should then be expanded into specific indicators to observe in the field. \sidenote{\url{https://dimewiki.worldbank.org/wiki/Literature_Review_for_Questionnaire}} At this point, it is useful to do a \textbf{content-focused pilot} \sidenote{url{https://dimewiki.worldbank.org/wiki/Piloting_Survey_Content}} of the questionnaire. Doing this pilot with a pen-and-paper questionnaire encourages more significant revisions, as there is no need to factor in costs of re-programming, and as a result improves the overall quality of the survey instrument. -\subsection{Data-focused issues in questionnaire design} +\subsection{Questionnaire design considerations for quantitative analysis} +This book covers surveys designed to yield datasets useful for quantitative analysis. This is a subset of surveys, and there are specific design considerations that will help to ensure the raw data outputs are ready for analysis. +\subsubsection{Coded response options} From a data perspective, questions with pre-coded response options are always preferable to open-ended questions (the content-based pilot is an excellent time to ask open-ended questions, and refine responses for the final version of the questionnaire). Coding responses helps to ensure that the data will be useful for quantitative analysis. Two examples help illustrate the point. First, instead of asking ``How do you feel about the proposed policy change?'', use techniques like \textbf{Likert scales}\sidenote{\textbf{Likert scale:} an ordered selection of choices indicating the respondent's level of agreement or disagreement with a proposed statement.}. Second, if collecting data on medication use or supplies, you could collect: the brand name of the product; the generic name of the product; the coded compound of the product; or the broad category to which each product belongs (antibiotic, etc.). All four may be useful for different reasons, but the latter two are likely to be the most useful for data analysis. The coded compound requires providing a translation dictionary to field staff, but enables automated rapid recoding for analysis with no loss of information. 
The generic class requires agreement on the broad categories of interest, but allows for much more comprehensible top-line statistics and data quality checks. -\textbf{Extensive tracking} sections -- in which reasons for \textbf{attrition}, treatment \textbf{contamination}, and \textbf{loss to follow-up} can be documented -- +\subsubsection{Sample tracking} +\textbf{Extensive tracking} sections - in which reasons for \textbf{attrition}, treatment \textbf{contamination}, and \textbf{loss to follow-up} are documented - \index{attrition}\index{contamination} are essential data components for completing CONSORT \sidenote[][-3.5cm]{\textbf{CONSORT:} a standardized system for reporting enrollment, intervention allocation, follow-up, and data analysis through the phases of a randomized trial of two groups.} records. \sidenote[][-1.5cm]{Begg, C., Cho, M., Eastwood, S., Horton, R., Moher, D., Olkin, I., Pitkin, R., Rennie, D., Schulz, K. F., Simel, D., et al. (1996). Improving the quality of reporting of randomized controlled trials: The CONSORT statement. \textit{JAMA}, 276(8):637--639} +\subsubsection{How to name questions} +% needs update +There is not yet a full consensus over how individual questions should be identified: +formats like \texttt{hq\_1} are hard to remember and unpleasant to reorder, +but formats like \texttt{hq\_asked\_about\_loans} quickly become cumbersome. + +\subsubsection{importance of a unique ID} +When ID matching and tracking across rounds is essential, the survey should be prepared to verify new data +against \textbf{preloaded data} from master records or from other rounds. + Once the content of the questionnaire is finalized, it should be translated into appropriate language(s). Only then is it time to proceed with the CAPI programming. @@ -48,28 +68,20 @@ \subsection{Data-focused issues in questionnaire design} \section{Programming CAPI questionnaires} \sidenote{\url{https://dimewiki.worldbank.org/wiki/Questionnaire_Programming}} -Most data collection is now done using digital data entry -using tools that are specially designed for surveys. -These tools, called \textbf{computer-assisted personal interviewing (CAPI)} software, -provide a wide range of features designed to make -implementing even highly complex surveys easy, scalable, and secure. -However, these are not fully automatic: -you still need to actively design and manage the survey. -Each software has specific practices that you need to follow -to enable features such as Stata-compatibility and data encryption. - -You can work in any software you like, -and this guide will present tools and workflows -that are primarily conceptual: -this chapter should provide a motivation for -planning data structure during survey design, -developing surveys that are easy to control for quality and security, -and having proper file storage ready for sensitive PII data. - - -CAPI surveys\sidenote{\url{https://dimewiki.worldbank.org/wiki/Computer-Assisted_Personal_Interviews_(CAPI)}} -are primarily created in a spreadsheet (Excel or Google Sheets),or software-specific form builders -making them one of the few research outputs for which little coding is required.\sidenote{\url{https://dimewiki.worldbank.org/wiki/SurveyCTO_Coding_Practices}} +Most data collection is now done using software tools specifically designed for surveys. CAPI surveys +\sidenote{\url{https://dimewiki.worldbank.org/wiki/Computer-Assisted_Personal_Interviews_(CAPI)}} +are typically created in a spreadsheet (e.g. 
Excel or Google Sheets), or software-specific form builder. As these are typically accessible even to novice users, we will not address software-specific form design in this book. Rather, we focus coding conventions that are important to follow regardless of CAPI software choice, rather than software-specific form design. +\sidenote{\url{https://dimewiki.worldbank.org/wiki/SurveyCTO_Coding_Practices}} + +CAPI software tools provide a wide range of features designed to make implementing even highly complex surveys easy, scalable, and secure. However, these are not fully automatic: you still need to actively design and manage the survey. Each software has specific practices that you need to follow to take advantage of CAPI features, encrypt you data, and ensure that the exported data is compatible with the software that will be used for analysis. + +\section{CAPI features} +\subsection{uniquely identified data} +As it is critical to be able to uniquely identify each observation, and link it to the original sample, build in a programming check to confirm the numeric ID and identifying details of the household. + +\section{Data encryption} + +\subsection{compatibility with data analysis software} The \texttt{ietestform} command,\sidenote{\url{https://dimewiki.worldbank.org/wiki/ietestform}} part of \texttt{iefieldkit}, implements form-checking routines for \textbf{SurveyCTO}, a proprietary implementation of \textbf{Open Data Kit (ODK)}. @@ -78,119 +90,42 @@ \section{Programming CAPI questionnaires} both the field team and the data team should collaborate to make sure that the survey suits all needs.\cite{krosnick2018questionnaire} -In addition to having prepared a unique anonymous ID variable using the master data, -that ID should be built into confirmation checks in the survey form. -When ID matching and tracking across rounds is essential, the survey should be prepared to verify new data -against \textbf{preloaded data} from master records or from other rounds. +\subsection{version control} -There is not yet a full consensus over how individual questions should be identified: -formats like \texttt{hq\_1} are hard to remember and unpleasant to reorder, -but formats like \texttt{hq\_asked\_about\_loans} quickly become cumbersome. +%------------------------------------------------ \section{Piloting} \textbf{piloted}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Survey_Pilot}} +https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Content)&printable=yes +https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Data)&printable=yes + +%------------------------------------------------ \section{Data quality assurance} -While the team is in the field, the research assistant -and field coordinator will be jointly responsible -for making sure that the survey is progressing correctly,\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data_Quality_Assurance_Plan}} -that the collected data matches the survey sample, -and that errors and duplicate observations are resolved -quickly so that the field team can make corrections.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Duplicates_and_Survey_Logs}} -Modern survey software makes it relatively easy -to control for issues in individual surveys, -using a combination of in-built features -such as hard constraints on answer ranges -and soft confirmations or validation questions. 
- -These features allow you to spend more time -looking for issues that the software cannot check automatically.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Monitoring_Data_Quality}} -Namely, these will be suspicious patterns across multiple responses -or across a group of surveys rather than errors in any single response field -(those can often be flagged by the questionnaire software); -enumerators who are taking either too long or not long enough to complete their work, -``difficult'' groups of respondents who are systematically incomplete; and -systematic response errors. -These are typically done in two main forms: -high-frequency checks (HFCs) and back-checks. - -\textbf{High-frequency checks} are carried out on the data side.\sidenote{\url{https://github.com/PovertyAction/high-frequency-checks/wiki}} -First, observations need to be checked for duplicate entries: -\texttt{ieduplicates}\sidenote{\url{https://dimewiki.worldbank.org/wiki/ieduplicates}} -provides a workflow for collaborating on the resolution of -duplicate entries between you and the field team. -Next, observations must be validated against the sample list: -this is as straightforward as \texttt{merging} the sample list with -the survey data and checking for mismatches. -High-frequency checks should carefully inspect -key treatment and outcome variables -so that the data quality of core experimental variables is uniformly high, -and that additional field effort is centered where it is most important. -Finally, data quality checks -should be run on the data every time it is downloaded -to flag irregularities in observations, sample groups, or enumerators. -\texttt{ipacheck}\sidenote{\url{https://github.com/PovertyAction/high-frequency-checks}} is a very useful command that automates some of these tasks. - -Unfortunately, it is very hard to specify in general -what kinds of quality checks should be utilized, -since the content of surveys varies so widely. -Fortunately, you will know your survey very well -by the time it is programmed, and should have a good sense -of the types of things that would raise concerns -that you were unable to program directly into the survey. -Thankfully it is also easy to prepare high-frequency checking code in advance. -Once you have built and piloted the survey form, -you will have some bits of test data to play with. -Using this data, you should prepare code that outputs -a list of flags from any given dataset. -This HFC code is then ready to run every time you download the data, -and should allow you to rapidly identify issues -that are occurring in fieldwork. -You should also have a plan to address issues found with the field team. -Discuss in advance which inconsistencies will require a revisit, -and how you will communicate to the field teams what were the problems found. -Some issues will need immediate follow-up, -and it will be harder to solve them once the enumeration team leaves the area. +A huge advantage of CAPI surveys, compared to traditional paper surveys, is the ability to access and analyze the data in real time. +This greatly simplifies monitoring and improves data quality assurance. As part of survey preparation, the research team should develop a +\textbf{data quality assurance plan} \sidenote{\url{https://dimewiki.worldbank.org/wiki/Data_Quality_Assurance_Plan}}. 
While data collection is ongoing, a research assistant or data analyst should work closely with the field team to ensure that the survey is progressing correctly, and perform \textbf{high-frequency checks (HFCs)} of the incoming data. +\sidenote{\url{https://github.com/PovertyAction/high-frequency-checks/wiki}} +Ensuring high quality data requires a combination of both real-time data checks and field monitoring. For this book, we focus on high-frequency data checks, and specific data-related considerations for field monitoring. + +\subsection{High-frequency checks} +High-frequency checks should carefully inspect key treatment and outcome variables so that the data quality of core experimental variables is uniformly high, +and that additional field effort is centered where it is most important. Data quality checks should be run on the data every time it is downloaded (ideally on a daily basis), to flag irregularities in survey progress, sample completeness or response quality. +\texttt{ipacheck}\sidenote{\url{https://github.com/PovertyAction/high-frequency-checks}} +is a very useful command that automates some of these tasks. -\textbf{Back-checks}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Back_Checks}} -involve more extensive collaboration with the field team, -and are best thought of as direct data audits. -In back-checks, a random subset of the field sample is chosen -and basic information from the full survey is verified -using a small survey re-done with the same respondent. -Back-checks should be done with a substantial portion -of the full sample early in the survey -so that the enumerators and field team -get used to the idea of \textbf{quality assurance}. -Checks should continue throughout fieldwork, -and their content and targeting can be refined if particular -questionnaire items are flagged as error-prone -or specific enumerators or observations appear unusual. - -Back-checks cover three main types of questions. -First, they validate basic information that should not change. -This ensures the basic quality control that the right respondent -was interviewed or observed in a given survey, -and flags cases of outright quality failure that need action. -Second, they check the quality of enumeration, -particularly in cases that involve measurement or calculation -on the part of the enumerator. -This can be anything such as correctly calculating plot sizes, -family rosters, or income measurements. -These questions should be carefully validated -to determine whether they are reliable measures -and how much they may vary as a result of difficulty in measurement. -Finally, back-checks confirm that questions are being asked and answered -in a consistent fashion. Some questions, if poorly phrased, -can be hard for the enumerator to express or for all respondents -to understand in an identical fashion. -Changes in responses between original and back-check surveys -of the same respondent -can alert you and the team that changes need to be made -to portions of the survey. -\texttt{bcstats} is a Stata command for back checks -that takes the different question types into account when comparing surveys. + +\subsubsection{Sample completeness} +It is important to check every day that the households interviewed match the survey sample. Reporting errors and duplicate observations in real-time allows the field team to make corrections efficiently. 
+\sidenote{\url{https://dimewiki.worldbank.org/wiki/Duplicates_and_Survey_Logs}} +It also helps the team track attrition, so that it is clear early on if a change in protocols or additional tracking will be needed. It is also important to check interview progress and sample compliance by surveyor and survey team, to identify any under-performing individuals or teams. + +To assess sample completeness, observations first need to be checked for duplicate entries, which may occur due to field errors or duplicated submissions to the server. +\texttt{ieduplicates}\sidenote{\url{https://dimewiki.worldbank.org/wiki/ieduplicates}} +provides a workflow for collaborating on the resolution of duplicate entries between you and the field team. +Next, observed units in the data must be validated against the expected sample: +this is as straightforward as \texttt{merging} the sample list with the survey data and checking for mismatches. When all data collection is complete, the survey team should have a final field report @@ -212,124 +147,65 @@ \section{Data quality assurance} and loss to follow-up occurred in the field and how they were implemented and resolved. -%------------------------------------------------ -\section{Collecting data securely} - -At all points in the data collection and handling process, -access to personally-identifying information (PII) -must always be restricted to the members of the team -who have been granted that permission by the appropriate agreement -(typically an IRB approving data collection or partner agency providing it). -Any established data collection platform will always \textbf{encrypt}\sidenote{\textbf{Encryption:} the process of making information unreadable -to anyone without access to a specific deciphering key. \\ \url{https://dimewiki.worldbank.org/wiki/Encryption}} -all data submitted from the field automatically while in transit -(i.e., upload or download), so if you use servers hosted by SurveyCTO -or SurveySolutions this is nothing you need to worry about. -Your data will be encrypted from the time it leaves the device -(in tablet-assisted data collation) or your browser (in web data collection) -until it reaches the server. - -\textbf{Encryption at rest} is the only way to ensure -that PII data remains private when it is stored -on someone else's server on the open internet. -Encryption makes data files completely unusable -without access to a security key specific to that data -- -a higher level of security than password-protection. -The World Bank's and many of our donors' security requirements -for data storage can only be fulfilled by this method. -We recommend keeping your data encrypted whenever PII data is collected -- -therefore, we recommend it for all field data collection. - -Encryption in cloud storage, by contrast, is not enabled by default. -This is because the service will not encrypt user data unless you confirm -you know how to operate the encryption system and understand the consequences if basic protocols are not followed. -Encryption at rest is different from password-protection: -encryption at rest makes the underlying data itself unreadable, -even if accessed, except to users who have a specific private \textbf{keyfile}. -Encryption at rest requires active participation from you, the user, -and you should be fully aware that if your private key is lost, -there is absolutely no way to recover your data. 
- -To enable data encryption in SurveyCTO, for example, -simply select the encryption option -when you create a new form on a SurveyCTO server. -Other data collection services should work similarly, -but the exact implementation of encryption at rest -will vary slightly from service to service. -At that time, the service will allow you to download -- once -- -the keyfile pair needed to decrypt the data. -You must download and store this in a secure location. -Make sure you store keyfiles with descriptive names to match the survey to which it corresponds. -Any time you access the data - either when viewing it in browser or syncing it to your -computer - the user will be asked to provide this keyfile. It could differ between the software, -but typically you would copy that keyfile from where it is stored (for example LastPass) -to your desktop, point to it, and the rest is automatic. -After each time you use the keyfile, delete it from your desktop, -but not from your password manager. - -Finally, you should ensure that all teams take basic precautions -to ensure the security of data, as most problems are due to human error. -Most importantly, all computers, tablets, and accounts used -\textit{must} have a logon password associated with them. -Ideally, the machine hard drives themselves should also be encrypted. -This policy should also be applied to physical data storage -such as flash drives and hard drives; -similarly, files sent to the field containing PII data -such as the sampling list should at least be password-protected. -This can be done using a zip-file creator. -LastPass can also be used to share passwords securely, -and you cannot share passwords across email. -This step significantly mitigates the risk in case there is -a security breach such as loss, theft, hacking, or a virus, -and adds very little hassle to utilization. - - -In this section, you finally get your hands on some data! -What do we do with it? Data handling is one of the biggest -``black boxes'' in primary research -- it always gets done, -but teams have wildly different approaches for actually doing it. -This section breaks the process into key conceptual steps -and provides at least one practical solution for each. -Initial receipt of data will proceed as follows: -the data will be downloaded, and a ``gold master'' copy -of the raw data should be permanently stored in a secure location. -Then, a ``master'' copy of the data is placed into an encrypted location -that will remain accessible on disk and backed up. -This handling satisfies the rule of three: -there are two on-site copies of the data and one off-site copy, -so the data can never be lost in case of hardware failure. -For this step, the remote location can be a variety of forms: -the cheapest is a long-term cloud storage service -such as Amazon Web Services or Microsoft Azure. -Equally sufficient is a physical hard drive -stored somewhere other than the primary work location -(and encrypted with a service like BitLocker To Go). -If you remain lucky, you will never have to access this copy -- -you just want to know it is out there, safe, if you need it. - -The copy of the raw data you are going to use -should be handled with care. -Since you will probably need to share it among the team, -it should be placed in an encrypted storage location, -although the data file itself may or may not need to be encrypted. -Enterprise cloud solutions like Microsoft OneDrive can work as well. 
-If the service satisfies your security needs, -the raw data can be stored unencrypted here. -Placing encrypted data (such as with VeraCrypt) -into an unencrypted cloud storage location (such as Dropbox) -may also satisfy this requirement for some teams, -since this will never make the data visible to someone -who gets access to the Dropbox, -without the key to the file that is generated on encryption. -\textit{The raw data file must never be placed in Dropbox unencrypted, however.} -The way VeraCrypt works is that it creates a virtual copy -of the unencrypted file outside of Dropbox, and lets you access that copy. -Since you should never edit the raw data, this will not be very cumbersome, -but the decryption and usage of the raw data is a manual process. -With the raw data securely stored and backed up, -you are ready to move to de-identification, data cleaning, and analysis. +\subsubsection{response quality} +As discussed above, modern survey software makes it relatively easy to control for issues in individual surveys as part of the questionnaire programming, using a combination of in-built features such as hard constraints on answer ranges and soft confirmations or validation questions. These features allow you to spend more time looking for issues that the software cannot check automatically, such as consistency across multiple responses or suspicious timing or resopnse patters from specific enumerators. +\sidenote{\url{https://dimewiki.worldbank.org/wiki/Monitoring_Data_Quality}} + + +\section{Data considerations for field monitoring} +Careful monitoring of field work is essential for high quality data. +\textbf{Back-checks}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Back_Checks}} +and other survey audits help ensure that enumerators are following established protocols, and are not falsifying data. +For back-checks, a random subset of the field sample is chosen and a subset of information from the full survey is verified through a brief interview with the original respondent. Design of the backcheck questionnaire follows the same survey design principles discussed above, in particular you should use the pre-analysis plan or list of key outcomes to establish which subset of variables to prioritize. + +Real-time access to the survey data increases the potential utility of backchecks dramatically, and both simplifies and improves the rigor of related workflows. You can use the raw data to ensure that the backcheck sample is appropriately apportioned across interviews and survey teams. As soon as backchecks are done, the backcheck data can be tested against the original data to identify areas of concern in real-time. +\texttt{bcstats} is a useful tool for analyzing back-check data in Stata module. +\sidenote{url{https://ideas.repec.org/c/boc/bocode/s458173.html}} + %------------------------------------------------ +\section{Collecting Data Securely} +Primary data collection almost always includes +\textbf{personally-identifiable information (PII)} \sidenote{url{https://dimewiki.worldbank.org/wiki/Personally_Identifiable_Information_(PII)}}. +PII must be handled with great care at all points in the data collection and management process, to avoid breaches of confidentiality. +Access to PII must be restricted to team members granted that permission by the applicable Institutional Review Board or a data licensing agreement with a partner agency. Research teams must maintain strict protocols for data security at each stage of the process: data collection, storage, and sharing. 
+ +\subsection{Securing data in the field} +All mainstream data collection software automatically \textbf{encrypt}\sidenote{\textbf{Encryption:} the process of making information unreadable +to anyone without access to a specific deciphering key. +\sidenote{url{https://dimewiki.worldbank.org/wiki/Encryption}} +all data submitted from the field while in transit (i.e., upload or download). As long as you are using an established CAPI software, this step is taken care of. Your data will be encrypted from the time it leaves the device (in tablet-assisted data collation) or your browser (in web data collection), until it reaches the server. + +\subsection{Securing data on the server} +\textbf{Encryption at rest} is the only way to ensure that PII data remains private when it is stored on someone else's server on the open internet. +Encryption makes data files completely unusable without access to a security key specific to that data -- +a higher level of security than password-protection. You must keep your data encrypted on the server whenever PII data is collected. + +Encryption in cloud storage is not enabled by default. The service will not encrypt user data unless you confirm you know how to operate the encryption system and understand the consequences if basic protocols are not followed. +Encryption at rest is different from password-protection: encryption at rest makes the underlying data itself unreadable, even if accessed, except to users who have a specific private \textbf{keyfile}. Encryption at rest requires active participation from you, the user, and you should be fully aware that if your private key is lost, there is absolutely no way to recover your data. + +When you enable encryption, the service will allow you to download -- once -- the keyfile pair needed to decrypt the data. +You must download and store this in a secure location. Make sure you store keyfiles with descriptive names to match the survey to which they correspond. +Any time anyone accesses the data - either when viewing it in the browser or downloading it to your computer - they will be asked to provide the keyfile. + + +\subsection{Securing stored data} +How should you ensure data security once downloaded to a computer? +The workflow for securely receiving and storing data looks like this: +\begin{itemize} + \item download data + \item store a ``master'' copy of the data into an encrypted location that will remain accessible on disk and be regularly backed up + \item secure a ``gold master'' copy of the raw data in a secure location, such as a long-term cloud storage service or an encrypted physical hard drive stored in a separate location. If you remain lucky, you will never have to access this copy -- you just want to know it is out there, safe, if you need it. + +\end{itemize} + +This handling satisfies the rule of three: there are two on-site copies of the data and one off-site copy, so the data can never be lost in case of hardware failure. In addition, you should ensure that all teams take basic precautions to ensure the security of data, as most problems are due to human error. +Most importantly, all computers, tablets, and accounts used \textit{must} have a logon password associated with them. +Ideally, the machine hard drives themselves should also be encrypted, as well as any external hard drives or flash drives used. All files sent to the field containing PII data, such as sampling lists, should at least be password protected. This can be done using a zip-file creator. 
+You must never share passwords by email; rather, use a secure password manager. This significantly mitigates the risk in case there is a security breach such as loss, theft, hacking, or a virus, with little impact on day-to-day utilization. +With the raw data securely stored and backed up, you are ready to move to de-identification, data cleaning, and analysis. + +%------------------------------------------------ From 09a4dcf12c2c4a30e03e87c6d1c53b5aecf70bb6 Mon Sep 17 00:00:00 2001 From: Maria Date: Fri, 10 Jan 2020 15:22:48 -0500 Subject: [PATCH 168/854] Ch5 re-write Formatting edits --- chapters/data-collection.tex | 78 ++++++++++++++++++------------------ 1 file changed, 38 insertions(+), 40 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 67cfd4f24..4d2ad1090 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -22,7 +22,7 @@ \section{Designing CAPI questionnaires} An easy-to-read paper version of the questionnaire is particularly critical for training enumerators, so they can get an overview of the survey content and structure before diving into the programming. It is much easier for enumerators to understand all possible response pathways from a paper version, than from swiping question by question. Finalizing the questionnaire before programming also avoids version control concerns that arise from concurrent work on paper and electronic survey instruments. In addition, a paper questionnaire is an important documentation for data publication. The workflow for designing a questionnaire will feel much like writing an essay, or writing pseudocode: begin from broad concepts and slowly flesh out the specifics. It is essential to start with a clear understanding of the -\textbf{theory of change} \sidenote{url{https://dimewiki.worldbank.org/wiki/Theory_of_Change}} +\textbf{theory of change} \sidenote{\url{https://dimewiki.worldbank.org/wiki/Theory_of_Change}} and \textbf{experimental design} for your project.The first step of questionnaire design is to list key outcomes of interest, as well as the main covariates and variables needed for experimental design. The ideal starting point for this is a \textbf{pre-analysis plan}. \sidenote{\url{https://dimewiki.worldbank.org/wiki/Pre-Analysis_Plan}} @@ -31,34 +31,33 @@ \section{Designing CAPI questionnaires} Each module should then be expanded into specific indicators to observe in the field. \sidenote{\url{https://dimewiki.worldbank.org/wiki/Literature_Review_for_Questionnaire}} At this point, it is useful to do a -\textbf{content-focused pilot} \sidenote{url{https://dimewiki.worldbank.org/wiki/Piloting_Survey_Content}} of the questionnaire. +\textbf{content-focused pilot} \sidenote{\url{https://dimewiki.worldbank.org/wiki/Piloting_Survey_Content}} of the questionnaire. Doing this pilot with a pen-and-paper questionnaire encourages more significant revisions, as there is no need to factor in costs of re-programming, and as a result improves the overall quality of the survey instrument. \subsection{Questionnaire design considerations for quantitative analysis} This book covers surveys designed to yield datasets useful for quantitative analysis. This is a subset of surveys, and there are specific design considerations that will help to ensure the raw data outputs are ready for analysis. 
-\subsubsection{Coded response options} +\textit{Coded response options:} From a data perspective, questions with pre-coded response options are always preferable to open-ended questions (the content-based pilot is an excellent time to ask open-ended questions, and refine responses for the final version of the questionnaire). Coding responses helps to ensure that the data will be useful for quantitative analysis. Two examples help illustrate the point. First, instead of asking ``How do you feel about the proposed policy change?'', use techniques like \textbf{Likert scales}\sidenote{\textbf{Likert scale:} an ordered selection of choices indicating the respondent's level of agreement or disagreement with a proposed statement.}. Second, if collecting data on medication use or supplies, you could collect: the brand name of the product; the generic name of the product; the coded compound of the product; or the broad category to which each product belongs (antibiotic, etc.). All four may be useful for different reasons, but the latter two are likely to be the most useful for data analysis. The coded compound requires providing a translation dictionary to field staff, but enables automated rapid recoding for analysis with no loss of information. The generic class requires agreement on the broad categories of interest, but allows for much more comprehensible top-line statistics and data quality checks. -\subsubsection{Sample tracking} +\textit{Sample tracking:} \textbf{Extensive tracking} sections - in which reasons for \textbf{attrition}, treatment \textbf{contamination}, and \textbf{loss to follow-up} are documented - \index{attrition}\index{contamination} -are essential data components for completing CONSORT -\sidenote[][-3.5cm]{\textbf{CONSORT:} a standardized system for reporting enrollment, intervention allocation, follow-up, and data analysis through the phases of a randomized trial of two groups.} -records. -\sidenote[][-1.5cm]{Begg, C., Cho, M., Eastwood, S., Horton, R., Moher, D., Olkin, I., Pitkin, R., Rennie, D., Schulz, K. F., Simel, D., et al. (1996). Improving the quality of reporting of randomized controlled trials: The CONSORT statement. \textit{JAMA}, 276(8):637--639} +are essential data components for completing CONSORT records. +\sidenote[][-3.5cm]{\textbf{CONSORT:} a standardized system for reporting enrollment, intervention allocation, follow-up, and data analysis through the phases of a randomized trial of two groups. Begg, C., Cho, M., Eastwood, S., Horton, R., Moher, D., Olkin, I., Pitkin, R., Rennie, D., Schulz, K. F., Simel, D., et al. (1996). Improving the quality of reporting of randomized controlled trials: The CONSORT statement. \textit{JAMA}, 276(8):637--639} -\subsubsection{How to name questions} +\textit{How to name questions:} % needs update There is not yet a full consensus over how individual questions should be identified: formats like \texttt{hq\_1} are hard to remember and unpleasant to reorder, but formats like \texttt{hq\_asked\_about\_loans} quickly become cumbersome. -\subsubsection{importance of a unique ID} +\textit{importance of a unique ID:} When ID matching and tracking across rounds is essential, the survey should be prepared to verify new data against \textbf{preloaded data} from master records or from other rounds. +\end{itemize} Once the content of the questionnaire is finalized, it should be translated into appropriate language(s). Only then is it time to proceed with the CAPI programming. 
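As a concrete illustration of the coded-response point above, the short Stata sketch below contrasts a coded Likert-type question, which arrives as a labeled integer that is immediately usable, with an open-ended version of the same question, which would arrive as free text and require manual recoding. This is a minimal, purely hypothetical example: the variable and label names are invented for illustration.

\begin{verbatim}
* Toy data: a coded Likert-type response (all names are hypothetical)
clear
set obs 100
generate policy_view = ceil(runiform() * 5)

label define agree_lbl 1 "Strongly disagree" 2 "Disagree" 3 "Neutral" ///
    4 "Agree" 5 "Strongly agree"
label values policy_view agree_lbl

* Coded responses are immediately usable for tabulation and analysis
tabulate policy_view
generate policy_support = (policy_view >= 4)

* An open-ended version of the same question would instead arrive as
* free text (e.g. "I mostly agree", "no opinion") and would need to be
* cleaned and recoded by hand before any analysis.
\end{verbatim}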
@@ -66,22 +65,23 @@ \subsubsection{importance of a unique ID} %------------------------------------------------ \section{Programming CAPI questionnaires} -\sidenote{\url{https://dimewiki.worldbank.org/wiki/Questionnaire_Programming}} - Most data collection is now done using software tools specifically designed for surveys. CAPI surveys \sidenote{\url{https://dimewiki.worldbank.org/wiki/Computer-Assisted_Personal_Interviews_(CAPI)}} -are typically created in a spreadsheet (e.g. Excel or Google Sheets), or software-specific form builder. As these are typically accessible even to novice users, we will not address software-specific form design in this book. Rather, we focus coding conventions that are important to follow regardless of CAPI software choice, rather than software-specific form design. +are typically created in a spreadsheet (e.g. Excel or Google Sheets), or software-specific form builder. +\sidenote{\url{https://dimewiki.worldbank.org/wiki/Questionnaire_Programming}} +As these are typically accessible even to novice users, we will not address software-specific form design in this book. Rather, we focus coding conventions that are important to follow regardless of CAPI software choice, rather than software-specific form design. \sidenote{\url{https://dimewiki.worldbank.org/wiki/SurveyCTO_Coding_Practices}} -CAPI software tools provide a wide range of features designed to make implementing even highly complex surveys easy, scalable, and secure. However, these are not fully automatic: you still need to actively design and manage the survey. Each software has specific practices that you need to follow to take advantage of CAPI features, encrypt you data, and ensure that the exported data is compatible with the software that will be used for analysis. +CAPI software tools provide a wide range of features designed to make implementing even highly complex surveys easy, scalable, and secure. However, these are not fully automatic: you still need to actively design and manage the survey. Each software has specific practices that you need to follow to take advantage of CAPI features and ensure that the exported data is compatible with the software that will be used for analysis. -\section{CAPI features} -\subsection{uniquely identified data} +\subsection{CAPI features} +\begin{itemize} +\item{Unique Identifier} As it is critical to be able to uniquely identify each observation, and link it to the original sample, build in a programming check to confirm the numeric ID and identifying details of the household. +\item{etc} +\end{itemize} -\section{Data encryption} - -\subsection{compatibility with data analysis software} +\subsection{Compatibility with analysis software} The \texttt{ietestform} command,\sidenote{\url{https://dimewiki.worldbank.org/wiki/ietestform}} part of \texttt{iefieldkit}, implements form-checking routines for \textbf{SurveyCTO}, a proprietary implementation of \textbf{Open Data Kit (ODK)}. 
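To make the form-checking step concrete, a minimal sketch of how \texttt{ietestform} might be run is shown below. The file paths are placeholders, and the call reflects our reading of the command's documented syntax; confirm the exact options in the \texttt{ietestform} help file and on the DIME Wiki before relying on it.

\begin{verbatim}
* Install iefieldkit, which contains ietestform (available from SSC)
ssc install iefieldkit, replace

* Check a SurveyCTO form for programming and export-compatibility issues.
* Paths are placeholders; see "help ietestform" for the full option list.
ietestform using "questionnaire/baseline_form.xlsx", ///
    report("questionnaire/baseline_form_report.csv")
\end{verbatim}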
@@ -96,8 +96,8 @@ \subsection{version control}
 %------------------------------------------------
 \section{Piloting}
 \textbf{piloted}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Survey_Pilot}}
-https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Content)&printable=yes
-https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Data)&printable=yes
+\url{https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Content)&printable=yes}
+\url{https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Data)&printable=yes}

 %------------------------------------------------

@@ -109,14 +109,12 @@ \section{Data quality assurance}
 \sidenote{\url{https://github.com/PovertyAction/high-frequency-checks/wiki}}
 Ensuring high quality data requires a combination of both real-time data checks and field monitoring. For this book, we focus on high-frequency data checks, and specific data-related considerations for field monitoring.

-\subsection{High-frequency checks}
 High-frequency checks should carefully inspect key treatment and outcome variables so that the data quality of core experimental variables is uniformly high,
 and that additional field effort is centered where it is most important. Data quality checks should be run on the data every time it is downloaded (ideally on a daily basis), to flag irregularities in survey progress, sample completeness, or response quality.
 \texttt{ipacheck}\sidenote{\url{https://github.com/PovertyAction/high-frequency-checks}}
 is a very useful command that automates some of these tasks.

-\subsubsection{Sample completeness}
+\subsection{Sample completeness}
 It is important to check every day that the households interviewed match the survey sample. Reporting errors and duplicate observations in real-time allows the field team to make corrections efficiently.
 \sidenote{\url{https://dimewiki.worldbank.org/wiki/Duplicates_and_Survey_Logs}}
 It also helps the team track attrition, so that it is clear early on if a change in protocols or additional tracking will be needed. It is also important to check interview progress and sample compliance by surveyor and survey team, to identify any under-performing individuals or teams.
@@ -148,12 +146,12 @@ \subsubsection{Sample completeness}
 and how they were implemented and resolved.

-\subsubsection{response quality}
+\subsection{Response quality}
 As discussed above, modern survey software makes it relatively easy to control for issues in individual surveys as part of the questionnaire programming, using a combination of in-built features such as hard constraints on answer ranges and soft confirmations or validation questions. These features allow you to spend more time looking for issues that the software cannot check automatically, such as consistency across multiple responses or suspicious timing or response patterns from specific enumerators.
 \sidenote{\url{https://dimewiki.worldbank.org/wiki/Monitoring_Data_Quality}}

-\section{Data considerations for field monitoring}
+\subsection{Data considerations for field monitoring}
 Careful monitoring of field work is essential for high quality data.
 \textbf{Back-checks}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Back_Checks}}
 and other survey audits help ensure that enumerators are following established protocols, and are not falsifying data.
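Before turning to back-checks in more detail, the sketch below illustrates the daily sample-completeness checks described above using only core Stata. The file paths and variable names (\texttt{hhid}, \texttt{enumerator}, \texttt{submissiondate}) are hypothetical, and commands such as \texttt{ieduplicates} and \texttt{ipacheck} automate fuller versions of these steps.

\begin{verbatim}
* Load today's download of the raw survey data (path is a placeholder)
use "raw/survey_download.dta", clear

* 1) Flag duplicate IDs (ieduplicates provides a full resolution workflow)
duplicates tag hhid, generate(dup_id)
list hhid submissiondate if dup_id > 0

* 2) Validate observations against the expected sample list
*    (run once duplicates are resolved, so that hhid is unique)
merge 1:1 hhid using "master/sample_list.dta"
list hhid if _merge == 1    // submissions not in the sample: investigate
tabulate _merge             // _merge == 2 are sampled units not yet surveyed

* 3) Track progress and completeness by enumerator
tabulate enumerator _merge, missing
\end{verbatim}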
@@ -161,21 +159,21 @@ \section{Data considerations for field monitoring}
 Real-time access to the survey data increases the potential utility of backchecks dramatically, and both simplifies and improves the rigor of related workflows. You can use the raw data to ensure that the backcheck sample is appropriately apportioned across interviews and survey teams. As soon as backchecks are done, the backcheck data can be tested against the original data to identify areas of concern in real-time.
 \texttt{bcstats} is a useful Stata command for analyzing back-check data.
-\sidenote{url{https://ideas.repec.org/c/boc/bocode/s458173.html}}
+\sidenote{\url{https://ideas.repec.org/c/boc/bocode/s458173.html}}

 %------------------------------------------------
 \section{Collecting Data Securely}
 Primary data collection almost always includes
-\textbf{personally-identifiable information (PII)} \sidenote{url{https://dimewiki.worldbank.org/wiki/Personally_Identifiable_Information_(PII)}}.
-PII must be handled with great care at all points in the data collection and management process, to avoid breaches of confidentiality.
-Access to PII must be restricted to team members granted that permission by the applicable Institutional Review Board or a data licensing agreement with a partner agency. Research teams must maintain strict protocols for data security at each stage of the process: data collection, storage, and sharing.
+\textbf{personally-identifiable information (PII)}
+\sidenote{\url{https://dimewiki.worldbank.org/wiki/Personally_Identifiable_Information_(PII)}}.
+PII must be handled with great care at all points in the data collection and management process, to comply with ethical requirements and avoid breaches of confidentiality. Access to PII must be restricted to team members granted that permission by the applicable Institutional Review Board or a data licensing agreement with a partner agency. Research teams must maintain strict protocols for data security at each stage of the process: data collection, storage, and sharing.

 \subsection{Securing data in the field}
-All mainstream data collection software automatically \textbf{encrypt}
-\sidenote{\textbf{Encryption:} the process of making information unreadable to anyone without access to a specific deciphering key.
-\sidenote{url{https://dimewiki.worldbank.org/wiki/Encryption}}
-all data submitted from the field while in transit (i.e., upload or download). As long as you are using an established CAPI software, this step is taken care of. Your data will be encrypted from the time it leaves the device (in tablet-assisted data collation) or your browser (in web data collection), until it reaches the server.
+All mainstream data collection software automatically \textbf{encrypts}
+\sidenote{\textbf{Encryption:} the process of making information unreadable to anyone without access to a specific deciphering key.}
+\sidenote{\url{https://dimewiki.worldbank.org/wiki/Encryption}}
+all data submitted from the field while in transit (i.e., upload or download). Your data will be encrypted from the time it leaves the device (in tablet-assisted data collection) or your browser (in web data collection), until it reaches the server. Therefore, as long as you are using established CAPI software, this step is largely taken care of. However, the research team must ensure that all computers, tablets, and accounts used have a logon password and are never left unlocked.
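Returning briefly to the back-check comparison described above: before handing the task to \texttt{bcstats}, it can help to see how simple the underlying logic is. The sketch below is a minimal core-Stata version; all file and variable names are hypothetical, and \texttt{bcstats} implements a far more complete comparison that accounts for different question types.

\begin{verbatim}
* Keep the original responses for the variables selected for back-checking
use "raw/survey_data.dta", clear
keep hhid enumerator hh_size own_phone
rename (hh_size own_phone) (hh_size_orig own_phone_orig)

* Merge in the back-check interviews (placeholder file name)
merge 1:1 hhid using "raw/backcheck_data.dta", keep(match) nogenerate

* Flag discrepancies between the original interview and the back-check
generate byte flag_hh_size   = (hh_size   != hh_size_orig)
generate byte flag_own_phone = (own_phone != own_phone_orig)

* Error rates overall and by the enumerator of the original interview
summarize flag_*
tabulate enumerator, summarize(flag_hh_size)
\end{verbatim}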
\subsection{Securing data on the server} \textbf{Encryption at rest} is the only way to ensure that PII data remains private when it is stored on someone else's server on the open internet. @@ -193,15 +191,15 @@ \subsection{Securing data on the server} \subsection{Securing stored data} How should you ensure data security once downloaded to a computer? The workflow for securely receiving and storing data looks like this: -\begin{itemize} - \item download data - \item store a ``master'' copy of the data into an encrypted location that will remain accessible on disk and be regularly backed up - \item secure a ``gold master'' copy of the raw data in a secure location, such as a long-term cloud storage service or an encrypted physical hard drive stored in a separate location. If you remain lucky, you will never have to access this copy -- you just want to know it is out there, safe, if you need it. + +\begin{enumerate} + \item Download data + \item Store a ``master'' copy of the data into an encrypted location that will remain accessible on disk and be regularly backed up + \item Secure a ``gold master'' copy of the raw data in a secure location, such as a long-term cloud storage service or an encrypted physical hard drive stored in a separate location. If you remain lucky, you will never have to access this copy -- you just want to know it is out there, safe, if you need it. -\end{itemize} +\end{enumerate} This handling satisfies the rule of three: there are two on-site copies of the data and one off-site copy, so the data can never be lost in case of hardware failure. In addition, you should ensure that all teams take basic precautions to ensure the security of data, as most problems are due to human error. -Most importantly, all computers, tablets, and accounts used \textit{must} have a logon password associated with them. Ideally, the machine hard drives themselves should also be encrypted, as well as any external hard drives or flash drives used. All files sent to the field containing PII data, such as sampling lists, should at least be password protected. This can be done using a zip-file creator. You must never share passwords by email; rather, use a secure password manager. This significantly mitigates the risk in case there is a security breach such as loss, theft, hacking, or a virus, with little impact on day-to-day utilization. From d921c32fd3ae8e89ad4f22fe04c51967d1af9d79 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Mon, 13 Jan 2020 10:01:56 -0500 Subject: [PATCH 169/854] [ch 1] fix comma, was to the left of sidenote number superscript --- chapters/handling-data.tex | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 0f90bbf49..9e7be7d2f 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -183,9 +183,8 @@ \subsection{Research transparency} There are various software solutions for building documentation over time. The \textbf{Open Science Framework}\sidenote{\url{https://osf.io/}} provides one such solution,\index{Open Science Framework} with integrated file storage, version histories, and collaborative wiki pages. 
-\textbf{GitHub}\sidenote{\url{https://github.com}} provides a transparent documentation system,\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Getting_started_with_GitHub}} - \index{task management}\index{GitHub} +\textbf{GitHub}\sidenote{\url{https://github.com}} provides a transparent documentation system\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Getting_started_with_GitHub}},\index{task management}\index{GitHub} in addition to version histories and wiki pages. Such services offers multiple different ways to record the decision process leading to changes and additions, From dab72b0c2471d3420e38e7c397e76eb5d68937ec Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Mon, 13 Jan 2020 10:05:08 -0500 Subject: [PATCH 170/854] [ch 1] remove extra commas and move one comma --- chapters/handling-data.tex | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 9e7be7d2f..5ac304bf8 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -212,9 +212,9 @@ \subsection{Research credibility} \url{https://dimewiki.worldbank.org/wiki/Pre-Registration}} simply to create a record of the fact that the study was undertaken. This is increasingly required by publishers and can be done very quickly -using the \textbf{AEA} database,\sidenote{\url{https://www.socialscienceregistry.org/}} -the \textbf{3ie} database,\sidenote{\url{http://ridie.3ieimpact.org/}}, -the \textbf{eGAP} database,\sidenote{\url{http://egap.org/content/registration/}}, +using the \textbf{AEA} database\sidenote{\url{https://www.socialscienceregistry.org/}}, +the \textbf{3ie} database\sidenote{\url{http://ridie.3ieimpact.org/}}, +the \textbf{eGAP} database\sidenote{\url{http://egap.org/content/registration/}}, or the \textbf{OSF} registry\sidenote{\url{https://osf.io/registries}} as appropriate.\sidenote{\url{http://datacolada.org/12}} \index{pre-registration} From 48ec75b09bd5f38115177d154addd26f481787fd Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Mon, 13 Jan 2020 10:18:15 -0500 Subject: [PATCH 171/854] [ch 1] : remove double space. related to #78 --- chapters/handling-data.tex | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 5ac304bf8..778bc5af5 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -254,10 +254,11 @@ \subsection{Research credibility} \section{Ensuring privacy and security in research data} Anytime you are collecting primary data in a development research project, -you are almost certainly handling data that include \textbf{personally-identifying information (PII)}.\sidenote{ - \textbf{Personally-identifying information:} any piece or set of information that can be used to identify an individual research subject. - \url{https://dimewiki.worldbank.org/wiki/De-identification\#Personally\_Identifiable\_Information}} - \index{personally-identifying information}\index{primary data} +you are almost certainly handling data that include \textbf{personally-identifying +information (PII)}\index{personally-identifying information}\index{primary data}\sidenote{ +\textbf{Personally-identifying information:} any piece or set of information +that can be used to identify an individual research subject. +\url{https://dimewiki.worldbank.org/wiki/De-identification\#Personally\_Identifiable\_Information}}. 
PII data contains information that can, without any transformation, be used to identify individual people, households, villages, or firms that were included in \textbf{data collection}. \index{data collection} From 16857d76e831452c81c7a6219acce2830a8b6843 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Mon, 13 Jan 2020 11:52:49 -0500 Subject: [PATCH 172/854] [ch 2] removing excessive spaces - #78 --- chapters/planning-data-work.tex | 12 ++++-------- 1 file changed, 4 insertions(+), 8 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 08cb5b075..cc2a9b5c7 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -162,8 +162,7 @@ \subsection{Documenting decisions and tasks} you need to decide how you are going to communicate with your team. The first habit that many teams need to break is using instant communication for management and documentation. -Email is, simply put, not a system. It is not a system for anything. Neither is WhatsApp. - \index{email} \index{WhatsApp} +Email is, simply put, not a system. It is not a system for anything. Neither is WhatsApp.\index{email}\index{WhatsApp} These tools are developed for communicating ``now'' and this is what they do well. They are not structured to manage group membership or to present the same information across a group of people, or to remind you when old information becomes relevant. @@ -336,8 +335,7 @@ \subsection{Organizing files and folder structures} more importantly, ensure that your code files are always able to run on any machine. To support consistent folder organization, DIME Analytics maintains \texttt{iefolder}\sidenote{ \url{https://dimewiki.worldbank.org/wiki/iefolder}} -as a part of our \texttt{ietoolkit} package. - \index{\texttt{iefolder}} \index{\texttt{ietoolkit}} +as a part of our \texttt{ietoolkit} package.\index{\texttt{iefolder}}\index{\texttt{ietoolkit}} This Stata command sets up a pre-standardized folder structure for what we call the \texttt{DataWork} folder.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/DataWork_Folder}} @@ -386,12 +384,10 @@ \subsection{Organizing files and folder structures} when the syncing utility operates on them, and vice versa.) Nearly all code files and raw outputs (not datasets) are best managed this way. This is because code files are usually \textbf{plaintext} files, -and non-technical files are usually \textbf{binary} files. - \index{plaintext}\index{binary files} +and non-technical files are usually \textbf{binary} files.\index{plaintext}\index{binary files} It's also becoming more and more common for written outputs such as reports, presentations and documentations to be written using plaintext -tools such as {\LaTeX} and dynamic documents. - \index{{\LaTeX}}\index{dynamic documents} +tools such as {\LaTeX} and dynamic documents.\index{{\LaTeX}}\index{dynamic documents} Keeping such plaintext files in a version-controlled folder allows you to maintain better control of their history and functionality. 
Because of the high degree of dependence between code files depend and file structure, From d94e1296eac6fd47770b2de16bdfd1051ce1fe2a Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Mon, 13 Jan 2020 14:39:42 -0500 Subject: [PATCH 173/854] [ch 1] remove excessive spaces - #78 --- chapters/handling-data.tex | 12 ++++-------- 1 file changed, 4 insertions(+), 8 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 778bc5af5..382237086 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -56,8 +56,7 @@ \section{Protecting confidence in development research} \url{https://www.aeaweb.org/journals/policies/data-code/}} The empirical revolution in development research -has therefore led to increased public scrutiny of the reliability of research.\cite{rogers_2017} - \index{transparency}\index{credibility}\index{reproducibility} +has therefore led to increased public scrutiny of the reliability of research.\cite{rogers_2017}\index{transparency}\index{credibility}\index{reproducibility} Three major components make up this scrutiny: \textbf{reproducibility}\cite{duvendack2017meant}, \textbf{transparency},\cite{christensen2018transparency} and \textbf{credibility}.\cite{ioannidis2017power} Development researchers should take these concerns seriously. Many development research projects are purpose-built to address specific questions, @@ -145,8 +144,7 @@ \subsection{Research transparency} Tools like pre-registration, pre-analysis plans, and \textbf{Registered Reports}\sidenote{ \url{https://blogs.worldbank.org/impactevaluations/registered-reports-piloting-pre-results-review-process-journal-development-economics}} -can help with this process where they are available. - \index{pre-registration}\index{pre-analysis plans}\index{Registered Reports} +can help with this process where they are available.\index{pre-registration}\index{pre-analysis plans}\index{Registered Reports} By pre-specifying a large portion of the research design,\sidenote{ \url{https://www.bitss.org/2019/04/18/better-pre-analysis-plans-through-design-declaration-and-diagnosis/}} a great deal of analytical planning has already been completed, @@ -263,8 +261,7 @@ \section{Ensuring privacy and security in research data} individual people, households, villages, or firms that were included in \textbf{data collection}. \index{data collection} This includes names, addresses, and geolocations, and extends to personal information -such as email addresses, phone numbers, and financial information. - \index{geodata}\index{de-identification} +such as email addresses, phone numbers, and financial information.\index{geodata}\index{de-identification} It is important to keep in mind these data privacy principles not only for the individual respondent but also the PII data of their household members or other caregivers who are covered under the survey. \index{privacy} In some contexts this list may be more extensive -- @@ -360,8 +357,7 @@ \subsection{Transmitting and storing data securely} Raw data which contains PII \textit{must} therefoer be \textbf{encrypted}\sidenote{ \textbf{Encryption:} Data storage methods which ensure that accessed files are unreadable if unauthorized access is obtained. \url{https://dimewiki.worldbank.org/wiki/encryption}} -during data collection, storage, and transfer. 
- \index{encryption}\index{data transfer}\index{data storage} +during data collection, storage, and transfer.\index{encryption}\index{data transfer}\index{data storage} The biggest security gap is often in transmitting survey plans to and from staff in the field, since staff with technical specialization are usually in an HQ office. To protect information in transit to field staff, some key steps are: From e4f9c46ec737fd625192fa385eed7aec4b0804f0 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Mon, 13 Jan 2020 14:40:04 -0500 Subject: [PATCH 174/854] [ch 1] - typos --- chapters/handling-data.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 382237086..2ff13c14c 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -354,7 +354,7 @@ \subsection{Transmitting and storing data securely} inside that secure environment if multiple users share accounts. However, password-protection alone is not sufficient, because if the underlying data is obtained through a leak the information itself remains usable. -Raw data which contains PII \textit{must} therefoer be \textbf{encrypted}\sidenote{ +Raw data which contains PII \textit{must} therefore be \textbf{encrypted}\sidenote{ \textbf{Encryption:} Data storage methods which ensure that accessed files are unreadable if unauthorized access is obtained. \url{https://dimewiki.worldbank.org/wiki/encryption}} during data collection, storage, and transfer.\index{encryption}\index{data transfer}\index{data storage} @@ -427,7 +427,7 @@ \subsection{De-identifying and anonymizing information} There should only be one raw identified dataset copy and it should be somewhere where only approved people can access it. Finally, not everyone on the research team needs access to identified data. -Analysis that required PII data is rare +Analysis that requires PII data is rare and can be avoided by properly linking identifiers to research information such as treatment statuses and weights, then removing identifiers. From d1e12f103fe5755711fb8b08348e1f18e6209eb4 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Mon, 13 Jan 2020 14:40:59 -0500 Subject: [PATCH 175/854] [ch 1] new paragraph, not use PII is a method that deserves its own para --- chapters/handling-data.tex | 1 + 1 file changed, 1 insertion(+) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 2ff13c14c..d3566a8cc 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -375,6 +375,7 @@ \subsection{Transmitting and storing data securely} the files that would be obtained would be useless to the recipient. In security parlance this person is often referred to as an ``intruder'' but it is rare that data breaches are nefarious or even intentional. + The easiest way to protect personal information is not to use it. 
It is often very simple to conduct planning and analytical work using a subset of the data that has anonymous identifying ID variables, From b6bfee313ee19026596a670e2a78a579d93e5858 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Mon, 13 Jan 2020 14:43:00 -0500 Subject: [PATCH 176/854] [ch 1] - highlights in intro the two main topics in this chapter --- chapters/handling-data.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index d3566a8cc..cfff83b80 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -11,7 +11,7 @@ personal lives, financial conditions, and other sensitive subjects. The rights and responsibilities involved in having such access to personal information are a core responsibility of collecting personal data. -Ethical scrutiny involves two major components: data handling and research transparency. +Ethical scrutiny involves two major components: \textbf{data handling} and \textbf{research transparency}. Performing at a high standard in both means that consumers of research can have confidence in its conclusions, and that research participants are appropriately protected. From 3828d278c959a952a2ac7d37fc06e41912253713 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Mon, 13 Jan 2020 14:55:46 -0500 Subject: [PATCH 177/854] [ch 1] - encryption edits --- chapters/handling-data.tex | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index cfff83b80..6cbbfa721 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -355,14 +355,14 @@ \subsection{Transmitting and storing data securely} However, password-protection alone is not sufficient, because if the underlying data is obtained through a leak the information itself remains usable. Raw data which contains PII \textit{must} therefore be \textbf{encrypted}\sidenote{ - \textbf{Encryption:} Data storage methods which ensure that accessed files are unreadable if unauthorized access is obtained. + \textbf{Encryption:} Methods which ensure that files are unreadable even if laptops are stolen, databases are hacked or any other type of unauthorized access is obtained. \url{https://dimewiki.worldbank.org/wiki/encryption}} during data collection, storage, and transfer.\index{encryption}\index{data transfer}\index{data storage} The biggest security gap is often in transmitting survey plans to and from staff in the field, since staff with technical specialization are usually in an HQ office. To protect information in transit to field staff, some key steps are: -(a) to ensure that all devices have hard drive encryption and password-protection; -(b) that no PII information is sent over e-mail (use a secure sync drive instead); +(a) to ensure that all devices that store PII data have hard drive encryption and password-protection; +(b) that no PII information is sent over e-mail, WhatsApp etc. without encrypting the information first; and (c) all field staff receive adequate training on the privacy standards applicable to their work. 
Most modern data collection software has features that, From 7df24ead8a3b0eb7dfdc1d7f0b0fd281fa177861 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 13 Jan 2020 15:01:06 -0500 Subject: [PATCH 178/854] Oxford commas --- chapters/handling-data.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 6cbbfa721..4cfdabb2f 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -355,14 +355,14 @@ \subsection{Transmitting and storing data securely} However, password-protection alone is not sufficient, because if the underlying data is obtained through a leak the information itself remains usable. Raw data which contains PII \textit{must} therefore be \textbf{encrypted}\sidenote{ - \textbf{Encryption:} Methods which ensure that files are unreadable even if laptops are stolen, databases are hacked or any other type of unauthorized access is obtained. + \textbf{Encryption:} Methods which ensure that files are unreadable even if laptops are stolen, databases are hacked, or any other type of unauthorized access is obtained. \url{https://dimewiki.worldbank.org/wiki/encryption}} during data collection, storage, and transfer.\index{encryption}\index{data transfer}\index{data storage} The biggest security gap is often in transmitting survey plans to and from staff in the field, since staff with technical specialization are usually in an HQ office. To protect information in transit to field staff, some key steps are: (a) to ensure that all devices that store PII data have hard drive encryption and password-protection; -(b) that no PII information is sent over e-mail, WhatsApp etc. without encrypting the information first; +(b) that no PII information is sent over e-mail, WhatsApp, etc. without encrypting the information first; and (c) all field staff receive adequate training on the privacy standards applicable to their work. Most modern data collection software has features that, From ec01c64092a2065724e8380c295840ba50f6c322 Mon Sep 17 00:00:00 2001 From: Maria Date: Mon, 13 Jan 2020 17:03:59 -0500 Subject: [PATCH 179/854] Ch5 re-write Added survey piloting --- chapters/data-collection.tex | 69 ++++++++++++++++-------------------- 1 file changed, 30 insertions(+), 39 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 4d2ad1090..6b46b0ec4 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -14,7 +14,8 @@ %------------------------------------------------ \section{Designing CAPI questionnaires} -A well-designed questionnaire results from careful planning, consideration of analysis and indicators, close review of existing questionnaires, survey pilots, and research team and stakeholder review.Although most surveys are now collected electronically -- Computer Assisted Personal Interviews (CAPI) -- +A well-designed questionnaire results from careful planning, consideration of analysis and indicators, close review of existing questionnaires, survey pilots, and research team and stakeholder review. +Although most surveys are now collected electronically (often referred to as Computer Assisted Personal Interviews (CAPI)) -- \textbf{Questionnaire design}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Questionnaire_Design}} \index{questionnaire design} (content development) and questionnaire programming (functionality development) should be seen as two strictly separate tasks. 
By focusing on content first and programming implementation later, the survey design quality is better than when the questionnaire is set up in a way which is technically convenient to program. The research team should agree on all questionnaire content and design a paper version before programming a CAPI version. This facilitates a focus on content during the design process and ensures teams have a readable, printable paper version of their questionnaire. @@ -57,25 +58,31 @@ \subsection{Questionnaire design considerations for quantitative analysis} When ID matching and tracking across rounds is essential, the survey should be prepared to verify new data against \textbf{preloaded data} from master records or from other rounds. -\end{itemize} +\subsection{Content-focused Pilot} +A \textbf{Survey Pilot} \sidenote{\url{https://dimewiki.worldbank.org/wiki/Survey_Pilot}} is the final step of questionnaire design. +A Content-focused Pilot \sidenote{\url{https://dimewiki.worldbank.org/wiki/Piloting_Survey_Content}} is best done on pen-and-paper, before the questionnaire is programmed. +The objective is to improve the structure and length of the questionnaire, refine the phrasing and translation of specific questions, and confirm coded response options are exhaustive.\sidenote{\url{https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Content)&printable=yes}} +In addition, it is an opportunity to test and refine all survey protocols, such as how units will be sampled or pre-selected units identified. The pilot must be done out-of-sample, but in a context as similar as possible to the study sample. -Once the content of the questionnaire is finalized, it should be translated into appropriate language(s). Only then is it time to proceed with the CAPI programming. +Once the content of the questionnaire is finalized and translated, it is time to proceed with programming the electronic survey instrument. %------------------------------------------------ \section{Programming CAPI questionnaires} -Most data collection is now done using software tools specifically designed for surveys. CAPI surveys -\sidenote{\url{https://dimewiki.worldbank.org/wiki/Computer-Assisted_Personal_Interviews_(CAPI)}} -are typically created in a spreadsheet (e.g. Excel or Google Sheets), or software-specific form builder. -\sidenote{\url{https://dimewiki.worldbank.org/wiki/Questionnaire_Programming}} -As these are typically accessible even to novice users, we will not address software-specific form design in this book. Rather, we focus coding conventions that are important to follow regardless of CAPI software choice, rather than software-specific form design. +Most data collection is now done using software tools specifically designed for surveys. CAPI surveys \sidenote{\url{https://dimewiki.worldbank.org/wiki/Computer-Assisted_Personal_Interviews_(CAPI)}} +are typically created in a spreadsheet (e.g. Excel or Google Sheets), or software-specific form builder. \sidenote{\url{https://dimewiki.worldbank.org/wiki/Questionnaire_Programming}} +As these are typically accessible even to novice users, we will not address software-specific form design in this book. Rather, we focus coding conventions that are important to follow regardless of CAPI software choice. \sidenote{\url{https://dimewiki.worldbank.org/wiki/SurveyCTO_Coding_Practices}} +The starting point for questionnaire programming is a complete paper version of the questionnaire, piloted for content, and translated where needed. 
Starting the programming at this point reduces version control issues that arise from making significant changes to concurrent paper and electronic survey instruments. Most importantly, it means the research, not the technology, drives the questionnaire design. When you start programming, do not start with the first question and program your way to the last question. Instead, code from high level to small detail, following the same questionnaire outline established at design phase. The outline provides the basis for pseudocode, allowing you to start with high level structure and work down to the level of individual questions. This will save time and reduce errors. + CAPI software tools provide a wide range of features designed to make implementing even highly complex surveys easy, scalable, and secure. However, these are not fully automatic: you still need to actively design and manage the survey. Each software has specific practices that you need to follow to take advantage of CAPI features and ensure that the exported data is compatible with the software that will be used for analysis. \subsection{CAPI features} \begin{itemize} + \item{Survey logic}: build all skip patterns in to the survey instrument, to prevent enumerator errors related to the flow of questions + \item{Range checks}: \item{Unique Identifier} As it is critical to be able to uniquely identify each observation, and link it to the original sample, build in a programming check to confirm the numeric ID and identifying details of the household. \item{etc} @@ -83,21 +90,15 @@ \subsection{CAPI features} \subsection{Compatibility with analysis software} The \texttt{ietestform} command,\sidenote{\url{https://dimewiki.worldbank.org/wiki/ietestform}} part of -\texttt{iefieldkit}, implements form-checking routines -for \textbf{SurveyCTO}, a proprietary implementation of \textbf{Open Data Kit (ODK)}. -However, since they make extensive use of logical structure and -relate directly to the data that will be used later, -both the field team and the data team should -collaborate to make sure that the survey suits all needs.\cite{krosnick2018questionnaire} +\texttt{iefieldkit}, implements form-checking routines for \textbf{SurveyCTO}, a proprietary implementation of \textbf{Open Data Kit (ODK)}. +However, since they make extensive use of logical structure and relate directly to the data that will be used later, +both the field team and the data team should collaborate to make sure that the survey suits all needs.\cite{krosnick2018questionnaire} -\subsection{version control} + +\subsection{Data-focused Pilot} +The final stage of questionnaire programming is another Survey Pilot. The objective of the Data-focused Pilot \sidenote{\url{https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Data)&printable=yes}} is to validate programming and export a sample dataset. Significant desk-testing of the instrument is required to debug the programming as fully as possible before going to the field. It is important to plan for multiple days of piloting, so that any further debugging or other revisions to the electronic survey instrument can be made at the end of each day and tested the following, until no further field errors arise. 
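Part of this desk-testing can be scripted against the exported test data. For instance, a minimal Stata sketch of two basic checks, using hypothetical variable and file names, could be:

    * Desk-test the data exported from the pilot instrument
    use "pilot_export.dta", clear

    * Every submission must carry a unique household ID
    isid hh_id

    * Spot-check that a programmed range constraint survived the export
    assert inrange(resp_age, 0, 120) if !missing(resp_age)

Checks like these confirm that the instrument exports data in the form the analysis will expect.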
The Data-focused pilot should be done in advance of Enumerator training -%------------------------------------------------ -\section{Piloting} -\textbf{piloted}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Survey_Pilot}} -\url{https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Content)&printable=yes} -\url{https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Data)&printable=yes} %------------------------------------------------ @@ -125,25 +126,14 @@ \subsection{Sample completeness} Next, observed units in the data must be validated against the expected sample: this is as straightforward as \texttt{merging} the sample list with the survey data and checking for mismatches. -When all data collection is complete, -the survey team should have a final field report -ready for validation against the sample frame and the dataset. -This should contain all the observations that were completed; -it should merge perfectly with the received dataset; -and it should report reasons for any observations not collected. -Identification and reporting of \textbf{missing data} and \textbf{attrition} is critical -to the interpretation of any survey dataset. -It is important to structure this reporting in a way that -not only group broads rationales into specific categories -but also collects all the detailed, open-ended responses to questions the field team can provide -for any observations that they were unable to complete. -This reporting should be validated and saved -alongside the final raw data, and treated the same way. -This information should be stored as a dataset in its own right --- a \textbf{tracking dataset} -- -that records all events in which survey substitutions -and loss to follow-up occurred in the field -and how they were implemented and resolved. +When all data collection is complete, the survey team should have a final field report ready for validation against the sample frame and the dataset. +This should contain all the observations that were completed; it should merge perfectly with the received dataset; and it should report reasons for any observations not collected. +Identification and reporting of \textbf{missing data} and \textbf{attrition} is critical to the interpretation of any survey dataset. +It is important to structure this reporting in a way that not only group broads rationales into specific categories +but also collects all the detailed, open-ended responses to questions the field team can provide for any observations that they were unable to complete. +This reporting should be validated and saved alongside the final raw data, and treated the same way. +This information should be stored as a dataset in its own right -- a \textbf{tracking dataset} -- that records all events in which survey substitutions +and loss to follow-up occurred in the field and how they were implemented and resolved. \subsection{response quality} @@ -203,6 +193,7 @@ \subsection{Securing stored data} Ideally, the machine hard drives themselves should also be encrypted, as well as any external hard drives or flash drives used. All files sent to the field containing PII data, such as sampling lists, should at least be password protected. This can be done using a zip-file creator. You must never share passwords by email; rather, use a secure password manager. This significantly mitigates the risk in case there is a security breach such as loss, theft, hacking, or a virus, with little impact on day-to-day utilization. 
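As an illustration only, assuming the free 7-Zip command-line tool is installed, such a password-protected and encrypted archive of a sampling list can be created with a single command:

    7z a -p -mhe=on field_lists.7z sampling_list.xlsx

The -p flag prompts for a password, which should then be shared through the password manager rather than by e-mail, and -mhe=on also hides the file names inside the archive.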
+ With the raw data securely stored and backed up, you are ready to move to de-identification, data cleaning, and analysis. %------------------------------------------------ From 0ace97ea46fea0b3cfd4194c28e394028317f0b5 Mon Sep 17 00:00:00 2001 From: Maria Date: Mon, 13 Jan 2020 17:53:44 -0500 Subject: [PATCH 180/854] Ch5 re-write Additions to questionnaire programming --- chapters/data-collection.tex | 88 +++++++++++++++++------------------- 1 file changed, 41 insertions(+), 47 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 6b46b0ec4..171a16730 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -23,16 +23,14 @@ \section{Designing CAPI questionnaires} An easy-to-read paper version of the questionnaire is particularly critical for training enumerators, so they can get an overview of the survey content and structure before diving into the programming. It is much easier for enumerators to understand all possible response pathways from a paper version, than from swiping question by question. Finalizing the questionnaire before programming also avoids version control concerns that arise from concurrent work on paper and electronic survey instruments. In addition, a paper questionnaire is an important documentation for data publication. The workflow for designing a questionnaire will feel much like writing an essay, or writing pseudocode: begin from broad concepts and slowly flesh out the specifics. It is essential to start with a clear understanding of the -\textbf{theory of change} \sidenote{\url{https://dimewiki.worldbank.org/wiki/Theory_of_Change}} -and \textbf{experimental design} for your project.The first step of questionnaire design is to list key outcomes of interest, as well as the main covariates and variables needed for experimental design. The ideal starting point for this is a -\textbf{pre-analysis plan}. \sidenote{\url{https://dimewiki.worldbank.org/wiki/Pre-Analysis_Plan}} +\textbf{theory of change} \sidenote{\url{https://dimewiki.worldbank.org/wiki/Theory_of_Change}} and \textbf{experimental design} for your project. +The first step of questionnaire design is to list key outcomes of interest, as well as the main covariates and variables needed for experimental design. +The ideal starting point for this is a \textbf{pre-analysis plan}. \sidenote{\url{https://dimewiki.worldbank.org/wiki/Pre-Analysis_Plan}} Use the list of key outcomes to create an outline of questionnaire \textit{modules} (do not number the modules yet; instead use a short prefix so they can be easily reordered). For each module, determine if the module is applicable to the full sample, the appropriate respondent, and whether (or how often), the module should be repeated. A few examples: a module on maternal health only applies to household with a woman who has children, a household income module should be answered by the person responsible for household finances, and a module on agricultural production might be repeated for each crop the household cultivated. -Each module should then be expanded into specific indicators to observe in the field. -\sidenote{\url{https://dimewiki.worldbank.org/wiki/Literature_Review_for_Questionnaire}} -At this point, it is useful to do a -\textbf{content-focused pilot} \sidenote{\url{https://dimewiki.worldbank.org/wiki/Piloting_Survey_Content}} of the questionnaire. +Each module should then be expanded into specific indicators to observe in the field. 
\sidenote{\url{https://dimewiki.worldbank.org/wiki/Literature_Review_for_Questionnaire}} +At this point, it is useful to do a \textbf{content-focused pilot} \sidenote{\url{https://dimewiki.worldbank.org/wiki/Piloting_Survey_Content}} of the questionnaire. Doing this pilot with a pen-and-paper questionnaire encourages more significant revisions, as there is no need to factor in costs of re-programming, and as a result improves the overall quality of the survey instrument. \subsection{Questionnaire design considerations for quantitative analysis} @@ -42,21 +40,13 @@ \subsection{Questionnaire design considerations for quantitative analysis} From a data perspective, questions with pre-coded response options are always preferable to open-ended questions (the content-based pilot is an excellent time to ask open-ended questions, and refine responses for the final version of the questionnaire). Coding responses helps to ensure that the data will be useful for quantitative analysis. Two examples help illustrate the point. First, instead of asking ``How do you feel about the proposed policy change?'', use techniques like \textbf{Likert scales}\sidenote{\textbf{Likert scale:} an ordered selection of choices indicating the respondent's level of agreement or disagreement with a proposed statement.}. Second, if collecting data on medication use or supplies, you could collect: the brand name of the product; the generic name of the product; the coded compound of the product; or the broad category to which each product belongs (antibiotic, etc.). All four may be useful for different reasons, but the latter two are likely to be the most useful for data analysis. The coded compound requires providing a translation dictionary to field staff, but enables automated rapid recoding for analysis with no loss of information. The generic class requires agreement on the broad categories of interest, but allows for much more comprehensible top-line statistics and data quality checks. -\textit{Sample tracking:} -\textbf{Extensive tracking} sections - in which reasons for \textbf{attrition}, treatment \textbf{contamination}, and \textbf{loss to follow-up} are documented - -\index{attrition}\index{contamination} -are essential data components for completing CONSORT records. +\textit{Sample tracking:} it is essential to document the reasons for \textbf{attrition}, treatment \textbf{contamination}, and \textbf{loss to follow-up} - +\index{attrition}\index{contamination} are essential data components for completing CONSORT records. \sidenote[][-3.5cm]{\textbf{CONSORT:} a standardized system for reporting enrollment, intervention allocation, follow-up, and data analysis through the phases of a randomized trial of two groups. Begg, C., Cho, M., Eastwood, S., Horton, R., Moher, D., Olkin, I., Pitkin, R., Rennie, D., Schulz, K. F., Simel, D., et al. (1996). Improving the quality of reporting of randomized controlled trials: The CONSORT statement. \textit{JAMA}, 276(8):637--639} \textit{How to name questions:} -% needs update -There is not yet a full consensus over how individual questions should be identified: -formats like \texttt{hq\_1} are hard to remember and unpleasant to reorder, -but formats like \texttt{hq\_asked\_about\_loans} quickly become cumbersome. - -\textit{importance of a unique ID:} -When ID matching and tracking across rounds is essential, the survey should be prepared to verify new data -against \textbf{preloaded data} from master records or from other rounds. 
+There is debate over how individual questions should be identified: formats like \texttt{hq\_1} are hard to remember and unpleasant to reorder, but formats like \texttt{hq\_asked\_about\_loans} quickly become cumbersome. +We recommend using descriptive names, but with clear prefixes so that variables within a module stay together when sorted alphabetically. {\sidenote{\url{https://medium.com/@janschenk/variable-names-in-survey-research-a18429d2d4d8}} Variable names should never include spaces or mixed cases (all lower case is best). Take care with the length: very long names will be cut off in certain software, which could result in a loss of uniqueness. \subsection{Content-focused Pilot} A \textbf{Survey Pilot} \sidenote{\url{https://dimewiki.worldbank.org/wiki/Survey_Pilot}} is the final step of questionnaire design. @@ -73,46 +63,43 @@ \section{Programming CAPI questionnaires} Most data collection is now done using software tools specifically designed for surveys. CAPI surveys \sidenote{\url{https://dimewiki.worldbank.org/wiki/Computer-Assisted_Personal_Interviews_(CAPI)}} are typically created in a spreadsheet (e.g. Excel or Google Sheets), or software-specific form builder. \sidenote{\url{https://dimewiki.worldbank.org/wiki/Questionnaire_Programming}} As these are typically accessible even to novice users, we will not address software-specific form design in this book. Rather, we focus coding conventions that are important to follow regardless of CAPI software choice. -\sidenote{\url{https://dimewiki.worldbank.org/wiki/SurveyCTO_Coding_Practices}} +\sidenote{\url{https://dimewiki.worldbank.org/wiki/SurveyCTO_Coding_Practices}} CAPI software tools provide a wide range of features designed to make implementing even highly complex surveys easy, scalable, and secure. However, these are not fully automatic: you still need to actively design and manage the survey. Each software has specific practices that you need to follow to take advantage of CAPI features and ensure that the exported data is compatible with the software that will be used for analysis. +\subsection{CAPI workflow} The starting point for questionnaire programming is a complete paper version of the questionnaire, piloted for content, and translated where needed. Starting the programming at this point reduces version control issues that arise from making significant changes to concurrent paper and electronic survey instruments. Most importantly, it means the research, not the technology, drives the questionnaire design. When you start programming, do not start with the first question and program your way to the last question. Instead, code from high level to small detail, following the same questionnaire outline established at design phase. The outline provides the basis for pseudocode, allowing you to start with high level structure and work down to the level of individual questions. This will save time and reduce errors. -CAPI software tools provide a wide range of features designed to make implementing even highly complex surveys easy, scalable, and secure. However, these are not fully automatic: you still need to actively design and manage the survey. Each software has specific practices that you need to follow to take advantage of CAPI features and ensure that the exported data is compatible with the software that will be used for analysis. 
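To make the outline-first approach concrete, the pseudocode skeleton for a questionnaire might look like the following, where the module names are purely illustrative:

    * Questionnaire skeleton -- pseudocode written before any platform-specific programming
    * MOD ID  Identification and consent   (all households)
    * MOD HR  Household roster             (repeat per member)
    * MOD IN  Income and expenditure       (respondent: person managing finances)
    * MOD AG  Agricultural production      (repeat per crop cultivated)
    * MOD MH  Maternal health              (only if a woman with children is present)
    * Within each module: list questions, coded response options, and constraints

Only once this structure is stable should individual questions be programmed in the CAPI platform.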
- \subsection{CAPI features} \begin{itemize} - \item{Survey logic}: build all skip patterns in to the survey instrument, to prevent enumerator errors related to the flow of questions - \item{Range checks}: -\item{Unique Identifier} -As it is critical to be able to uniquely identify each observation, and link it to the original sample, build in a programming check to confirm the numeric ID and identifying details of the household. -\item{etc} + \item{Survey logic}: build all skip patterns into the survey instrument, to ensure that only relevant questions are asked. This covers simple skip codes, as well as more complex interdependencies (e.g., a child health module is only asked to households that report the presence of a child under 5) + \item{Range checks}: add range checks for all numeric variables to catch data entry mistakes (e.g. age must be less than 120) + \item{Confirmation of key variables}: require double entry of essential information (such as a contact phone number in a survey with planned phone follow-ups), with automatic validation that the two entries match + \item{Multimedia}: electronic questionnaires facilitate collection of images, video, and geolocation data directly during the survey, using the camera and GPS built into the tablet or phone. + \item{Preloaded data}: data from previous rounds or related surveys can be used to prepopulate certain sections of the questionnaire, and validated during the interview. + \item{Sortable response options}: filters reduce the number of response options dynamically (e.g. filtering the cities list based on the state provided). + \item{Location checks}: enumerators submit their actual location using in-built GPS, to confirm they are in the right place for the interview. + \item{Consistency checks}: check that answers to related questions align, and trigger a warning if not so that enumerators can probe further. For example, if a household reports producing 800 kg of maize, but selling 900 kg of maize from their own production. + \item{Calculations}: make the electronic survey instrument do all math, rather than relying on the enumerator or asking them to carry a calculator. \end{itemize} \subsection{Compatibility with analysis software} The \texttt{ietestform} command,\sidenote{\url{https://dimewiki.worldbank.org/wiki/ietestform}} part of \texttt{iefieldkit}, implements form-checking routines for \textbf{SurveyCTO}, a proprietary implementation of \textbf{Open Data Kit (ODK)}. -However, since they make extensive use of logical structure and relate directly to the data that will be used later, -both the field team and the data team should collaborate to make sure that the survey suits all needs.\cite{krosnick2018questionnaire} - \subsection{Data-focused Pilot} The final stage of questionnaire programming is another Survey Pilot. The objective of the Data-focused Pilot \sidenote{\url{https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Data)&printable=yes}} is to validate programming and export a sample dataset. Significant desk-testing of the instrument is required to debug the programming as fully as possible before going to the field. It is important to plan for multiple days of piloting, so that any further debugging or other revisions to the electronic survey instrument can be made at the end of each day and tested the following, until no further field errors arise. 
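The logic of these programmed checks can also be mirrored when desk-testing the exported data during the data-focused pilot. As a hypothetical Stata sketch of the consistency check described above:

    * Re-check a programmed consistency rule in the exported pilot data
    use "pilot_export.dta", clear
    list hh_id maize_prod_kg maize_sold_kg ///
        if maize_sold_kg > maize_prod_kg & !missing(maize_prod_kg, maize_sold_kg)

Any cases listed here indicate either a programming error in the instrument or a data issue to raise with the field team.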
The Data-focused pilot should be done in advance of Enumerator training - %------------------------------------------------ \section{Data quality assurance} - A huge advantage of CAPI surveys, compared to traditional paper surveys, is the ability to access and analyze the data in real time. This greatly simplifies monitoring and improves data quality assurance. As part of survey preparation, the research team should develop a \textbf{data quality assurance plan} \sidenote{\url{https://dimewiki.worldbank.org/wiki/Data_Quality_Assurance_Plan}}. While data collection is ongoing, a research assistant or data analyst should work closely with the field team to ensure that the survey is progressing correctly, and perform \textbf{high-frequency checks (HFCs)} of the incoming data. \sidenote{\url{https://github.com/PovertyAction/high-frequency-checks/wiki}} Ensuring high quality data requires a combination of both real-time data checks and field monitoring. For this book, we focus on high-frequency data checks, and specific data-related considerations for field monitoring. -High-frequency checks should carefully inspect key treatment and outcome variables so that the data quality of core experimental variables is uniformly high, -and that additional field effort is centered where it is most important. Data quality checks should be run on the data every time it is downloaded (ideally on a daily basis), to flag irregularities in survey progress, sample completeness or response quality. -\texttt{ipacheck}\sidenote{\url{https://github.com/PovertyAction/high-frequency-checks}} +High-frequency checks should carefully inspect key treatment and outcome variables so that the data quality of core experimental variables is uniformly high, and that additional field effort is centered where it is most important. +Data quality checks should be run on the data every time it is downloaded (ideally on a daily basis), to flag irregularities in survey progress, sample completeness or response quality. \texttt{ipacheck}\sidenote{\url{https://github.com/PovertyAction/high-frequency-checks}} is a very useful command that automates some of these tasks. \subsection{Sample completeness} @@ -136,26 +123,33 @@ \subsection{Sample completeness} and loss to follow-up occurred in the field and how they were implemented and resolved. -\subsection{response quality} -As discussed above, modern survey software makes it relatively easy to control for issues in individual surveys as part of the questionnaire programming, using a combination of in-built features such as hard constraints on answer ranges and soft confirmations or validation questions. These features allow you to spend more time looking for issues that the software cannot check automatically, such as consistency across multiple responses or suspicious timing or resopnse patters from specific enumerators. -\sidenote{\url{https://dimewiki.worldbank.org/wiki/Monitoring_Data_Quality}} +\subsection{Response quality} +As discussed above, modern survey software makes it relatively easy to control for issues in individual surveys as part of the questionnaire programming, using a combination of in-built features such as hard constraints on answer ranges and soft confirmations or validation questions. These features allow you to spend more time looking for issues that the software cannot check automatically, such as consistency across multiple responses or suspicious timing or response patters from specific enumerators. 
+\sidenote{\url{https://dimewiki.worldbank.org/wiki/Monitoring_Data_Quality}} Survey software also provides rich metadata, that can be useful in assessing interview quality. For example, automatically collected time stamps show how long enumerators spent per question, and trace histories show how many times answers were changed before the survey was submitted. \subsection{Data considerations for field monitoring} -Careful monitoring of field work is essential for high quality data. -\textbf{Back-checks}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Back_Checks}} -and other survey audits help ensure that enumerators are following established protocols, and are not falsifying data. -For back-checks, a random subset of the field sample is chosen and a subset of information from the full survey is verified through a brief interview with the original respondent. Design of the backcheck questionnaire follows the same survey design principles discussed above, in particular you should use the pre-analysis plan or list of key outcomes to establish which subset of variables to prioritize. +Careful monitoring of field work is essential for high quality data. +\textbf{Back-checks}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Back_Checks}} and other survey audits help ensure that enumerators are following established protocols, and are not falsifying data. +For back-checks, a random subset of the field sample is chosen and a subset of information from the full survey is verified through a brief interview with the original respondent. +Design of the backcheck questionnaire follows the same survey design principles discussed above, in particular you should use the pre-analysis plan or list of key outcomes to establish which subset of variables to prioritize. + +Real-time access to the survey data increases the potential utility of backchecks dramatically, and both simplifies and improves the rigor of related workflows. +You can use the raw data to ensure that the backcheck sample is appropriately apportioned across interviews and survey teams. +As soon as backchecks are done, the backcheck data can be tested against the original data to identify areas of concern in real-time. +\texttt{bcstats} is a useful tool for analyzing back-check data in Stata module. \sidenote{\url{https://ideas.repec.org/c/boc/bocode/s458173.html}} + +CAPI surveys also provide a unique opportunity to do audits through audio recordings of the interview, typically short recordings triggered at random throughout the questionnaire. +\textbf{Audio audits} are a useful means to assess whether the enumerator is conducting the interview as expected (and not sitting under a tree making up data). -Real-time access to the survey data increases the potential utility of backchecks dramatically, and both simplifies and improves the rigor of related workflows. You can use the raw data to ensure that the backcheck sample is appropriately apportioned across interviews and survey teams. As soon as backchecks are done, the backcheck data can be tested against the original data to identify areas of concern in real-time. -\texttt{bcstats} is a useful tool for analyzing back-check data in Stata module. -\sidenote{\url{https://ideas.repec.org/c/boc/bocode/s458173.html}} +\textcolor{red}{ +\subsection{Dashboard} +Do we want to include something here about displaying HFCs? 
} %------------------------------------------------ \section{Collecting Data Securely} -Primary data collection almost always includes -\textbf{personally-identifiable information (PII)} +Primary data collection almost always includes \textbf{personally-identifiable information (PII)} \sidenote{\url{https://dimewiki.worldbank.org/wiki/Personally_Identifiable_Information_(PII)}}. PII must be handled with great care at all points in the data collection and management process, to comply with ethical requirements and avoid breaches of confidentiality. Access to PII must be restricted to team members granted that permission by the applicable Institutional Review Board or a data licensing agreement with a partner agency. Research teams must maintain strict protocols for data security at each stage of the process: data collection, storage, and sharing. From 7a13a9b0c220e0b598ffbb05303fedaa678a9a02 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Tue, 14 Jan 2020 11:34:13 -0500 Subject: [PATCH 181/854] [ch 1] one more point why email is not note-taking --- chapters/handling-data.tex | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 4cfdabb2f..a57d33867 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -176,7 +176,8 @@ \subsection{Research transparency} not a one-time requirement or retrospective task. New decisions are always being made as the plan begins contact with reality, and there is nothing wrong with sensible adaptation so long as it is recorded and disclosed. -(Email is \textit{not} a note-taking service, because communications are rarely well-ordered and can be easily deleted.) +(Email is \textit{not} a note-taking service, because communications are rarely well-ordered, +can be easily deleted, and is not available for future team members.) There are various software solutions for building documentation over time. The \textbf{Open Science Framework}\sidenote{\url{https://osf.io/}} provides one such solution,\index{Open Science Framework} From b181552224edd44fd6213ad5ca4edc252bc87e65 Mon Sep 17 00:00:00 2001 From: Maria Date: Tue, 14 Jan 2020 11:45:48 -0500 Subject: [PATCH 182/854] Ch5 re-write Update to "Collecting data securely" section --- chapters/data-collection.tex | 59 +++++++++++++++++++++++++----------- 1 file changed, 42 insertions(+), 17 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 171a16730..72fd1c63b 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -153,28 +153,25 @@ \section{Collecting Data Securely} \sidenote{\url{https://dimewiki.worldbank.org/wiki/Personally_Identifiable_Information_(PII)}}. PII must be handled with great care at all points in the data collection and management process, to comply with ethical requirements and avoid breaches of confidentiality. Access to PII must be restricted to team members granted that permission by the applicable Institutional Review Board or a data licensing agreement with a partner agency. Research teams must maintain strict protocols for data security at each stage of the process: data collection, storage, and sharing. 
-\subsection{Securing data in the field}
+\subsection{Secure data in the field}
All mainstream data collection software automatically \textbf{encrypt} \sidenote{\textbf{Encryption:} the process of making information unreadable to anyone without access to a specific deciphering key.} \sidenote{\url{https://dimewiki.worldbank.org/wiki/Encryption}} all data submitted from the field while in transit (i.e., upload or download). Your data will be encrypted from the time it leaves the device (in tablet-assisted data collection) or your browser (in web data collection), until it reaches the server.
Therefore, as long as you are using an established CAPI software, this step is largely taken care of. However, the research team must ensure that all computers, tablets, and accounts used have a logon password and are never left unlocked.

-\subsection{Securing data on the server}
-\textbf{Encryption at rest} is the only way to ensure that PII data remains private when it is stored on someone else's server on the open internet.
-Encryption makes data files completely unusable without access to a security key specific to that data --
-a higher level of security than password-protection. You must keep your data encrypted on the server whenever PII data is collected.
-
-Encryption in cloud storage is not enabled by default. The service will not encrypt user data unless you confirm you know how to operate the encryption system and understand the consequences if basic protocols are not followed.
-Encryption at rest is different from password-protection: encryption at rest makes the underlying data itself unreadable, even if accessed, except to users who have a specific private \textbf{keyfile}. Encryption at rest requires active participation from you, the user, and you should be fully aware that if your private key is lost, there is absolutely no way to recover your data.
+\subsection{Secure data storage}
+\textbf{Encryption at rest} is the only way to ensure that PII data remains private when it is stored on someone else's server on the open internet. You must keep your data encrypted on the server whenever PII data is collected.
+Encryption makes data files completely unusable without access to a security key specific to that data -- a higher level of security than password-protection.
+Encryption at rest requires active participation from the user, and you should be fully aware that if your private key is lost, there is absolutely no way to recover your data.
+You should not assume that your data is encrypted by default: indeed, for most CAPI software platforms, encryption needs to be enabled by the user.
+To enable it, you must confirm you know how to operate the encryption system and understand the consequences if basic protocols are not followed. When you enable encryption, the service will allow you to download -- once -- the keyfile pair needed to decrypt the data.
+You must download and store this in a secure location, such as a password manager. Make sure you store keyfiles with descriptive names to match the survey to which they correspond. Any time anyone accesses the data - either when viewing it in the browser or downloading it to your computer - they will be asked to provide the keyfile.
+Only project team members named in the IRB are allowed access to the private keyfile.
- -\subsection{Securing stored data} -How should you ensure data security once downloaded to a computer? -The workflow for securely receiving and storing data looks like this: +To proceed with data analysis, you typically need a working copy of the data accessible from a personal computer. The following workflow allows you to receive data from the server and store it securely, without compromising data security. \begin{enumerate} \item Download data @@ -183,12 +180,40 @@ \subsection{Securing stored data} \end{enumerate} -This handling satisfies the rule of three: there are two on-site copies of the data and one off-site copy, so the data can never be lost in case of hardware failure. In addition, you should ensure that all teams take basic precautions to ensure the security of data, as most problems are due to human error. -Ideally, the machine hard drives themselves should also be encrypted, as well as any external hard drives or flash drives used. All files sent to the field containing PII data, such as sampling lists, should at least be password protected. This can be done using a zip-file creator. -You must never share passwords by email; rather, use a secure password manager. This significantly mitigates the risk in case there is a security breach such as loss, theft, hacking, or a virus, with little impact on day-to-day utilization. +This handling satisfies the rule of three: there are two on-site copies of the data and one off-site copy, so the data can never be lost in case of hardware failure. +In addition, you should ensure that all teams take basic precautions to ensure the security of data, as most problems are due to human error. +Ideally, the machine hard drives themselves should also be encrypted, as well as any external hard drives or flash drives used. +All files sent to the field containing PII data, such as sampling lists, should at least be password protected. This can be done using a zip-file creator. +You must never share passwords by email; rather, use a secure password manager. +This significantly mitigates the risk in case there is a security breach such as loss, theft, hacking, or a virus, with little impact on day-to-day utilization. + +\subsection{Secure data sharing} +To simplify workflow, it is best to remove PII variables from your data at the earliest possible opportunity, and save a de-identified copy of the data. +Once the data is de-identified, it no longer needs to be encrypted - therefore you can interact with it directly, without having to provide the keyfile. + +We recommend de-identification in two stages: an initial process to remove direct identifiers to create a working de-identified dataset, and a final process to remove all possible identifiers to create a publishable dataset. +The \textbf{initial de-identification} should happen directly after the encrypted data is downloaded to disk. At this time, for each variable that contains PII, ask: will this variable be needed for analysis? +If not, the variable should be dropped. Examples include respondent names, enumerator names, interview date, respondent phone number. +If the variable is needed for analysis, ask: can I encode or otherwise construct a variable to use for the analysis that masks the PII, and drop the original variable? +Examples include: geocoordinates - after construction measures of distance or area, the specific location is often not necessary; and names for social network analysis, which can be encoded to unique numeric IDs. 
+If PII variables are directly required for the analysis itself, it will be necessary to keep at least a subset of the data encrypted through the data analysis process. + +Flagging all potentially identifying variables in the questionnaire design stage, as recommended above, simplifies the initial de-identification. +You already have the list of variables to assess, and ideally have already assessed those against the pre-analysis plan. +If so, all you need to do is write a script to drop the variables that are not required for analysis, encode or otherwise mask those that are required, and save a working version of the data. + +The \textbf{final de-identification} is a more involved process, with the objective of creating a dataset for publication that cannot be manipulated or linked to identify any individual research participant. +You must remove all direct and indirect identifiers, and assess the risk of statistical disclosure. \sidenote{Disclosure risk: the likelihood that a released data record can be associated with an individual or organization}. +\index{statistical disclosure} +There will almost always be a trade-off between accuracy and privacy. For publicly disclosed data, you should always favor privacy. +There are a number of useful tools for de-identification: PII scanners for Stata \sidenote{\url{https://github.com/J-PAL/stata_PII_scan}} or R \sidenote{\url{https://github.com/J-PAL/PII-Scan}}, +and tools for statistical disclosure control. \sidenote{\url{https://sdcpractice.readthedocs.io/en/latest/#}} +In cases where PII data is required for analysis, we recommend embargoing the sensitive variables when publishing the data. +Access to the embargoed data could be granted for the purposes of study replication, if approved by an IRB. + -With the raw data securely stored and backed up, you are ready to move to de-identification, data cleaning, and analysis. +With the raw data securely stored and backed up, and a de-identified dataset to work with, you are ready to move to data cleaning, and analysis. %------------------------------------------------ From 2e06867a9a404b0c9df38dfdb542a9d957751ad4 Mon Sep 17 00:00:00 2001 From: Maria Date: Tue, 14 Jan 2020 12:19:41 -0500 Subject: [PATCH 183/854] Ch5 re-write Data quality assurance section updated --- chapters/data-collection.tex | 39 ++++++++++++++++++------------------ 1 file changed, 20 insertions(+), 19 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 72fd1c63b..8c0e0f519 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -92,40 +92,40 @@ \subsection{Data-focused Pilot} %------------------------------------------------ \section{Data quality assurance} -A huge advantage of CAPI surveys, compared to traditional paper surveys, is the ability to access and analyze the data in real time. -This greatly simplifies monitoring and improves data quality assurance. As part of survey preparation, the research team should develop a -\textbf{data quality assurance plan} \sidenote{\url{https://dimewiki.worldbank.org/wiki/Data_Quality_Assurance_Plan}}. While data collection is ongoing, a research assistant or data analyst should work closely with the field team to ensure that the survey is progressing correctly, and perform \textbf{high-frequency checks (HFCs)} of the incoming data. +A huge advantage of CAPI surveys, compared to traditional paper surveys, is the ability to access and analyze the data while the survey is ongoing. 
+Data issues can be identified and resolved in real-time. Designing systematic data checks, and running them routinely throughout data collection, simplifies monitoring and improves data quality. +As part of survey preparation, the research team should develop a \textbf{data quality assurance plan} \sidenote{\url{https://dimewiki.worldbank.org/wiki/Data_Quality_Assurance_Plan}}. +While data collection is ongoing, a research assistant or data analyst should work closely with the field team to ensure that the survey is progressing correctly, and perform \textbf{high-frequency checks (HFCs)} of the incoming data. \sidenote{\url{https://github.com/PovertyAction/high-frequency-checks/wiki}} -Ensuring high quality data requires a combination of both real-time data checks and field monitoring. For this book, we focus on high-frequency data checks, and specific data-related considerations for field monitoring. +Data quality assurance requires a combination of both real-time data checks, survey audits, and field monitoring. Although field monitoring is critical for a successful survey, we focus on the first two in this chapter, as they are the most directly data related. + +\subsection{High Frequency Checks} High-frequency checks should carefully inspect key treatment and outcome variables so that the data quality of core experimental variables is uniformly high, and that additional field effort is centered where it is most important. Data quality checks should be run on the data every time it is downloaded (ideally on a daily basis), to flag irregularities in survey progress, sample completeness or response quality. \texttt{ipacheck}\sidenote{\url{https://github.com/PovertyAction/high-frequency-checks}} is a very useful command that automates some of these tasks. -\subsection{Sample completeness} -It is important to check every day that the households interviewed match the survey sample. Reporting errors and duplicate observations in real-time allows the field team to make corrections efficiently. +It is important to check every day that the households interviewed match the survey sample. +Many CAPI software programs include case management features, through which sampled units are directly assigned to individual enumerators. +Even with careful management, it is often the case that raw data includes duplicate entries, which may occur due to field errors or duplicated submissions to the server. \sidenote{\url{https://dimewiki.worldbank.org/wiki/Duplicates_and_Survey_Logs}} -It also helps the team track attrition, so that it is clear early on if a change in protocols or additional tracking will be needed. It is also important to check interview progress and sample compliance by surveyor and survey team, to identify any under-performing individuals or teams. - -To assess sample completeness, observations first need to be checked for duplicate entries, which may occur due to field errors or duplicated submissions to the server. \texttt{ieduplicates}\sidenote{\url{https://dimewiki.worldbank.org/wiki/ieduplicates}} provides a workflow for collaborating on the resolution of duplicate entries between you and the field team. Next, observed units in the data must be validated against the expected sample: this is as straightforward as \texttt{merging} the sample list with the survey data and checking for mismatches. +Reporting errors and duplicate observations in real-time allows the field team to make corrections efficiently. 
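A bare-bones version of such a daily check, written in Stata with hypothetical file and variable names, could look like:

    * Daily completeness check on the downloaded survey data
    use "survey_download.dta", clear

    * Flag duplicate household IDs for the field team to resolve
    duplicates tag hh_id, generate(dup_flag)
    list hh_id submissiondate if dup_flag > 0

    * Compare observed units against the expected sample
    merge m:1 hh_id using "sample_list.dta"
    tabulate _merge   // 1 = interviewed but not in sample; 2 = sampled but not yet interviewed

(\texttt{ieduplicates}, mentioned above, wraps the duplicate-resolution step into a reusable workflow shared with the field team.)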
+Tracking survey progress is important for monitoring attrition, so that it is clear early on if a change in protocols or additional tracking will be needed.
+It is also important to check interview completion rate and sample compliance by surveyor and survey team, to identify any under-performing individuals or teams.

-When all data collection is complete, the survey team should have a final field report ready for validation against the sample frame and the dataset.
-This should contain all the observations that were completed; it should merge perfectly with the received dataset; and it should report reasons for any observations not collected.
-Identification and reporting of \textbf{missing data} and \textbf{attrition} is critical to the interpretation of any survey dataset.
+When all data collection is complete, the survey team should have a final field report, which should report reasons for any deviations between the original sample and the dataset collected.
+Identification and reporting of \textbf{missing data} and \textbf{attrition} is critical to the interpretation of survey data.
It is important to structure this reporting in a way that not only groups broad rationales into specific categories
but also collects all the detailed, open-ended responses to questions the field team can provide for any observations that they were unable to complete.
This reporting should be validated and saved alongside the final raw data, and treated the same way.
This information should be stored as a dataset in its own right -- a \textbf{tracking dataset} -- that records all events in which survey substitutions
and loss to follow-up occurred in the field and how they were implemented and resolved.

-\subsection{Response quality}
-As discussed above, modern survey software makes it relatively easy to control for issues in individual surveys as part of the questionnaire programming, using a combination of in-built features such as hard constraints on answer ranges and soft confirmations or validation questions. These features allow you to spend more time looking for issues that the software cannot check automatically, such as consistency across multiple responses or suspicious timing or response patters from specific enumerators.
+High frequency checks should also include survey-specific data checks. As CAPI software incorporates many data control features, discussed above, these checks should focus on issues CAPI software cannot check automatically. As most of these checks are survey-specific, it is difficult to provide general guidance. An in-depth knowledge of the questionnaire, and a careful examination of the pre-analysis plan, is the best preparation. Examples include consistency across multiple responses, complex calculations, suspicious patterns in survey timing, or atypical response patterns from specific enumerators. \sidenote{\url{https://dimewiki.worldbank.org/wiki/Monitoring_Data_Quality}} CAPI software typically provides rich metadata, which can be useful in assessing interview quality.
For example, automatically collected time stamps show how long enumerators spent per question, and trace histories show how many times answers were changed before the survey was submitted. \subsection{Data considerations for field monitoring} @@ -135,12 +135,13 @@ \subsection{Data considerations for field monitoring} Design of the backcheck questionnaire follows the same survey design principles discussed above, in particular you should use the pre-analysis plan or list of key outcomes to establish which subset of variables to prioritize. Real-time access to the survey data increases the potential utility of backchecks dramatically, and both simplifies and improves the rigor of related workflows. -You can use the raw data to ensure that the backcheck sample is appropriately apportioned across interviews and survey teams. -As soon as backchecks are done, the backcheck data can be tested against the original data to identify areas of concern in real-time. +You can use the raw data to draw the backcheck sample; assuring it is appropriately apportioned across interviews and survey teams. +As soon as backchecks are complete, the backcheck data can be tested against the original data to identify areas of concern in real-time. \texttt{bcstats} is a useful tool for analyzing back-check data in Stata module. \sidenote{\url{https://ideas.repec.org/c/boc/bocode/s458173.html}} CAPI surveys also provide a unique opportunity to do audits through audio recordings of the interview, typically short recordings triggered at random throughout the questionnaire. -\textbf{Audio audits} are a useful means to assess whether the enumerator is conducting the interview as expected (and not sitting under a tree making up data). +\textbf{Audio audits} are a useful means to assess whether the enumerator is conducting the interview as expected (and not sitting under a tree making up data). +Do note, however, that audio audits must be included in the Informed Consent. \textcolor{red}{ \subsection{Dashboard} From f541a2e458eca3686a2eaa51cf12d4c0f1481247 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 14 Jan 2020 12:26:51 -0500 Subject: [PATCH 184/854] Typo --- chapters/handling-data.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index a57d33867..2ccebc2d5 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -177,7 +177,7 @@ \subsection{Research transparency} New decisions are always being made as the plan begins contact with reality, and there is nothing wrong with sensible adaptation so long as it is recorded and disclosed. (Email is \textit{not} a note-taking service, because communications are rarely well-ordered, -can be easily deleted, and is not available for future team members.) +can be easily deleted, and are not available for future team members.) There are various software solutions for building documentation over time. 
The \textbf{Open Science Framework}\sidenote{\url{https://osf.io/}} provides one such solution,\index{Open Science Framework} From a2b89cdac56481ce21c5019210945b43135e43a2 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Tue, 14 Jan 2020 12:33:02 -0500 Subject: [PATCH 185/854] [CH 1] - link to github --- chapters/handling-data.tex | 151 +++++++++++++++++++------------------ 1 file changed, 76 insertions(+), 75 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 2ccebc2d5..477528b82 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -1,40 +1,40 @@ %------------------------------------------------ \begin{fullwidth} -Development research does not just \textit{involve} real people -- it also \textit{affects} real people. -Policy decisions are made every day using the results of briefs and studies, -and these can have wide-reaching consequences on the lives of millions. -As the range and importance of the policy-relevant questions -asked by development researchers grow, -so does the (rightful) scrutiny under which methods and results are placed. -Additionally, research also involves looking deeply into real people's -personal lives, financial conditions, and other sensitive subjects. -The rights and responsibilities involved in having such access -to personal information are a core responsibility of collecting personal data. -Ethical scrutiny involves two major components: \textbf{data handling} and \textbf{research transparency}. -Performing at a high standard in both means that -consumers of research can have confidence in its conclusions, -and that research participants are appropriately protected. -What we call ethical standards in this chapter are a set of practices -for research quality and data management that address these two principles. - -Neither transparency nor privacy is an ``all-or-nothing'' objective. -We expect that teams will do as much as they can to make their work -conform to modern practices of credibility, transparency, and reproducibility. -Similarly, we expect that teams will ensure the privacy of participants in research -by intelligently assessing and proactively averting risks they might face. -We also expect teams will report what they have and have not done -in order to provide objective measures of a research product's performance in both. -Otherwise, reputation is the primary signal for the quality of evidence, and two failures may occur: -low-quality studies from reputable sources may be used as evidence when in fact they don't warrant it, -and high-quality studies from sources without an international reputation may be ignored. -Both these outcomes reduce the quality of evidence overall. -Even more importantly, they usually mean that credibility in development research accumulates at international institutions -and top global universities instead of the people and places directly involved in and affected by it. -Simple transparency standards mean that it is easier to judge research quality, -and making high-quality research identifiable also increases its impact. -This section provides some basic guidelines and resources -for using field data ethically and responsibly to publish research findings. + Development research does not just \textit{involve} real people -- it also \textit{affects} real people. + Policy decisions are made every day using the results of briefs and studies, + and these can have wide-reaching consequences on the lives of millions. 
+ As the range and importance of the policy-relevant questions + asked by development researchers grow, + so does the (rightful) scrutiny under which methods and results are placed. + Additionally, research also involves looking deeply into real people's + personal lives, financial conditions, and other sensitive subjects. + The rights and responsibilities involved in having such access + to personal information are a core responsibility of collecting personal data. + Ethical scrutiny involves two major components: \textbf{data handling} and \textbf{research transparency}. + Performing at a high standard in both means that + consumers of research can have confidence in its conclusions, + and that research participants are appropriately protected. + What we call ethical standards in this chapter are a set of practices + for research quality and data management that address these two principles. + + Neither transparency nor privacy is an ``all-or-nothing'' objective. + We expect that teams will do as much as they can to make their work + conform to modern practices of credibility, transparency, and reproducibility. + Similarly, we expect that teams will ensure the privacy of participants in research + by intelligently assessing and proactively averting risks they might face. + We also expect teams will report what they have and have not done + in order to provide objective measures of a research product's performance in both. + Otherwise, reputation is the primary signal for the quality of evidence, and two failures may occur: + low-quality studies from reputable sources may be used as evidence when in fact they don't warrant it, + and high-quality studies from sources without an international reputation may be ignored. + Both these outcomes reduce the quality of evidence overall. + Even more importantly, they usually mean that credibility in development research accumulates at international institutions + and top global universities instead of the people and places directly involved in and affected by it. + Simple transparency standards mean that it is easier to judge research quality, + and making high-quality research identifiable also increases its impact. + This section provides some basic guidelines and resources + for using field data ethically and responsibly to publish research findings. 
\end{fullwidth} %------------------------------------------------ @@ -53,7 +53,7 @@ \section{Protecting confidence in development research} Major publishers and funders, most notably the American Economic Association, have taken steps to require that these research components are accurately reported and preserved as outputs in themselves.\sidenote{ - \url{https://www.aeaweb.org/journals/policies/data-code/}} + \url{https://www.aeaweb.org/journals/policies/data-code/}} The empirical revolution in development research has therefore led to increased public scrutiny of the reliability of research.\cite{rogers_2017}\index{transparency}\index{credibility}\index{reproducibility} @@ -66,7 +66,7 @@ \section{Protecting confidence in development research} However, almost by definition, primary data that researchers use for such studies has never been reviewed by anyone else, so it is hard for others to verify that it was collected, handled, and analyzed appropriately.\sidenote{ - \url{https://blogs.worldbank.org/impactevaluations/what-development-economists-talk-about-when-they-talk-about-reproducibility}} + \url{https://blogs.worldbank.org/impactevaluations/what-development-economists-talk-about-when-they-talk-about-reproducibility}} Maintaining confidence in research via the components of credibility, transparency, and reproducibility is the most important way that researchers using primary data can avoid serious error, and therefore these are not by-products but core components of research output. @@ -75,7 +75,7 @@ \subsection{Research reproducibility} Reproducible research means that the actual analytical processes you used are executable by others.\cite{dafoe2014science} (We use ``reproducibility'' to refer to the precise analytical code in a specific study.\sidenote{ - \url{http://datacolada.org/76}}) + \url{http://datacolada.org/76}}) All your code files involving data cleaning, construction and analysis should be public, unless they contain identifying information. Nobody should have to guess what exactly comprises a given index, @@ -87,9 +87,10 @@ \subsection{Research reproducibility} Letting people play around with your data and code is a great way to have new questions asked and answered based on the valuable work you have already done.\sidenote{ - \url{https://blogs.worldbank.org/opendata/making-analytics-reusable}} -Services like GitHub that log your research process are valuable resources here. - \index{GitHub} + \url{https://blogs.worldbank.org/opendata/making-analytics-reusable}} +Services like \index{GitHub}GitHub\sidenote{ + \url{https://github.com}, GitHub will be discussed more in later chapters} +that log your research process are valuable resources here. Such services can show things like modifications made in response to referee comments, by having tagged version histories at each major revision. These services can also use issue trackers and abandoned work branches @@ -97,7 +98,7 @@ \subsection{Research reproducibility} as a resource to others who have similar questions. Secondly, reproducible research\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Reproducible_Research}} + \url{https://dimewiki.worldbank.org/wiki/Reproducible_Research}} enables other researchers to re-use your code and processes to do their own work more easily and effectively in the future. This may mean applying your techniques to their data @@ -110,7 +111,7 @@ \subsection{Research reproducibility} It should be easy to read and understand in terms of structure, style, and syntax. 
Finally, the corresponding dataset should be openly accessible unless for legal or ethical reasons it cannot be.\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Publishing_Data}} + \url{https://dimewiki.worldbank.org/wiki/Publishing_Data}} Reproducibility and transparency are not binary concepts: there's a spectrum, starting with simple materials publication. @@ -130,11 +131,11 @@ \subsection{Research transparency} Transparent research will expose not only the code, but all the other research processes involved in developing the analytical approach.\sidenote{ - \url{http://www.princeton.edu/~mjs3/open_and_reproducible_opr_2017.pdf}} + \url{http://www.princeton.edu/~mjs3/open_and_reproducible_opr_2017.pdf}} This means that readers be able to judge for themselves if the research was done well and the decision-making process was sound. If the research is well-structured, and all of the relevant documentation\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Data_Documentation}} + \url{https://dimewiki.worldbank.org/wiki/Data_Documentation}} is shared, this makes it easy for the reader to understand the analysis later. Expecting process transparency is also an incentive for researchers to make better decisions, be skeptical and thorough about their assumptions, @@ -143,10 +144,10 @@ \subsection{Research transparency} Tools like pre-registration, pre-analysis plans, and \textbf{Registered Reports}\sidenote{ - \url{https://blogs.worldbank.org/impactevaluations/registered-reports-piloting-pre-results-review-process-journal-development-economics}} + \url{https://blogs.worldbank.org/impactevaluations/registered-reports-piloting-pre-results-review-process-journal-development-economics}} can help with this process where they are available.\index{pre-registration}\index{pre-analysis plans}\index{Registered Reports} By pre-specifying a large portion of the research design,\sidenote{ - \url{https://www.bitss.org/2019/04/18/better-pre-analysis-plans-through-design-declaration-and-diagnosis/}} + \url{https://www.bitss.org/2019/04/18/better-pre-analysis-plans-through-design-declaration-and-diagnosis/}} a great deal of analytical planning has already been completed, and at least some research questions are pre-committed for publication regardless of the outcome. This is meant to combat the ``file-drawer problem'',\cite{simonsohn2014p} @@ -171,7 +172,7 @@ \subsection{Research transparency} since you have a record of why something was done in a particular way. There are a number of available tools that will contribute to producing documentation, - \index{project documentation} +\index{project documentation} but project documentation should always be an active and ongoing process, not a one-time requirement or retrospective task. New decisions are always being made as the plan begins contact with reality, @@ -183,7 +184,7 @@ \subsection{Research transparency} The \textbf{Open Science Framework}\sidenote{\url{https://osf.io/}} provides one such solution,\index{Open Science Framework} with integrated file storage, version histories, and collaborative wiki pages. \textbf{GitHub}\sidenote{\url{https://github.com}} provides a transparent documentation system\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Getting_started_with_GitHub}},\index{task management}\index{GitHub} + \url{https://dimewiki.worldbank.org/wiki/Getting_started_with_GitHub}},\index{task management}\index{GitHub} in addition to version histories and wiki pages. 
Such services offers multiple different ways to record the decision process leading to changes and additions, @@ -201,9 +202,9 @@ \subsection{Research credibility} Were the key research outcomes pre-specified or chosen ex-post? How sensitive are the results to changes in specifications or definitions? Tools such as \textbf{pre-analysis plans}\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Pre-Analysis_Plan}} + \url{https://dimewiki.worldbank.org/wiki/Pre-Analysis_Plan}} can be used to assuage these concerns for experimental evaluations - \index{pre-analysis plan} +\index{pre-analysis plan} by fully specifying some set of analysis intended to be conducted, but they may feel like ``golden handcuffs'' for other types of research.\cite{olken2015promises} Regardless of whether or not a formal pre-analysis plan is utilized, @@ -215,7 +216,7 @@ \subsection{Research credibility} the \textbf{3ie} database\sidenote{\url{http://ridie.3ieimpact.org/}}, the \textbf{eGAP} database\sidenote{\url{http://egap.org/content/registration/}}, or the \textbf{OSF} registry\sidenote{\url{https://osf.io/registries}} as appropriate.\sidenote{\url{http://datacolada.org/12}} - \index{pre-registration} +\index{pre-registration} Garden varieties of research standards from journals, funders, and others feature both ex ante (or ``regulation'') and ex post (or ``verification'') policies. @@ -254,17 +255,17 @@ \section{Ensuring privacy and security in research data} Anytime you are collecting primary data in a development research project, you are almost certainly handling data that include \textbf{personally-identifying -information (PII)}\index{personally-identifying information}\index{primary data}\sidenote{ -\textbf{Personally-identifying information:} any piece or set of information -that can be used to identify an individual research subject. -\url{https://dimewiki.worldbank.org/wiki/De-identification\#Personally\_Identifiable\_Information}}. + information (PII)}\index{personally-identifying information}\index{primary data}\sidenote{ + \textbf{Personally-identifying information:} any piece or set of information + that can be used to identify an individual research subject. + \url{https://dimewiki.worldbank.org/wiki/De-identification\#Personally\_Identifiable\_Information}}. PII data contains information that can, without any transformation, be used to identify individual people, households, villages, or firms that were included in \textbf{data collection}. - \index{data collection} +\index{data collection} This includes names, addresses, and geolocations, and extends to personal information such as email addresses, phone numbers, and financial information.\index{geodata}\index{de-identification} It is important to keep in mind these data privacy principles not only for the individual respondent but also the PII data of their household members or other caregivers who are covered under the survey. - \index{privacy} +\index{privacy} In some contexts this list may be more extensive -- for example, if you are working in an environment that is either small, specific, or has extensive linkable data sources available to others, @@ -279,13 +280,13 @@ \section{Ensuring privacy and security in research data} including approval, consent, security, and privacy. 
If you are a US-based researcher, you will become familiar with a set of governance standards known as ``The Common Rule''.\sidenote{ - \url{https://www.hhs.gov/ohrp/regulations-and-policy/regulations/common-rule/index.html}} + \url{https://www.hhs.gov/ohrp/regulations-and-policy/regulations/common-rule/index.html}} If you interact with European institutions or persons, you will also become familiar with ``GDPR'',\sidenote{ - \url{http://blogs.lshtm.ac.uk/library/2018/01/15/gdpr-for-research-data/}} + \url{http://blogs.lshtm.ac.uk/library/2018/01/15/gdpr-for-research-data/}} a set of regulations governing \textbf{data ownership} and privacy standards.\sidenote{ - \textbf{Data ownership:} the set of rights governing who may accesss, alter, use, or share data, regardless of who possesses it.} - \index{data ownership} + \textbf{Data ownership:} the set of rights governing who may accesss, alter, use, or share data, regardless of who possesses it.} +\index{data ownership} In all settings, you should have a clear understanding of who owns your data (it may not be you, even if you collect or possess it), the rights of the people whose information is reflected there, @@ -302,8 +303,8 @@ \subsection{Obtaining ethical approval and consent} For almost all data collection or research activities that involves PII data, you will be required to complete some form of \textbf{Institutional Review Board (IRB)} process.\sidenote{ - \textbf{Institutional Review Board (IRB):} An institution formally responsible for ensuring that research meets ethical standards.} - \index{Institutional Review Board} + \textbf{Institutional Review Board (IRB):} An institution formally responsible for ensuring that research meets ethical standards.} +\index{Institutional Review Board} Most commonly this consists of a formal application for approval of a specific protocol for consent, data collection, and data handling. An IRB which has sole authority over your project is not always apparent, @@ -346,7 +347,7 @@ \subsection{Obtaining ethical approval and consent} \subsection{Transmitting and storing data securely} Secure data storage and transfer are ultimately your personal responsibility.\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Data_Security}} + \url{https://dimewiki.worldbank.org/wiki/Data_Security}} First, all online and offline accounts -- including personal accounts like computer logins and email -- need to be protected by strong and unique passwords. @@ -356,8 +357,8 @@ \subsection{Transmitting and storing data securely} However, password-protection alone is not sufficient, because if the underlying data is obtained through a leak the information itself remains usable. Raw data which contains PII \textit{must} therefore be \textbf{encrypted}\sidenote{ - \textbf{Encryption:} Methods which ensure that files are unreadable even if laptops are stolen, databases are hacked, or any other type of unauthorized access is obtained. - \url{https://dimewiki.worldbank.org/wiki/encryption}} + \textbf{Encryption:} Methods which ensure that files are unreadable even if laptops are stolen, databases are hacked, or any other type of unauthorized access is obtained. + \url{https://dimewiki.worldbank.org/wiki/encryption}} during data collection, storage, and transfer.\index{encryption}\index{data transfer}\index{data storage} The biggest security gap is often in transmitting survey plans to and from staff in the field, since staff with technical specialization are usually in an HQ office. 
@@ -368,7 +369,7 @@ \subsection{Transmitting and storing data securely} Most modern data collection software has features that, if enabled, make secure transmission straightforward.\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/SurveyCTO_Form_Settings}} + \url{https://dimewiki.worldbank.org/wiki/SurveyCTO_Form_Settings}} Many also have features that ensure data is encrypted when stored on their servers, although this usually needs to be actively enabled and administered. Proper encryption means that, @@ -407,19 +408,19 @@ \subsection{Transmitting and storing data securely} \subsection{De-identifying and anonymizing information} Most of the field research done in development involves human subjects.\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Human_Subjects_Approval}} - \index{human subjects} + \url{https://dimewiki.worldbank.org/wiki/Human_Subjects_Approval}} +\index{human subjects} As a researcher, you are asking people to trust you with personal information about themselves: where they live, how rich they are, whether they have committed or been victims of crimes, their names, their national identity numbers, and all sorts of other data. PII data carries strict expectations about data storage and handling, and it is the responsibility of the research team to satisfy these expectations.\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Research_Ethics}} + \url{https://dimewiki.worldbank.org/wiki/Research_Ethics}} Your donor or employer will most likely require you to hold a certification from a source such as Protecting Human Research Participants\sidenote{ - \url{https://phrptraining.com}} + \url{https://phrptraining.com}} or the CITI Program.\sidenote{ - \url{https://about.citiprogram.org/en/series/human-subjects-research-hsr/}} + \url{https://about.citiprogram.org/en/series/human-subjects-research-hsr/}} In general, though, you shouldn't need to handle PII data very often once the data collection processes are completed. @@ -435,8 +436,8 @@ \subsection{De-identifying and anonymizing information} Therefore, once data is securely collected and stored, the first thing you will generally do is \textbf{de-identify} it.\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/De-identification}} - \index{de-identification} + \url{https://dimewiki.worldbank.org/wiki/De-identification}} +\index{de-identification} (We will provide more detail on this in the chapter on data collection.) This will create a working de-identified copy that can safely be shared among collaborators. @@ -458,7 +459,7 @@ \subsection{De-identifying and anonymizing information} These include \texttt{PII\_detection}\sidenote{\url{https://github.com/PovertyAction/PII\_detection}} from IPA, \texttt{PII-scan}\sidenote{\url{https://github.com/J-PAL/PII-Scan}} from JPAL, and \texttt{sdcMicro}\sidenote{\url{https://sdcpractice.readthedocs.io/en/latest/sdcMicro.html}} from the World Bank. - \index{anonymization} +\index{anonymization} The \texttt{sdcMicro} tool, in particular, has a feature that allows you to assess the uniqueness of your data observations, and simple measures of the identifiability of records from that. 
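+Whichever tool you use to flag risky variables, the first-pass de-identification itself is often a short script; a minimal sketch in Stata, with hypothetical file paths and variable names, is:
+\begin{verbatim}
+* Load the securely stored raw data (path and variable names are hypothetical)
+use "${raw}/household_survey_raw.dta", clear
+
+* Drop direct identifiers that are not needed for analysis
+drop respondent_name phone_number address gps_latitude gps_longitude
+
+* Keep the research ID, which links to identifiers only through a secure lookup file
+isid hhid
+
+* Save a de-identified copy that can be shared within the research team
+save "${deidentified}/household_survey_deid.dta", replace
+\end{verbatim}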
From 4e4c739343ad73a1fe73e2fdc19025daeb88e972 Mon Sep 17 00:00:00 2001
From: Luiza
Date: Mon, 13 Jan 2020 10:06:20 -0500
Subject: [PATCH 186/854] [ch 6] first stab

---
 bibliography.bib           |  12 +
 chapters/data-analysis.tex | 595 +++++++++++++++++++++++++++----------
 2 files changed, 448 insertions(+), 159 deletions(-)

diff --git a/bibliography.bib b/bibliography.bib
index 3a2272318..95c04735c 100644
--- a/bibliography.bib
+++ b/bibliography.bib
@@ -1,3 +1,15 @@
+@Article{tidy-data,
+  author = {Hadley Wickham},
+  issue = {10},
+  journal = {The Journal of Statistical Software},
+  selected = {TRUE},
+  title = {Tidy data},
+  url = {http://www.jstatsoft.org/v59/i10/},
+  volume = {59},
+  year = {2014},
+  bdsk-url-1 = {http://www.jstatsoft.org/v59/i10/},
+}
+
 @article{blischak2016quick,
   title={A quick introduction to version control with {Git} and {GitHub}},
   author={Blischak, John D and Davenport, Emily R and Wilson, Greg},
diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex
index 038f344e0..377e9de7f 100644
--- a/chapters/data-analysis.tex
+++ b/chapters/data-analysis.tex
@@ -1,179 +1,411 @@
 %------------------------------------------------

 \begin{fullwidth}
-Data analysis is hard. Making sense of a dataset in such a way
+Data analysis is hard. Making sense of a data set in such a way
 that makes a substantial contribution to scientific knowledge
 requires a mix of subject expertise, programming skills,
 and statistical and econometric knowledge.
 The process of data analysis is therefore typically
 a back-and-forth discussion between the various people
 who have differing experiences, perspectives, and research interests.
-The research assistant usually ends up being the fulcrum
-for this discussion, and has to transfer and translate
-results among people with a wide range of technical capabilities
+The research assistant usually ends up being the fulcrum for this discussion.
+It is the RA's job to translate the data received from the field
+into economically meaningful indicators, and to analyze them,
 while making sure that code and outputs do not become
 tangled and lost over time (typically months or years).
-Organization is the key to this task.
-The structure of files needs to be well-organized,
-so that any material can be found when it is needed.
-Data structures need to be organized,
-so that the various steps in creating datasets
-can always be traced and revised without massive effort.
-The structure of version histories and backups need to be organized,
-so that different workstreams can exist simultaneously
-and experimental analyses can be tried without a complex workflow.
-Finally, the outputs need to be organized,
-so that it is clear what results go with what analyses,
-and that each individual output is a readable element in its own right.
+When it comes to code, though, analysis is the easy part,
+as long as you have organized your data well.
+The econometrics behind data analysis is complex,
+but since this is not a book on econometrics,
+this chapter will focus on how to prepare your data for analysis.
+In fact, most of a research assistant's time is spent cleaning data
+and getting it into the right format.
 This chapter outlines how to stay organized
-so that you and the team can focus on getting the work right
-rather than trying to understand what you did in the past.
+so that you and the team can focus on the research.
+If the practices recommended here are adopted,
+it becomes much easier to use the commands already implemented
+in any statistical software to analyze the data.
+
 \end{fullwidth}

 %------------------------------------------------

-\section{Organizing primary survey data}
+\section{Data management}
+
+The goal of data management is to organize
+code, folders and data sets in such a manner that
+it is as easy as possible to follow a project's data work.
+In our experience, the four key elements of good data management are:
+folder structure, task breakdown, a master script and version control.
+A good folder structure organizes files
+so that any material can be found when it is needed.
+It creates a connection between code, data sets and outputs
+that makes it clear how they relate to one another,
+so they can always be traced and revised without massive effort.
+This folder structure reflects a task breakdown into clear steps
+with well-defined inputs, tasks and outputs,
+so it is clear at what point a data set or table was created,
+and by which script.
+The master do-file connects folder structure and code,
+runs all the scripts in the project in one click,
+and creates a clear map of the tasks done by each piece of code,
+as well as what file it requires and creates.
+Finally, version histories and backups enable the team
+to make changes without fear of losing information,
+make it possible to check instantly how an edit affected other files in the project,
+and allow different work streams to exist simultaneously.
+
+% Task breakdown
+We divide the process of turning raw data into analysis data into three stages:
+data cleaning, variable construction, and data analysis.
+Though they frequently happen simultaneously,
+we find that breaking them into separate scripts and data sets helps prevent mistakes.
+It will be easier to understand this division in the following sections,
+as we discuss what each stage comprises,
+but the point is that each of these stages has well-defined inputs and outputs.
+This makes it easier to track the tasks implemented in each script,
+and avoids duplication of code that could lead to inconsistent results.
+For each stage, there will be a code folder and a corresponding data folder
+so that it is clear where each of them is implemented and where its outputs live.
+The scripts and data sets for each of these should have self-explanatory and connected names.

-There are many schemes to organize primary survey data.
+% Folder structure
+There are many schemes to organize research data.
+Our preferred scheme reflects this task breakdown.
 \index{data organization}
-We provide the \texttt{iefolder}\sidenote{
-\url{https://dimewiki.worldbank.org/wiki/iefolder}}
+We created the \texttt{iefolder}\sidenote{
+ \url{https://dimewiki.worldbank.org/wiki/iefolder}}
 package (part of \texttt{ietoolkit}\sidenote{
-\url{https://dimewiki.worldbank.org/wiki/ietoolkit}}) so that
-a large number of teams will have identical folder structures.
+ \url{https://dimewiki.worldbank.org/wiki/ietoolkit}})
+based on our experience with primary survey data,
+but its principles can be used for different types of data.
+\texttt{iefolder} is designed to standardize folder structures across teams and projects.
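+Setting up this structure does not need to be done by hand.
+A minimal sketch of the setup in Stata is below; the project path is illustrative, and the exact option names may differ across versions of \texttt{ietoolkit}, so treat them as assumptions and check \texttt{help iefolder} before running anything:
+\begin{verbatim}
+* Install the ietoolkit package (includes iefolder) from SSC
+ssc install ietoolkit, replace
+
+* Create the standardized DataWork folder structure for a new project
+* (path is illustrative; option names may vary across versions)
+iefolder new project, projectfolder("C:/Users/me/Dropbox/project-name")
+\end{verbatim}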
 This means that PIs and RAs face very small costs
 when switching between projects, because all the materials
-will be organized in exactly the same basic way.\sidenote{\url{https://dimewiki.worldbank.org/wiki/DataWork_Folder}}
-Namely, within each survey round folder,\sidenote{\url{https://dimewiki.worldbank.org/wiki/DataWork_Survey_Round}}
-there will be dedicated folders for raw (encrypted) data;
+will be organized in exactly the same basic way.
+\sidenote{\url{https://dimewiki.worldbank.org/wiki/DataWork_Folder}}
+The first level in this folder structure is what we call survey round folders.
+\sidenote{\url{https://dimewiki.worldbank.org/wiki/DataWork_Survey_Round}}
+You can think of a round as one source of data
+that will be cleaned in the same manner.
+Inside round folders, there will be dedicated folders for raw (encrypted) data;
 for de-identified data; for cleaned data; and for final (constructed) data.
 There will be a folder for raw results, as well as for final outputs.
 The folders that hold code will be organized in parallel to these,
 so that the progression through the whole project can be followed
-by anyone new to the team. Additionally, \texttt{iefolder}
-provides \textbf{master do-files}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Master_Do-files}} so that the entire order
-of the project is maintained in a top-level dofile,
+by anyone new to the team.
+Additionally, \texttt{iefolder} creates \textbf{master do-files}
+\sidenote{\url{https://dimewiki.worldbank.org/wiki/Master_Do-files}}
+so that the entire order of the project is maintained in a top-level script,
 ensuring that no complex instructions are needed
 to exactly replicate the project.

-Master do-files are equally important as a tool to allow all the analysis to
-be executed from one do-file that runs all other files in the correct order,
-as a human readable map to the file and folder structure used for all the
-code. By reading the master do-file anyone not familiar to the project should
-understand which are the main tasks, what are the do-files that execute those
-tasks and where in the project folder they can be found.
+% Master do file
+Master scripts allow all the project code to be executed
+from one file.
+This file connects code and folder structure through globals or path objects,
+runs all other code files in the correct order,
+and tracks inputs and outputs for each of them.
+It is a key component of data management:
+a human-readable map to the file and folder structure used for the whole project.
+By reading the master do-file, anyone not familiar with the project should
+understand what the main tasks are,
+which scripts execute those tasks,
+and where in the project folder they can be found.
+This way, when you or someone else finds a problem or wants to make a change,
+the master script should contain all the information needed
+to know which files to open (see the sketch below).

-Raw data should contain only materials that are received directly from the field.
-These datasets will invariably come in a host of file formats
+% Version control ----------------------------------------------
+Finally, everything that can be version-controlled should be.
+Version control allows you to effectively manage the history of your code,
+including tracking the addition and deletion of files.
+This lets you get rid of code you no longer need
+simply by deleting it in the working directory,
+and allows you to recover those chunks easily
+if you ever need to get back previous work.
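+To make these ideas concrete, a master do-file can be as simple as the sketch below; all folder names, file names, and the Stata version number are illustrative and should be adapted to your project:
+\begin{verbatim}
+* Master do-file: maps the folder structure and runs every stage in order
+clear all
+set more off
+version 15    // illustrative; use the version your team has agreed on
+
+* Project root -- the only line a new team member needs to edit
+global projectfolder "C:/Users/me/Dropbox/project-name"
+
+* Folder globals used by every other script
+global dofiles   "${projectfolder}/DataWork/Baseline/Dofiles"
+global rawdata   "${projectfolder}/DataWork/Baseline/DataSets/Raw"
+global cleandata "${projectfolder}/DataWork/Baseline/DataSets/Intermediate"
+global finaldata "${projectfolder}/DataWork/Baseline/DataSets/Final"
+global outputs   "${projectfolder}/DataWork/Baseline/Output"
+
+* Run each stage; the comments double as the map of inputs and outputs
+do "${dofiles}/Cleaning/clean-baseline.do"        // raw -> cleaned data
+do "${dofiles}/Construct/construct-baseline.do"   // cleaned -> analysis data
+do "${dofiles}/Analysis/summary-statistics.do"    // analysis data -> tables
+\end{verbatim}
+Running this one file reproduces the entire data work from raw data to outputs.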
+Both analysis results and data sets will change with the code.
+You should have each of them stored with the code that created it.
+If you are writing your code in Git/GitHub,
+you can output plain text files such as \texttt{.tex} tables,
+and metadata saved in \texttt{.txt} or \texttt{.csv}, to that directory.
+Binary files that compile the tables, as well as the complete data sets,
+on the other hand, are stored in your team's shared folder.
+Whenever edits are made to data cleaning and data construction,
+use the master script to run all the code for your project.
+Git will then highlight the changes in data sets and results that those edits entail.
+
+%------------------------------------------------
+
+\section{Data cleaning}
+
+% intro: what is data cleaning -------------------------------------------------
+Data cleaning is the first stage towards transforming
+the data received from the field into data that can be analyzed.
+\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data_Cleaning}}
+\index{data cleaning}
+The cleaning process involves
+(1) making the data set easily usable and understandable, and
+(2) documenting individual data points and patterns that may bias the analysis.
+The underlying data structure is unchanged:
+it should contain only data that was collected in the field,
+without any modifications to data points,
+except for corrections of mistaken entries.
+Cleaning is probably the most time-consuming of the stages discussed in this chapter.
+This is the time when you obtain an extensive understanding of
+the contents and structure of the data that was collected.
+You should use this time to understand the types of responses collected,
+both within each survey question and across respondents.
+Knowing your data set well will make it possible to do analysis.
+
+% Deidentification ------------------------------------------------------------------
+The initial input for data cleaning is the raw data.
+It should contain only materials that are received directly from the field.
+They will invariably come in a host of file formats
+and nearly always contain personally-identifying information.
+These files should be retained in the raw data folder
 \textit{exactly as they were received},
-including the precise filename that was submitted,
-along with detailed documentation about the source and contents
-of each of the files. This data must be encrypted
-if it is shared in an insecure fashion,
+and the folder must be encrypted if it is shared in an insecure fashion,\sidenote{\url{https://dimewiki.worldbank.org/wiki/Encryption}}
 and it must be backed up in a secure offsite location.
 Everything else can be replaced, but raw data cannot.
 Therefore, raw data should never be interacted with directly.
-Instead, the first step upon receipt of data is \textbf{de-identification}.\sidenote{\url{https://dimewiki.worldbank.org/wiki/De-identification}}
-\index{de-identification}
-There will be a code folder and a corresponding data folder
-so that it is clear how de-identification is done and where it lives.
-Typically, this process only involves stripping fields from raw data,
-naming, formatting, and optimizing the file for storage,
-and placing it as a \texttt{.dta} or other data-format file
-into the de-identified data folder.
-The underlying data structure is unchanged:
-it should contain only fields that were collected in the field,
-without any modifications to the responses collected there.
-This creates a survey-based version of the file that is able
-to be shared among the team without fear of data corruption or exposure.
-Only a core set of team members will have access to the underlying
-raw data necessary to re-generate these datasets,
-since the encrypted raw data will be password-protected.
-The de-identified data will therefore be the underlying source
+
+Secure storage of the raw data
+\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data_Security}}
+means access to it will be restricted even inside the research team,
+and opening it can be a bureaucratic process.
+To facilitate the handling of the data,
+any personally identifiable information should be removed from the data set,
+and a de-identified data set can be saved in a non-encrypted folder.
+De-identification, at this stage, means stripping the data set of sensitive fields
+such as names, phone numbers, addresses and geolocations.\sidenote{\url{https://dimewiki.worldbank.org/wiki/De-identification}}
+The resulting de-identified data will be the underlying source
 for all cleaned and constructed data.
+Because the identifying information contained in the raw data
+\index{personally-identifying information}
+is typically only used during data collection,
+to find interviewees and confirm their identity,
+de-identification should not affect the usability of the data.
+In fact, in most cases identifying information can be converted
+into non-identified variables for analysis
+(e.g. GPS coordinates can be translated into distances).
+However, if sensitive information is strictly needed for analysis,
+all the tasks described in this chapter must be performed in a secure environment.
+
+% Unique ID and data entry corrections ---------------------------------------------
+The next two tasks in data cleaning are typically done during data collection,
+as part of data quality monitoring.
+They are: ensuring that observations are uniquely and fully identified,
+\sidenote{\url{https://dimewiki.worldbank.org/wiki/ID_Variable_Properties}}
+and correcting mistakes in data entry.
+Though modern survey tools create unique observation identifiers,
+that is not the same as having a unique ID variable for each individual in the sample
+that can be cross-referenced with other records, such as the Master Data Set\sidenote{\url{https://dimewiki.worldbank.org/wiki/Master_Data_Set}}
+and other rounds of data collection.
+\texttt{ieduplicates} and \texttt{iecompdup},
+two commands included in the \texttt{ietoolkit} package,
+create an automated workflow to identify, correct and document
+occurrences of duplicate entries.
+During data quality monitoring, mistakes in data entry may be found,
+including typos and inconsistent values.
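+Such corrections are usually applied in the cleaning script itself; a minimal sketch, with made-up IDs, variable names, and values, might look like this:
+\begin{verbatim}
+* Confirm that the household ID uniquely identifies observations
+isid hhid
+
+* Apply corrections reported and confirmed by the field team
+* (IDs and values below are made up for illustration)
+replace head_age  = 42 if hhid == 10453  // typo during data entry, see field log
+replace plot_area = .  if hhid == 20871  // value confirmed to be a mistaken entry
+\end{verbatim}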
+These mistakes should be fixed during data cleaning,
+and you should keep a careful record of how they were identified
+and how the correct value was obtained.
+
+% Data description ------------------------------------------------------------------
+Note that if you are using secondary data,
+the tasks described above are most likely unnecessary for you.
+However, the last step of data cleaning, describing the data,
+will probably still be necessary.
+This is a key step, but it can be quite repetitive.
-There are a number of tasks involved in cleaning data.
 The \texttt{iecodebook} command suite, part of \texttt{iefieldkit},
 is designed to make some of the most tedious components of this process,
 such as renaming, relabeling, and value labeling,
 much easier (including in data appending).\sidenote{\url{https://dimewiki.worldbank.org/wiki/iecodebook}}
 \index{iecodebook}
-A data cleaning do-file will load the de-identified data,
-make sure all the fields are named and labelled concisely and descriptively,
-apply corrections to all values in the dataset,
-make sure value labels on responses are accurate and concise,
-and attach any experimental-design data (sampling and randomization)
-back to the clean dataset before saving.
-It will ensure that ID variables\sidenote{\url{https://dimewiki.worldbank.org/wiki/ID_Variable_Properties}} are correctly structured and matching,
-ensure that data storage types are correctly set up
-(including tasks like dropping open-ended responses and encoding strings),
-and make sure that data missingness is appropriately documented
-using other primary inputs from the field.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data_Documentation}}
+We have a few recommendations on how to use this command for data cleaning.
+First, we suggest keeping the same variable names as in the survey instrument,
+so it is easy to connect the two files.
+Variable and value labels should be accurate and concise.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning\#Applying\_Labels}}
+Recodes should be used to turn codes for ``Don't know'', ``Refused to answer'', and
+other non-responses into extended missing values.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning\#Survey\_Codes\_and\_Missing\_Values}}
+String variables need to be encoded, and open-ended responses categorized or dropped\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning\#Strings}}
+(unless you are using qualitative or classification analyses, which are less common).
+Finally, any additional information collected for quality monitoring purposes,
+such as notes and duration fields, can also be dropped using \texttt{iecodebook}.
+
+% Outputs -----------------------------------------------------------------
+
+% Data set
+The most important output of data cleaning is the cleaned data set.
+It should contain the same information as the raw data set,
+with no changes to data points.
+It should also be easily traced back to the survey instrument,
+and be accompanied by a dictionary or codebook.
+Typically, one cleaned data set will be created for each data source.
+Each row in the cleaned data set represents one survey entry or unit of observation.\sidenote{\cite{tidy-data}}
+If the raw data set is very large, or the survey instrument is very complex,
+you may want to break the data cleaning into sub-steps,
+and create intermediate cleaned data sets
+(for example, one per survey module).
+That can be very helpful, but having a single, ``final'' data set
+will help you with sharing and publishing the data.
+To make sure this file doesn't get too big to be handled,
+use commands such as \texttt{compress} in Stata to make sure the data
+is always stored in the most efficient format.
+
+% Documentation
+Throughout the data cleaning process, you will need inputs from the field,
+including enumerator manuals, survey instruments,
+supervisor notes, and data quality monitoring reports.
+These materials are part of what we call data documentation\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data_Documentation}}
+\index{Documentation},
+and should be stored in the corresponding data folder,
+as you will probably need them during analysis and publication.
+Include in the \texttt{Documentation} folder records of any
+corrections made to the data, including to duplicated entries,
+as well as communications from the field where these issues are reported.
+Be very careful not to include sensitive information in
+documentation that is not securely stored,
+or that you intend to release as part of a replication package or data publication.
+
+\section{Indicator construction}
+
+% What is construction -------------------------------------
+Construction is the time to transform the data received from the field
+into data that can be analyzed.
+This is done by creating derived variables
+(binaries, indices, and interactions, to name a few).
+To understand why this is needed, think of a consumption module from a survey.
+There will be separate variables indicating the
+amount of each item in the bundle that was consumed.
+There may be variables indicating the cost of these items.
+You cannot run a meaningful regression on these variables.
+You need to manipulate them into something that has \textit{economic} meaning.
+
+During this process, the data points will typically be reshaped and aggregated
+so that the level of the data set goes from the unit of observation to the unit of analysis.
+In the consumption module example, the unit of observation is one good.
+The survey will probably also include a household composition module,
+where age and gender variables will be defined for each household member.
+The typical unit of analysis for such indicators, on the other hand, is the household.
+So to calculate the final indicator,
+the number of members will have to be aggregated to the household level,
+taking gender and age into account;
+then the expenditure or quantity consumed will be aggregated as well,
+and finally the result of the latter aggregation will be divided by the result of the former.
+
+% Why it is a separate process -------------------------------
+
+% From cleaning
+Construction is done separately from data cleaning for two reasons.
+First, so you have a clear record of the data as it was originally received,
+and of what is the result of processing decisions that could have been made differently.
+Second, because if you have different data sources,
+say a baseline and an endline survey,
+unless the two instruments were exactly the same,
+the data cleaning will differ between the two,
+but you want to make sure that variable definitions are consistent across sources.
+So you want to first merge the data sets and then create the variables only once.
+
+% From analysis
 Data construction is never a finished process.
 It comes ``before'' data analysis only in a limited sense:
 the construction code must be run before the analysis code.
-Typically, however it is the case that the construction and analysis code
+Typically, however, it is the case that the construction and analysis code
 are written concurrently -- both do-files will be open at the same time.
 It is only in the process of writing the analysis
 that you will come to know which constructed variables are necessary,
 and which subsets and alterations of the data are desired.
+Still, construction should be done separately from analysis
+to make sure there is consistency across different pieces of analysis.
+If every do file that creates a table starts by loading a data set,
+then aggregating variables, dropping observations, etc.,
+any change that needs to be made has to be replicated in all do files,
+increasing the chances that at least one of them will use a different
+sample or variable definition.
+
+% What to do during construction -----------------------------------------
+Construction is the step where you face the largest risk
+of making a mistake that will affect your results.
+Keep in mind that details and scales matter.
+It is important to check and double check the value-assignments of questions
+and their scales before constructing new variables based on these.
+Are they in percentages or proportions?
+Are all of the variables you are combining into an index or average in the same scale?
+Are yes or no questions coded as 0 and 1, or 1 and 2?
+This is when you use the knowledge of the data you acquired and
+the documentation you created during the cleaning step.
+
+Adding comments to the code explaining what you are doing and why is crucial here.
+There are always ways for things to go wrong that you never anticipated,
+but two issues to pay extra attention to are missing values and dropped observations.
+If you are subsetting a data set, drop observations explicitly,
+indicating why you are doing that and how many they are.
+Merging, reshaping and aggregating data sets can change both the total number of observations
+and the number of observations with missing values.
+Make sure to read about how each command treats missing observations and,
+whenever possible, add automatic checks in the script that throw an error message
+if the result is changing.
+
+At this point, you will also need to address some of the issues in the data
+that you identified during data cleaning,
+the most common of them being the presence of outliers.
+How to treat outliers is a research question,
+but make sure to note the decision made by the research team,
+and how you came to it.
+Results can be sensitive to the treatment of outliers,
+so try to keep the original variable in the data set
+so you can test how much it affects the estimates.

-You should never attempt to create a ``canonical'' analysis dataset.
-Each analysis dataset should be purpose-built to answer an analysis question.
-It is typical to have many purpose-built analysis datasets:
+% Outputs -----------------------------------------------------------------
+
+% Data set
+The outputs of construction are the data sets that will be used for analysis.
+The level of observation of a constructed data set is the unit of analysis.
+Each data set is purpose-built to answer an analysis question.
+Since different pieces of analysis may require different samples,
+or even different units of observation,
+you may have one or multiple constructed data sets,
+depending on how your analysis is structured.
+So don't worry if you cannot create a single, ``canonical'' analysis dataset.
+It is common to have many purpose-built analysis datasets:
 there may be a \texttt{data-wide.dta},
 \texttt{data-wide-children-only.dta}, \texttt{data-long.dta},
 \texttt{data-long-counterfactual.dta}, and many more as needed.
-Data construction should never be done in analysis files,
-and since much data construction is specific to the planned analysis,
-organizing analysis code into many small files allows that level of specificity
-to be introduced at the correct part of the code process.
-Then, when it comes time to finalize analysis files,
-it is substantially easier to organize and prune that code
-since there is no risk of losing construction code it depends on.
+One thing all constructed data sets should have in common is functionally-named variables.
+As you no longer need to worry about keeping variable names
+consistent with the survey, they should be as intuitive as possible.
+
+% Documentation
+The creation of final data sets and indicators must be carefully recorded,
+not only through comments in the code, but also in documentation.
+Someone unfamiliar with the project should be able to understand
+the contents of the analysis data sets,
+the steps taken to get to them,
+and the decision-making process.
+This should complement the reports and notes
+created during data cleaning,
+for example by recording data patterns such as
+outliers and non-responses,
+creating a detailed account of the data processing.
+
 %------------------------------------------------

 \section{Writing data analysis code}

-Data analysis is the stage of the process to create research outputs.
+% Intro --------------------------------------------------------------
+Data analysis is the stage of the process when research outputs are created.
 \index{data analysis}
-Many introductions to common code skills and analytical frameworks exist:
+Many introductions to common code skills and analytical frameworks exist,
 such as \textit{R for Data Science};\sidenote{\url{https://r4ds.had.co.nz/}}
 \textit{A Practical Introduction to Stata};\sidenote{\url{https://scholar.harvard.edu/files/mcgovern/files/practical_introduction_to_stata.pdf}}
 \textit{Mostly Harmless Econometrics};\sidenote{\url{https://www.researchgate.net/publication/51992844_Mostly_Harmless_Econometrics_An_Empiricist's_Companion}} and
 as these are highly specialized and require field and statistical expertise.
 Instead, we will outline the structure of writing analysis code,
 assuming you have completed the process of data cleaning and construction.
-As mentioned above, typically you will continue to keep
-the data-construction files open, and add to them as needed,
-while writing analysis code.
+
+% Exploratory and final data analysis -----------------------------------------
+Data analysis can be divided into two steps.
+The first, which we will call exploratory data analysis,
+is when you are trying different things and looking for patterns in your data.
+The second step, the final analysis,
+happens when the research has matured,
+and your team has decided what pieces of analysis will make it into the research output.
+
+% Organizing scripts ---------------------------------------------------------
+During exploratory data analysis,
+you will be tempted to write lots of analysis
+into one big, impressive, start-to-finish analysis file.
+Though it's fine to write such a script during a long analysis meeting,
+this practice is error-prone,
+because it subtly encourages poor practices such as
+not clearing the workspace or loading fresh data.
+It's important to take the time to organize scripts in a clean manner and to avoid mistakes.
+
+A well-organized analysis script starts with a completely fresh workspace
+and loads its data directly prior to running that specific analysis.
 This encourages data manipulation to be done upstream (in construction),
 and prevents you from accidentally writing pieces of code
 that depend on each other, leading to the too-familiar
 ``run this part, then that part, then this part'' process.
-Each output should be able to run completely independently
-of all other code sections.
+Each script should run completely independently of all other code.
 You can go as far as coding every output in a separate file.
 There is nothing wrong with code files being short and simple --
-as long as they directly correspond to specific published tables and figures.
+as long as they directly correspond to specific pieces of analysis.
 To accomplish this, you will need to make sure that
 you have an effective system of naming, organization, and version control.
-Version control allows you to effectively manage the history of your code,
-including tracking the addition and deletion of files.
-This lets you get rid of code you no longer need
-simply by deleting it in the working directory,
-and allows you to recover those chunks easily
-if you ever need to get back previous work.
-Therefore, version control supports a tidy workspace.
-(It also allows effective collaboration in this space,
-but that is a more advanced tool not covered here.)
-
-Organizing your workspace effectively is now essential.
-\texttt{iefolder} will have created a \texttt{/Dofiles/Analysis/} folder;
-you should feel free to add additional subfolders as needed.
-It is completely acceptable to have folders for each task,
-and compartmentalize each analysis as much as needed.
-It is always better to have more code files open
-than to have to scroll around repeatedly inside a given file.
 Just like you named each of the analysis datasets,
 each of the individual analysis files should be descriptively named.
 Code files such as \path{spatial-diff-in-diff.do},
 \path{matching-villages.do}, and \path{summary-statistics.do}
 are clear indicators of what each file is doing
 and allow code to be found very quickly.
-Eventually, you will need to compile a ``release package'' -that links the outputs more structurally to exhibits, -but at this stage function is more important than structure. - -Outputs should, whenever possible, be created in lightweight formats. -For graphics,\sidenote{\url{https://dimewiki.worldbank.org/wiki/Checklist:_Reviewing_Graphs}} \texttt{.eps} or \texttt{.tif} files are acceptable; -for tables,\sidenote{\url{https://dimewiki.worldbank.org/wiki/Checklist:_Submit_Table}} \texttt{.tex} is preferred (don't worry about making it -really nicely formatted until you are ready to publish). -Excel \texttt{.xlsx} files are also acceptable, -although if you are working on a large report -they will become cumbersome to update after revisions. -If you are writing your code in Git/GitHub, -you should output results into that directory, -not the directory where data is stored (i.e. Dropbox). -Results will differ across different branches -and will constantly be re-writing each other: -you should have each result stored with the code that created it. + +Analysis files should be as simple as possible, +so you can focus on the econometrics. +Research decisions should be made very explicit in the code, +including clustering, sampling and use of control variables. +As you come a decision on what are the main specifications to be reported, +you can create globals or objects in the master do file +that set these options. +This is a good way to make sure specifications are consistent throughout the analysis, +apart from being very dynamic and making it easy to update all scripts if needed. +It is completely acceptable to have folders for each task, +and compartmentalize each analysis as much as needed. +It is always better to have more code files open +than to have to scroll around repeatedly inside a given file. + +% Outputs ----------------------------------------------------- + +% Exploratory analysis +It's ok to not export each and every table and graph created during exploratory analysis. +Instead, we suggest running them into markdown files using RMarkdown or +the different dynamic document options available in Stata. +This will allow you to update and present results quickly, +while maintaining a record the different analysis explored. +% Final analysis +Final analysis scripts, on the other hand, should export final outputs: +these are ready to be included to a paper or report; and +no manual edits, including formatting, should be necessary after running them. +Manual edits are difficult to replicate, +and you will end up having to make changes to the outputs, +so automating them will save you time by the end of the process. +Don't ever set a workflow that includes copying and pasting results printed in the console. +% Output content +Finally, a final outputs should be self-standing, +meaning they are easy to read and understand +with only the information they contain. +To accomplish this, labels and notes should cover all +relevant information such as +sample, unit of observation, unit of measurement and variable definition. +\sidenote{\url{https://dimewiki.worldbank.org/wiki/Checklist:_Reviewing_Graphs}} +\sidenote{\url{https://dimewiki.worldbank.org/wiki/Checklist:_Submit_Table}} + +% Output formats +Outputs should be saved in accessible and, whenever possible, lightweight formats. 
+% Figures +Accessible means that other people can easily open them -- +in Stata, that would mean always using \texttt{graph export} +to save images as \texttt{.jpg}, \texttt{.png}, \texttt{.pdf}, etc., +instead of \texttt{graph save}, which creates a \texttt{.gph} file +that can only be opened through a Stata installation. +\texttt{.tif} and \texttt{.eps} are two examples of accessible lightweight formats, +and \texttt{.eps} has the added advantage of allowing a designer to edit the images for printing. +% Tables +For tables, \texttt{.tex} is preferred. +A variety of packages in both R and Stata export tables in this format. +Excel \texttt{.xlsx} and \texttt{.csv} files are also acceptable, +although if you are working on a large report they will become cumbersome to update after revisions. + +% Formatting +Don't spend too much time formatting tables and graphs until you are ready to publish. +Polishing final outputs can be a time-consuming process, +and you want to avoid doing it multiple times. +If you need to create a table with a very particular format, +that is not automated by any command you know, +consider writing the table manually +(Stata's \texttt{filewrite}, for example, allows you to do that). +This will allow you to write a cleaner script that focuses on the econometrics, +and not on complicated commands to create and append intermediate matrices. +To avoid cluttering your scripts with formatting and ensure that formatting is consistent across outputs, +define formatting options in an R object or a Stata global and call them when needed. %------------------------------------------------ From fc8c46038c081f5fe79dc4aebcaeba15d2a0f174 Mon Sep 17 00:00:00 2001 From: Luiza Date: Mon, 13 Jan 2020 17:27:25 -0500 Subject: [PATCH 187/854] [ch 6] data management --- chapters/data-analysis.tex | 204 +++++++++++++++++-------------------- 1 file changed, 94 insertions(+), 110 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 377e9de7f..73df583c0 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -1,77 +1,65 @@ %------------------------------------------------ \begin{fullwidth} -Data analysis is hard. Making sense of a data set in such a way -that makes a substantial contribution to scientific knowledge -requires a mix of subject expertise, programming skills, -and statistical and econometric knowledge. -The process of data analysis is therefore typically -a back-and-forth discussion between the various people -who have differing experiences, perspectives, and research interests. -The research assistant usually ends up being the fulcrum for this discussion. -It is the RA's job to translate the data received from the field -into economically meaningful indicators, and to analyze them, -while making sure that code and outputs do not become -tangled and lost over time (typically months or years). +Data analysis is hard. +Transforming raw data into a substantial contribution to scientific knowledge +requires a mix of subject expertise, programming skills, +and statistical and econometric knowledge. +The process of data analysis is, therefore, +a back-and-forth discussion between people +with differing experiences, perspectives, and research interests. +The research assistant usually ends up being the pivot of this discussion. +It is their job to translate the data received from the field into +economically meaningful indicators and to analyze them +while making sure that code and outputs do not become tangled and lost over time. 
When it comes to code, though, analysis is the easy part, -as long as you have organized your data well. -The econometrics behind data analysis is complex, -but since this is not a book on econometrics, -this will focus a on how to prepare your data for analysis. -In fact, most of a Research Assistant time is spent cleaning data -and preparing getting it to the right format. -This chapter outlines how to stay organized -so that you and the team can focus on the research. -If the practices recommended here are adopted, -it becomes much easier to use the commands already implemented -in any statistical software to analyze the data. +as long as you have organized your data well. +Of course, the econometrics behind data analysis is complex, +but this is not a book on econometrics. +Instead, this chapter will focus on how to organize your data work. +Most of a Research Assistant's time is spent cleaning data and getting it into the right format. +When the practices recommended here are adopted, +it becomes much easier to analyze the data +using commands that are already implemented in any statistical software. + \end{fullwidth} %------------------------------------------------ \section{Data management} - -The goal of data management is to organize -code, folders and data sets in such a manner that -it is as easy as possible to follow a project's data work. -In our experience, are four key elements to good data management are: -folder structure, task breakdown, a master script and version control. -A good folder structure organizes files -so that any material can be found when it is needed. -It creates a connection between code, data sets and outputs -that makes it clear how they are connected, -so they can always be traced and revised without massive effort. -This folder structure reflects a task breakdown into clear steps -with well-defined inputs, tasks and outputs, -so it is clear at what point a data set or table was created, -and by which script. -The master do-file connects folder structure and code, -runs all the scripts in the project in one click, -and creates a clear map of the tasks done by each piece of code, -as well as what file it requires and creates. -Finally, version histories and backups enable the team -to make changes without fear of losing information, -and create the ability to check instantly how an edit affected the other files in the project. -so that different work streams can exist simultaneously. +The goal of data management is to organize the components of data work +so it can traced back and revised without massive effort. +In our experience, there are four key elements to good data management: +folder structure, task breakdown, master scripts, and version control. +A good folder structure organizes files so that any material can be found when needed. +It reflects a task breakdown into steps with well-defined inputs, tasks, and outputs. +This breakdown is applied to code, data sets, and outputs. +A master script connects folder structure and code. +It is a one-file summary of your whole project. +Finally, version histories and backups enable the team +to edit files without fear of losing information. +Smart use of version control also allows you to track +how each edit affects other files in the project. % Task breakdown -We divide the processing of raw data to analysis data into three stages: -data cleaning, variable construction, and data analysis. 
-Though they frequently happen simultaneously, -we find that breaking them into separate scripts and data sets helps prevent mistakes. -It will be easier to understand this division in the following sections, -as we discuss what they comprise, -but the point is that each of these stages has well-defined inputs and outputs. -This makes it easier to tracking the tasks implemented in each script, -and avoids duplication of code that could lead to inconsistent results. -For each stage, there will be a code folder and a corresponding data folder -so that it is clear where each of them is implemented and where its outputs live. -The scripts and data sets for each of these should have self-explanatory and connected names. +We divide the process of turning raw data into analysis data in three stages: +data cleaning, variable construction, and data analysis. +Though they are frequently implemented at the same time, +we find that creating separate scripts and data sets prevents mistakes. +It will be easier to understand this division as we discuss what each stage comprises. +What you should know by now is that each of these stages has well-defined inputs and outputs. +This makes it easier to track tasks across scripts, +and avoids duplication of code that could lead to inconsistent results. +For each stage, there should be a code folder and a corresponding data set. +The names of codes, data sets and outputs for each stage should be consistent, +making clear how they relate to one another. +So, for example, a script called \texttt{clean-section-1} would create +a data set called \texttt{cleaned-section-1}. % Folder structure -There are many schemes to organize research data. +There are many schemes to organize research data. Our preferred scheme reflects this task breakdown. \index{data organization} We created the \texttt{iefolder}\sidenote{ @@ -79,63 +67,54 @@ \section{Data management} package (part of \texttt{ietoolkit}\sidenote{ \url{https://dimewiki.worldbank.org/wiki/ietoolkit}}) based on our experience with primary survey data, -but its principles can be used for different types of data. +but it can be used for different types of data. \texttt{iefolder} is designed to standardize folder structures across teams and projects. -This means that PIs and RAs face very small costs -when switching between projects, because all the materials -will be organized in exactly the same basic way. +This means that PIs and RAs face very small costs when switching between projects, +because they are organized in the same way. \sidenote{\url{https://dimewiki.worldbank.org/wiki/DataWork_Folder}} -The first level in this folder are what we call survey round folders. +At the first level of this folder are what we call survey round folders. \sidenote{\url{https://dimewiki.worldbank.org/wiki/DataWork_Survey_Round}} -You can think of a round as one source of data, -that will be cleaned in the same manner. -Inside round folders, there will be dedicated folders for raw (encrypted) data; -for de-identified data; for cleaned data; and for final (constructed) data. -There will be a folder for raw results, as well as for final outputs. -The folders that hold code will be organized in parallel to these, -so that the progression through the whole project can be followed -by anyone new to the team. +You can think of a round as one source of data, +that will be cleaned in the same manner. 
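To make the idea more concrete, a much-simplified sketch of one survey round folder is shown below.
The exact folders and names that \texttt{iefolder} creates are documented on the DimeWiki,
so treat this only as an illustration of the logic, not as the standard itself.
\begin{verbatim}
DataWork/
   Baseline/                 <- one survey round
      DataSets/
         Raw/                <- encrypted, exactly as received
         Intermediate/       <- de-identified and cleaned data
         Final/              <- constructed analysis data
      Dofiles/
         Cleaning/
         Construction/
         Analysis/
      Output/
      Documentation/
\end{verbatim}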
+Inside round folders, there are dedicated folders for +raw (encrypted) data; de-identified data; cleaned data; and final (constructed) data. +There is a folder for raw results, as well as for final outputs. +The folders that hold code are organized in parallel to these, +so that the progression through the whole project can be followed by anyone new to the team. Additionally, \texttt{iefolder} creates \textbf{master do-files} \sidenote{\url{https://dimewiki.worldbank.org/wiki/Master_Do-files}} -so that the entire order of the project is maintained in a top-level script, -ensuring that no complex instructions are needed -to exactly replicate the project. +so all project code is reflected in a top-level script. % Master do file -Master scripts allow all the project code to be executed -from one file. -This file connects code and folder structure through globals or path objects, -runs all other code files in the correct order, -and tracks inputs and outputs for each of them. -It is a key component of data management: -a human-readable map to the file and folder structure used for the whole project. -By reading the master do-file anyone not familiar to the project should -understand which are the main tasks, -what are the scripts that execute those tasks, -and where in the project folder they can be found. -This way, when you or someone else finds a problems or wants to make a change, -the master script should contain all the information needed -to know what files to open. - -% Version control ---------------------------------------------- -Finally, everything that can be version-controlled should be. -Version control allows you to effectively manage the history of your code, -including tracking the addition and deletion of files. -This lets you get rid of code you no longer need -simply by deleting it in the working directory, -and allows you to recover those chunks easily -if you ever need to get back previous work. +Master scripts allow users to execute all the project code from a single file. +It briefly describes what each code, +and maps the files they require and create. +It also connects code and folder structure through globals or objects. +In short, a master script is a human-readable map to the tasks, +files and folder structure that comprise a project. +Having a master script eliminates the need for complex instructions to replicate results. +Reading the master do-file should be enough for anyone who's unfamiliar with the project +to understand what are the main tasks, which scripts execute them, +and where different files can be found in the project folder. +That is, it should contain all the information needed to interact with a project's data work. + +% Version control +Finally, everything that can be version-controlled should be. +Version control allows you to effectively track code edits, +including the addition and deletion of files. +This way you can delete code you no longer need, +and still recover it easily if you ever need to get back previous work. Both analysis results and data sets will change with the code. You should have each of them stored with the code that created it. -If you are writing your code in Git/GitHub, +If you are writing code in Git/GitHub, you can output plain text files such as \texttt{.tex} tables, and meta data saved in \texttt{.txt} or \texttt{.csv} to that directory. -Binary files that compile the tables, as well as the complete data sets, -on the other hand, are stored in your team's shared folder. 
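To see what a master do-file itself can look like, here is a minimal sketch in Stata.
The folder paths and script names are hypothetical placeholders, not part of any package,
and a real master script would also note what each step requires and creates.
\begin{verbatim}
* master.do -- one-click entry point for the project (all names are hypothetical)
clear all
set more off
version 15.1    // fix the Stata version so results are stable across machines

* Connect code and folder structure through globals
global projectfolder "C:/Users/yourname/Dropbox/ProjectName"
global dofiles       "${projectfolder}/DataWork/Dofiles"
global data          "${projectfolder}/DataWork/DataSets"
global outputs       "${projectfolder}/DataWork/Output"

* Run the whole project, one stage at a time
do "${dofiles}/Cleaning/clean-section-1.do"           // creates cleaned-section-1.dta
do "${dofiles}/Construction/construct-consumption.do"
do "${dofiles}/Analysis/summary-statistics.do"        // exports summary-statistics.tex
\end{verbatim}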
-Whenever edits are made to data cleaning and data construction, -we the master script to run all the code for you project. -Git will highlight the changes that were in data sets and results -that they entail. +Binary files that compile the tables, +as well as the complete data sets, on the other hand, +should be stored in your team's shared folder. +Whenever data cleaning or data construction codes are edited, +use the master script to run all the code for your project. +Git will highlight the changes that were in data sets and results that they entail. %------------------------------------------------ @@ -160,6 +139,7 @@ \section{Data cleaning} both within each survey question and across respondents. Knowing your data set well will make it possible to do analysis. + % Deidentification ------------------------------------------------------------------ The initial input for data cleaning is the raw data. It should contain only materials that are received directly from the field. @@ -374,12 +354,13 @@ \section{Indicator construction} or even different units of observation, you may have one or multiple constructed data sets, depending on how your analysis is structured. -So don't worry if you cannot create a single, ``canonical'' analysis dataset. +So don't worry if you cannot create a single, ``canonical'' analysis data set. It is common to have many purpose-built analysis datasets: there may be a \texttt{data-wide.dta}, \texttt{data-wide-children-only.dta}, \texttt{data-long.dta}, \texttt{data-long-counterfactual.dta}, and many more as needed. -One thing all constructed data sets should have in common are functionally-named variables. +One thing all constructed data sets should have in common, though, +are functionally-named variables. As you no longer need to worry about keeping variable names consistent with the survey, they should be as intuitive as possible. @@ -436,7 +417,7 @@ \section{Writing data analysis code} A well-organized analysis script starts with a completely fresh workspace and load its data directly prior to running that specific analysis. -This encourages data manipulation to be done upstream (in construction), +This encourages data manipulation to be done upstream (that is, during construction), and prevents you from accidentally writing pieces of code that depend on each other, leading to the too-familiar ``run this part, then that part, then this part'' process. @@ -456,8 +437,11 @@ \section{Writing data analysis code} Analysis files should be as simple as possible, so you can focus on the econometrics. -Research decisions should be made very explicit in the code, -including clustering, sampling and use of control variables. +The first thing any analysis code does is to load a data set +and explicitly set the analysis sample. +In fact, all research decisions, not only the sampling, +should be made very explicit in the code. +This includes clustering, sampling and use of control variables. As you come a decision on what are the main specifications to be reported, you can create globals or objects in the master do file that set these options. 
From aaad5de74af7c31ce1efcb4aefc3065232d3269d Mon Sep 17 00:00:00 2001 From: Luiza Date: Mon, 13 Jan 2020 18:30:38 -0500 Subject: [PATCH 188/854] [ch6] construction solves #273 and part of #95 --- chapters/data-analysis.tex | 106 +++++++++++++++---------------------- 1 file changed, 42 insertions(+), 64 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 73df583c0..3e8af63ce 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -258,35 +258,36 @@ \section{Data cleaning} \section{Indicator construction} % What is construction ------------------------------------- -Construction is the time to transform the data received from the field -into data that can be analyzed. +Any changes to the original data set happen during construction. +It is at this stage that the raw data is transformed into analysis data. This is done by creating derived variables -(binaries, indices, and interactions, to name a few) -To understand why this is needed, think of a consumption module from a survey. -There will be separate variables indicating the +(binaries, indices, and interactions, to name a few). +To understand why construction is necessary, +let's take the example of a survey's consumption module. +It will result in separate variables indicating the amount of each item in the bundle that was consumed. There may be variables indicating the cost of these items. You cannot run a meaningful regression on these variables. You need to manipulate them into something that has \textit{economic} meaning. -During this process, the data points will typically be reshaped and aggregated -so that level of the data set goes from the unit of observation to the unit of analysis. +\textcolor{red}{During this process, the data points will typically be reshaped and aggregated +so that level of the data set goes from the unit of observation in the survey to the unit of analysis. +\sidenote{\url{https://dimewiki.worldbank.org/wiki/Unit_of_Observation}} In the consumption module example, the unit of observation is one good. The survey will probably also include a household composition module, where age and gender variable will be defined for each household member. -The typical unit of analysis for such indicators, on the other hand, is the household. +The typical unit of analysis for such consumption indices, on the other hand, is the household. So to calculate the final indicator, -the number members will have to be aggregated to the household level, -taking gender and age into account; -then the expenditure or quantity consumed will be aggregated as well, -and finally the result of the latter aggregation will be divided by the result of the former. +the number of household members will be aggregated to the household level; +then the expenditure or quantity consumed will be aggregated as well; +and finally the result of the latter aggregation will be divided by the result of the former.} % Why it is a separate process ------------------------------- % From cleaning Construction is done separately from data cleaning for two reasons. First, so you have a clear cut of what was the data originally received, -and what it the result of data processing that could have been done differently. +and what is the result of data processing decisions. 
Second, because if you have different data sources, say a baseline and an endline survey, unless the two instruments were exactly the same, @@ -296,53 +297,33 @@ \section{Indicator construction} % From analysis Data construction is never a finished process. -It comes ``before'' data analysis only in a limited sense: -the construction code must be run before the analysis code. -Typically, however, it is the case that the construction and analysis code -are written concurrently -- both do-files will be open at the same time. -It is only in the process of writing the analysis -that you will come to know which constructed variables are necessary, -and which subsets and alterations of the data are desired. -Still, construction should be done separately from analysis -to make sure there is consistency across different pieces of analysis. -If every do file that creates a table starts by loading a data set, -then aggregating variables, dropping observation, etc., -any change that needs to be made has to be replicated in all do files, -increasing the chances that at least one of them will use a different -sample or variable definition. +It comes ``before'' data analysis only in a limited sense: the construction code must be run before the analysis code. +Typically, however, construction and analysis code are written concurrently. +As you write the analysis, different constructed variables will become necessary, as well as subsets and other alterations to the data. +Still, constructing variables in a separate script from the analysis will help you ensure consistency across different outputs. +If every script that creates a table starts by loading a data set, subsetting it and manipulating variables, any edits to construction need to be replicated in all scripts. +This increases the chances that at least one of them will have a different sample or variable definition. + % What to do during construction ----------------------------------------- -Construction is the step where you face the largest risk -of making a mistake that will affect your results. +Construction is the step where you face the largest risk of making a mistake that will affect your results. Keep in mind that details and scales matter. -It is important to check and double check the value-assignments of questions -and their scales before constructing new variables based on these. +It is important to check and double-check the value-assignments of questions and their scales before constructing new variables using them. Are they in percentages or proportions? -Are all of the variables you are combining into an index or average in the same scale? +Are all variables you are combining into an index or average using the same scale? Are yes or no questions coded as 0 and 1, or 1 and 2? -This is when you use the knowledge of the data you acquired and -the documentation you created during the cleaning step. +This is when you will use the knowledge of the data you acquired and the documentation you created during the cleaning step the most. Adding comments to the code explaining what you are doing and why is crucial here. -There are always ways for things to go wrong that you never anticipated, -but two issues to pay extra attention to are missing values and dropped observations. -If you are subsetting a data set, drop observations explicitly, -indicating why you are doing that and how many they are. 
-Merging, reshaping and aggregating data sets can change both the total number of observations -and the number of observations with missing values. -Make sure to read about how each command treats missing observations and, -whenever possible, add automatic checks in the script that throw an error message -if the result is changing. - -At this point, you will also need to address some of the issues in the data -that you identified during data cleaning, -the most common of them being the presence of outliers. -How to treat outliers is a research question, -but make sure to note what we the decision made by the research team, -and how you came to it. -Results can be sensitive to the treatment of outliers, -so try to keep the original variable in the data set -so you can test how much it affects the estimates. +There are always ways for things to go wrong that you never anticipated, but two issues to pay extra attention to are missing values and dropped observations. +If you are subsetting a data set, drop observations explicitly, indicating why you are doing that and how the data set changed. +Merging, reshaping and aggregating data sets can change both the total number of observations and the number of observations with missing values. +Make sure to read about how each command treats missing observations and, whenever possible, add automated checks in the script that throw an error message if the result is changing. + +At this point, you will also need to address some of the issues in the data that you identified during data cleaning. +The most common of them is the presence of outliers. +How to treat outliers is a research question, but make sure to note what we the decision made by the research team, and how you came to it. +Results can be sensitive to the treatment of outliers, so keeping the original variable in the data set will allow you to test how much it affects the estimates. % Outputs ----------------------------------------------------------------- @@ -365,17 +346,14 @@ \section{Indicator construction} consistent with the survey, they should be as intuitive as possible. % Documentation -The creation of final data sets and indicators must be carefully recorded, -not only through comments in the code, but also in documentation. -Someone unfamiliar with the project should be able to understand -the contents of the analysis data sets, -the steps taken to get to them, -and the decision-making process. -This should complement the reports and notes -created during data cleaning, -for example recording data patterns such as -outliers and non-responses, -creating a detailed account of the data processing. +It is wise to start an explanatory guide as soon as you start making changes to the data. +Carefully record how specific variables have been combined, recoded, and scaled. +This can be part of a wider discussion with your team about creating protocols for variable definition. +That will guarantee that indicators are defined consistently across projects. +Documentation is an output of construction as relevant as the codes. +Someone unfamiliar with the project should be able to understand the contents of the analysis data sets, the steps taken to create them, and the decision-making process through your documentation. +The construction documentation will complement the reports and notes created during data cleaning. +Together, they will form a detailed account of the data processing. 
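As a concrete illustration of these construction principles -- aggregating from the unit of observation
to the unit of analysis, being explicit about missing values, and adding automated checks --
here is a minimal sketch in Stata.
The file paths and variable names are hypothetical,
and the treatment of missing quantities shown here is exactly the kind of decision
your team should record in the construction documentation.
\begin{verbatim}
* construct-consumption.do -- item-level records to a household-level indicator
* (hypothetical file paths and variable names)
use "${data}/Intermediate/cleaned-consumption.dta", clear  // one row per household x item

* Be explicit about missing values before aggregating:
* collapse (sum) ignores missing quantities, so flag them first
count if missing(quantity_consumed)
display as text r(N) " item records have a missing quantity"

* Aggregate from the unit of observation (item) to the unit of analysis (household)
collapse (sum) total_consumption = quantity_consumed, by(hh_id)
label variable total_consumption "Total quantity consumed across all items"

* Automated check: the constructed data set must be uniquely identified by household
isid hh_id

save "${data}/Final/constructed-consumption.dta", replace
\end{verbatim}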
From b3d5161442b76c94914b462a1231754837ea820f Mon Sep 17 00:00:00 2001 From: Luiza Date: Mon, 13 Jan 2020 20:49:28 -0500 Subject: [PATCH 189/854] [ch6] data cleaning solves #302, #278, #95 --- chapters/data-analysis.tex | 111 +++++++++++++++++++------------------ 1 file changed, 57 insertions(+), 54 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 3e8af63ce..431f3acd1 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -123,85 +123,82 @@ \section{Data cleaning} % intro: what is data cleaning ------------------------------------------------- Data cleaning is the first stage towards transforming the data received from the field into data that can be analyzed. +Data cleaning is the first stage of transforming the data you received from the field into data that you can analyze. \sidenote{\url{https://dimewiki.worldbank.org/wiki/Data_Cleaning}} -\index{data cleaning} -The cleaning process involves -(1) making the data set easily usable and understandable, and -(2) documenting individual data points and patterns that may bias the analysis. -The underlying data structure is unchanged: -it should contain only data that was collected in the field, -without any modifications to data points, -except for corrections of mistaken entries. +The cleaning process involves (1) making the data set easily usable and understandable, and (2) documenting individual data points and patterns that may bias the analysis. +The underlying data structure does not change. +The cleaned data set should contain only the data collected in the field. +No modifications to data points are made at this stage, except for corrections of mistaken entries. + Cleaning is probably the most time consuming of the stages discussed in this chapter. -This is the time when you obtain an extensive understanding of -the contents and structure of the data that was collected. -You should use this time to understand the types of responses collected, -both within each survey question and across respondents. +This is the time when you obtain an extensive understanding of the contents and structure of the data that was collected. +Explore your data set using tabulations, summaries, and descriptive plots. +You should use this time to understand the types of responses collected, both within each survey question and across respondents. Knowing your data set well will make it possible to do analysis. % Deidentification ------------------------------------------------------------------ The initial input for data cleaning is the raw data. It should contain only materials that are received directly from the field. -They will invariably come in a host of file formats -and nearly always contain personally-identifying information. -These files should be retained in the raw data folder -\textit{exactly as they were received}, +They will invariably come in a host of file formats and nearly always contain personally-identifying information.\index{personally-identifying information} +These files should be retained in the raw data folder \textit{exactly as they were received}. The folder must be encrypted if it is shared in an insecure fashion, \sidenote{\url{https://dimewiki.worldbank.org/wiki/Encryption}}. and it must be backed up in a secure offsite location. Everything else can be replaced, but raw data cannot. Therefore, raw data should never be interacted with directly. 
-Secure storage of the raw data -\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data_Security}}, -means access to it will be restricted even inside the research team, -and opening it can be a bureaucratic process. -To facilitate the handling of the data, -any personally identifiable information should be removed from the data set, -and a de-identified data set can be saved in a non-encrypted folder. -De-identification, at this stage, mean stripping the data set of sensitive fields -such as names, phone numbers, addresses and geolocations.\sidenote{\url{https://dimewiki.worldbank.org/wiki/De-identification}} -The resulting de-identified data will be the underlying source -for all cleaned and constructed data. -Because the identifying information contained in the raw data -\index{personally-identifying information} -is typically only used during data collection, -to find and confirm their identity of interviewees, +Secure storage of the raw +\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data_Security}} +data means access to it will be restricted even inside the research team. +Loading encrypted data multiple times it can be annoying. +To facilitate the handling of the data, remove any personally identifiable information from the data set. +This will create a de-identified data set, that can be saved in a non-encrypted folder. +De-identification, +\sidenote{\url{https://dimewiki.worldbank.org/wiki/De-identification}} +at this stage, means stripping the data set of direct identifiers such as names, phone numbers, addresses, and geolocations. +\sidenote{\url{ https://www.povertyactionlab.org/sites/default/files/resources/J-PAL-guide-to-deidentifying-data.pdf }} +The resulting de-identified data will be the underlying source for all cleaned and constructed data. +Because identifying information is typically only used during data collection, +to find and confirm the identity of interviewees, de-identification should not affect the usability of the data. -In fact, most cases identifying information can be converted -into non-identified variables for analysis +In fact, most identifying information can be converted into non-identified variables for analysis purposes (e.g. GPS coordinates can be translated into distances). -However, if sensitive information is strictly needed for analysis, +However, if sensitive information is strictly needed for analysis, all the tasks described in this chapter must be performed in a secure environment. - % Unique ID and data entry corrections --------------------------------------------- -The next two tasks in data cleaning are typically done during the data collection, -as part of data quality monitoring. -They are: ensuring that observations are uniquely and fully identified, -\sidenote{\url{https://dimewiki.worldbank.org/wiki/ID_Variable_Properties}} -and correcting mistakes in data entry. -Though modern survey tools create unique observation identifiers, -that is not the same as having a unique ID variable for each individual in the sample +There are two main cases when the raw data will be modified during data cleaning. +The first one is when there are duplicated entries in the data. +Ensuring that observations are uniquely and fully identified\sidenote{\url{https://dimewiki.worldbank.org/wiki/ID_Variable_Properties}} +is possibly the most important step in data cleaning +(as anyone who ever tried to merge data sets that are not uniquely identified knows). +Modern survey tools create unique observation identifiers. 
+That, however, is not the same as having a unique ID variable for each individual in the sample. +You want to make sure the data set has a unique ID variable that can be cross-referenced with other records, such as the Master Data Set\sidenote{\url{https://dimewiki.worldbank.org/wiki/Master_Data_Set}} and other rounds of data collection. \texttt{ieduplicates} and \texttt{iecompdup}, two commands included in the \texttt{ietoolkit} package, create an automated workflow to identify, correct and document occurrences of duplicate entries. -During data quality monitoring, mistakes in data entry may be found, -including typos and inconsistent values. -Theses mistakes should be fixed during data cleaning, -and you should keep a careful record of how they were identified, and -how the correct value was obtained. + +Looking for duplicated entries is usually part of data quality monitoring, +as is the only other reason to change the raw data during cleaning: +correcting mistakes in data entry. +During data quality monitoring, you will inevitably encounter data entry mistakes, +such as typos and inconsistent values. +If you don't, you are probably not doing a very good job at looking for them. +These mistakes should be fixed in the cleaned data set, +and you should keep a careful record of how they were identified, +and how the correct value was obtained. % Data description ------------------------------------------------------------------ Note that if you are using secondary data, -the tasks described above are most likely unnecessary to you. +the tasks described above can likely be skipped. However, the last step of data cleaning, describing the data, will probably still be necessary. -This is a key step, but can be quite repetitive. +This is a key step to making the data easy to use, but it can be quite repetitive. The \texttt{iecodebook} command suite, part of \texttt{iefieldkit}, is designed to make some of the most tedious components of this process, such as renaming, relabeling, and value labeling, @@ -210,13 +207,16 @@ \section{Data cleaning} We have a few recommendations on how to use this command for data cleaning. First, we suggest keeping the same variable names as in the survey instrument, so it's easy to connect the two files. +Don't skip the labelling! +Applying labels makes it easier to understand what the data is showing while exploring the data. +This minimizes the risk of small errors making their way through into the analysis stage. Variable and value labels should be accurate and concise.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning\#Applying\_Labels}} Recodes should be used to turn codes for "Don't know", "Refused to answer", and other non-responses into extended missing values.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning\#Survey\_Codes\_and\_Missing\_Values}} String variables need to be encoded, and open-ended responses, categorized or dropped\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning\#Strings}} (unless you are using qualitative or classification analyses, which are less common). -Finally, any additional information collected for quality monitoring purposes, -such as notes and duration field, can also be dropped using \texttt{iecodebook}. +Finally, any additional information collected only for quality monitoring purposes, +such as notes and duration field, can also be dropped. 
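The snippet below sketches what these cleaning recommendations can look like in practice.
The variable names and numeric codes are placeholders;
your own survey instrument defines which codes actually need to be recoded.
\begin{verbatim}
* Hypothetical cleaning snippet: labels and survey codes
label variable income "Monthly household income (local currency)"

* Turn non-response codes into extended missing values
replace income = .d if income == -888   // "Don't know"
replace income = .r if income == -999   // "Refused to answer"

* Label categorical values so tabulations are readable
label define yesno 0 "No" 1 "Yes", replace
label values received_transfer yesno
\end{verbatim}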
% Outputs ----------------------------------------------------------------- @@ -233,8 +233,9 @@ \section{Data cleaning} you may want to break the data cleaning into sub-steps, and create intermediate cleaned data sets (for example, one per survey module). -That can be very helpful, but having a single, ``final", data set -will help you with sharing and publishing the data. +Breaking cleaned data sets into the smallest unit of observation inside a roster +make the cleaning faster and the data easier to handle during construction. +But having a single cleaned data set will help you with sharing and publishing the data. To make sure this file doesn't get too big to be handled, use commands such as \texttt{compress} in Stata to make sure the data is always stored in the most efficient format. @@ -246,11 +247,13 @@ \section{Data cleaning} These materials are part of what we call data documentation \sidenote{\url{https://dimewiki.worldbank.org/wiki/Data_Documentation}} \index{Documentation}, -and should be stored in the corresponding data folder, +and should be stored in the corresponding folder, as you will probably need them during analysis and publication. Include in the \texttt{Documentation} folder records of any corrections made to the data, including to duplicated entries, as well as communications from the field where theses issues are reported. +Make sure to also have a record of potentially problematic patterns you noticed +while exploring the data, such as outliers and variables with many missing values. Be very careful not to include sensitive information in documentation that is not securely stored, or that you intend to release as part of a replication package or data publication. From 85d8a2fd2b1471309e6ffbf84b7b17a46be1329c Mon Sep 17 00:00:00 2001 From: Luiza Date: Mon, 13 Jan 2020 20:50:06 -0500 Subject: [PATCH 190/854] [ch6] missed two lines in the last commit --- chapters/data-analysis.tex | 2 -- 1 file changed, 2 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 431f3acd1..71478fe2d 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -121,8 +121,6 @@ \section{Data management} \section{Data cleaning} % intro: what is data cleaning ------------------------------------------------- -Data cleaning is the first stage towards transforming -the data received from the field into data that can be analyzed. Data cleaning is the first stage of transforming the data you received from the field into data that you can analyze. \sidenote{\url{https://dimewiki.worldbank.org/wiki/Data_Cleaning}} The cleaning process involves (1) making the data set easily usable and understandable, and (2) documenting individual data points and patterns that may bias the analysis. From ad0e83dec39e5cf6fb17d3008b51cfcda7530cd7 Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 14 Jan 2020 11:21:51 -0500 Subject: [PATCH 191/854] [ch6] analysis solves #77 and #275 --- chapters/data-analysis.tex | 238 ++++++++++++++----------------------- 1 file changed, 87 insertions(+), 151 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 71478fe2d..ddf0f62f5 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -270,18 +270,10 @@ \section{Indicator construction} There may be variables indicating the cost of these items. You cannot run a meaningful regression on these variables. You need to manipulate them into something that has \textit{economic} meaning. 
- -\textcolor{red}{During this process, the data points will typically be reshaped and aggregated +During this process, the data points will typically be reshaped and aggregated so that level of the data set goes from the unit of observation in the survey to the unit of analysis. \sidenote{\url{https://dimewiki.worldbank.org/wiki/Unit_of_Observation}} -In the consumption module example, the unit of observation is one good. -The survey will probably also include a household composition module, -where age and gender variable will be defined for each household member. -The typical unit of analysis for such consumption indices, on the other hand, is the household. -So to calculate the final indicator, -the number of household members will be aggregated to the household level; -then the expenditure or quantity consumed will be aggregated as well; -and finally the result of the latter aggregation will be divided by the result of the former.} +To use the same example, the data on quantity consumed was collect for each item, and needs to be aggregated to the household level before analysis. % Why it is a separate process ------------------------------- @@ -363,170 +355,114 @@ \section{Indicator construction} \section{Writing data analysis code} % Intro -------------------------------------------------------------- -Data analysis is the stage of the process when research outputs are created. +Data analysis is the stage when research outputs are created. \index{data analysis} Many introductions to common code skills and analytical frameworks exist, such as \textit{R for Data Science};\sidenote{\url{https://r4ds.had.co.nz/}} \textit{A Practical Introduction to Stata};\sidenote{\url{https://scholar.harvard.edu/files/mcgovern/files/practical_introduction_to_stata.pdf}} -\textit{Mostly Harmless Econometrics};\sidenote{\url{https://www.researchgate.net/publication/51992844_Mostly_Harmless_Econometrics_An_Empiricist's_Companion}} and -\textit{Causal Inference: The Mixtape}.\sidenote{\url{http://scunning.com/mixtape.html}} -This section will not include instructions on how to conduct specific analyses, -as these are highly specialized and require field and statistical expertise. +\textit{Mostly Harmless Econometrics};\sidenote{\url{https://www.researchgate.net/publication/51992844_Mostly_Harmless_Econometrics_An_Empiricist's_Companion}} +and \textit{Causal Inference: The Mixtape}.\sidenote{\url{http://scunning.com/mixtape.html}} +This section will not include instructions on how to conduct specific analyses. +That is a research question, and requires expertise beyond the scope of this book. Instead, we will outline the structure of writing analysis code, assuming you have completed the process of data cleaning and construction. % Exploratory and final data analysis ----------------------------------------- -Data analysis can be divided into two steps. -The first, which we will call exploratory data analysis, -is when you are trying different things and looking for patterns in your data. -The second step, the final analysis, -happens when the research has matured, -and your team has decided what pieces of analysis will make into the research output. +The analysis stage usually starts with a process we call exploratory data analysis. +This is when you are trying different things and looking for patterns in your data. +It progresses into final analysis when your team starts to decide what are the main results, those that will make it into the research output. 
+The way you deal with code and outputs for exploratory and final analysis is different, and this section will discuss how. % Organizing scripts --------------------------------------------------------- -During exploratory data analysis, -you will be tempted to write lots of analysis -into one big, impressive, start-to-finish analysis file. -Though it's fine to write such a script during a long analysis meeting, -this practice is error-prone, -because it subtly encourages poor practices such as -not clearing the workspace or loading fresh data. +During exploratory data analysis, you will be tempted to write lots of analysis into one big, impressive, start-to-finish script. +Though it's fine to write such a script during a long analysis meeting, this practice is error-prone. +It subtly encourages poor practices such as not clearing the workspace or loading fresh data. It's important to take the time to organize scripts in a clean manner and to avoid mistakes. -A well-organized analysis script starts with -a completely fresh workspace and load its data directly -prior to running that specific analysis. -This encourages data manipulation to be done upstream (that is, during construction), -and prevents you from accidentally writing pieces of code -that depend on each other, leading to the too-familiar -``run this part, then that part, then this part'' process. +A well-organized analysis script starts with a completely fresh workspace and explicitly loads data before analyzing it. +This encourages data manipulation to be done earlier in the workflow (that is, during construction). +It also and prevents you from accidentally writing pieces of code that depend on one another, leading to the too-familiar ``run this part, then that part, then this part'' process. Each script should run completely independently of all other code. You can go as far as coding every output in a separate file. -There is nothing wrong with code files being short and simple -- -as long as they directly correspond to specific pieces of analysis. - -To accomplish this, you will need to make sure that you -have an effective system of naming, organization, and version control. -Just like you named each of the analysis datasets, -each of the individual analysis files should be descriptively named. -Code files such as \path{spatial-diff-in-diff.do}, -\path{matching-villages.do}, and \path{summary-statistics.do} -are clear indicators of what each file is doing -and allow code to be found very quickly. - -Analysis files should be as simple as possible, -so you can focus on the econometrics. -The first thing any analysis code does is to load a data set -and explicitly set the analysis sample. -In fact, all research decisions, not only the sampling, -should be made very explicit in the code. -This includes clustering, sampling and use of control variables. -As you come a decision on what are the main specifications to be reported, -you can create globals or objects in the master do file -that set these options. -This is a good way to make sure specifications are consistent throughout the analysis, -apart from being very dynamic and making it easy to update all scripts if needed. -It is completely acceptable to have folders for each task, -and compartmentalize each analysis as much as needed. -It is always better to have more code files open -than to have to scroll around repeatedly inside a given file. 
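To illustrate how globals can keep research decisions explicit and consistent across analysis scripts,
here is a minimal sketch; the variable and file names are hypothetical placeholders.
\begin{verbatim}
* In the master do-file: define the main specification once
global controls    "hh_size head_age head_education"
global fe          "i.district"
global cluster_var "village_id"

* In any analysis script: every regression reuses the same globals
use "${data}/Final/analysis-data-wide.dta", clear
regress outcome treatment ${controls} ${fe}, vce(cluster ${cluster_var})
\end{verbatim}
If the team later changes the control set or the clustering level,
editing the master script updates every analysis output at once.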
- -% Outputs ----------------------------------------------------- - +There is nothing wrong with code files being short and simple -- as long as they directly correspond to specific pieces of analysis. + +Analysis files should be as simple as possible, so you can focus on the econometrics. +All research decisions should be made very explicit in the code. +This includes clustering, sampling, and control variables, to name a few. +If you have multiple analysis data sets, each of them should have a descriptive name about its sample and unit of observation. +As your team comes to a decision about model specification, you can create globals or objects in the master script to use across scripts. +This is a good way to make sure specifications are consistent throughout the analysis. It's also very dynamic, making it easy to update all scripts if needed. +It is completely acceptable to have folders for each task, and compartmentalize each analysis as much as needed. +It is always better to have more code files open than to keep scrolling inside a given file. + +To accomplish this, you will need to make sure that you have an effective data management system, including naming, file organization, and version control. +Just like you did with each of the analysis datasets, name each of the individual analysis files descriptively. +Code files such as \path{spatial-diff-in-diff.do}, \path{matching-villages.do}, and \path{summary-statistics.do} +are clear indicators of what each file is doing, and allow you to find code quickly. + +% Self-promotion ------------------------------------------------ +Out team has created a few products to automate common outputs and save you precious research time. +The \texttt{ietoolkit} package includes two commands to export nicely formatted tables. +\texttt{iebaltab}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Iebaltab}} creates and exports balance tables to excel and {\TeX}. \texttt{ieddtab}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Ieddtab}} does the same for difference-in-differences regressions. +The \textbf{Stata Visual Library}\sidenote{ \url{https://worldbank.github.io/Stata-IE-Visual-Library/}} +has examples of graphs created in Stata and curated by us.\sidenote{A similar resource for R is \href{https://www.r-graph-gallery.com/}{The R Graph Gallery}.} +\textbf{Data visualization}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data_visualization}} \index{data visualization} +is increasingly popular, but a great deal lacks in quality.\cite{healy2018data,wilke2019fundamentals} +We attribute some of this to the difficulty of writing code to create them. +Making a visually compelling graph would already be hard enough if you didn't have to go through many rounds of googling to understand a command. +The trickiest part of using plot commands is to get the data in the right format. +This is why the \textbf{Stata Visual Library} includes example data sets to use with each do file. + +Whole books have been written on how to create good data visualizations, +so we will not attempt to give you advice on it. +Rather, here are a few resources we have found useful. 
+The Tapestry conference focuses on ``storytelling with data''.\sidenote{ + \url{https://www.youtube.com/playlist?list=PLb0GkPPcZCVE9EAm9qhlg5eXMgLrrfMRq}} +\textit{Fundamentals of Data Visualization} provides extensive details on practical application;\sidenote{ + \url{https://serialmentor.com/dataviz}} +as does \textit{Data Visualization: A Practical Introduction}.\sidenote{ + \url{http://socvis.co}} +Graphics tools like Stata are highly customizable. +There is a fair amount of learning curve associated with extremely-fine-grained adjustment, +but it is well worth reviewing the graphics manual\sidenote{\url{https://www.stata.com/manuals/g.pdf}} +For an easier way around it, Gray Kimbrough's \textit{Uncluttered Stata Graphs} code is an excellent default replacement for Stata graphics that is easy to install. +\sidenote{\url{https://graykimbrough.github.io/uncluttered-stata-graphs/}} +If you are a R user, the \textit{R Graphics Cookbook}\sidenote{\url{https://r-graphics.org/}} is a great resource for the most popular visualization package\texttt{ggplot}\sidenote{\url{https://ggplot2.tidyverse.org/}}. But there are a variety of other visualization packages, such as \href{http://jkunst.com/highcharter/}{\texttt{highcharter}}, \href{https://rstudio.github.io/r2d3/}{\texttt{r2d3}}, \href{https://rstudio.github.io/leaflet/}{leaflet}, and \href{https://plot.ly/r/}{plotly}, to name a few. +We have no intention of creating an exhaustive list, and this one is certainly missing very good references. +But at least it is a place to start. + +\section{Exporting analysis outputs} % Exploratory analysis It's ok to not export each and every table and graph created during exploratory analysis. -Instead, we suggest running them into markdown files using RMarkdown or -the different dynamic document options available in Stata. -This will allow you to update and present results quickly, -while maintaining a record the different analysis explored. +Instead, we suggest running them into markdown files using RMarkdown or the different dynamic document options available in Stata. +This will allow you to update and present results quickly while maintaining a record of the different analysis tried. % Final analysis -Final analysis scripts, on the other hand, should export final outputs: -these are ready to be included to a paper or report; and -no manual edits, including formatting, should be necessary after running them. -Manual edits are difficult to replicate, -and you will end up having to make changes to the outputs, -so automating them will save you time by the end of the process. -Don't ever set a workflow that includes copying and pasting results printed in the console. -% Output content -Finally, a final outputs should be self-standing, -meaning they are easy to read and understand -with only the information they contain. -To accomplish this, labels and notes should cover all -relevant information such as -sample, unit of observation, unit of measurement and variable definition. -\sidenote{\url{https://dimewiki.worldbank.org/wiki/Checklist:_Reviewing_Graphs}} -\sidenote{\url{https://dimewiki.worldbank.org/wiki/Checklist:_Submit_Table}} - -% Output formats -Outputs should be saved in accessible and, whenever possible, lightweight formats. 
-% Figures
-Accessible means that other people can easily open them --
-in Stata, that would mean always using \texttt{graph export}
-to save images as \texttt{.jpg}, \texttt{.png}, \texttt{.pdf}, etc.,
-instead of \texttt{graph save}, which creates a \texttt{.gph} file
-that can only be opened through a Stata installation.
-\texttt{.tif} and \texttt{.eps} are two examples of accessible lightweight formats,
-and \texttt{.eps} has the added advantage of allowing a designer to edit the images for printing.
-% Tables
+Final analysis scripts, on the other hand, should export final outputs, which are ready to be included in a paper or report.
+No manual edits, including formatting, should be necessary after exporting final outputs.
+Manual edits are difficult to replicate, and you will inevitably need to make changes to the outputs.
+Automating them will save you time by the end of the process.
+However, don't spend too much time formatting tables and graphs until you are ready to publish.
+Polishing final outputs can be a time-consuming process,
+and you want to do it as few times as possible.
+
+We cannot stress this enough: don't ever set a workflow that requires copying and pasting results from the console.
+There are numerous commands to export outputs from both R and Stata.\sidenote{Some examples are \href{http://repec.sowi.unibe.ch/stata/estout/}{\texttt{estout}}, \href{https://www.princeton.edu/~otorres/Outreg2.pdf}{\texttt{outreg2}}, and \href{https://www.benjaminbdaniels.com/stata-code/outwrite/}{\texttt{outwrite}} in Stata, and \href{https://cran.r-project.org/web/packages/stargazer/vignettes/stargazer.pdf}{\texttt{stargazer}} and \href{https://ggplot2.tidyverse.org/reference/ggsave.html}{\texttt{ggsave}} in R.}
+Save outputs in accessible and, whenever possible, lightweight formats.
+Accessible means that it's easy for other people to open them.
+In Stata, that would mean always using \texttt{graph export} to save images as \texttt{.jpg}, \texttt{.png}, \texttt{.pdf}, etc., instead of \texttt{graph save}, which creates a \texttt{.gph} file that can only be opened through a Stata installation.
 For tables, \texttt{.tex} is preferred.
-A variety of packages in both R and Stata export tables in this format.
-Excel \texttt{.xlsx} and \texttt{.csv} files are also acceptable,
-although if you are working on a large report they will become cumbersome to update after revisions.
+Excel \texttt{.xlsx} and \texttt{.csv} files are also acceptable, although if you are working on a large report they will become cumbersome to update after revisions.

 % Formatting
-Don't spend too much time formatting tables and graphs until you are ready to publish.
-Polishing final outputs can be a time-consuming process,
-and you want to avoid doing it multiple times.
-If you need to create a table with a very particular format,
-that is not automated by any command you know,
-consider writing the table manually
+If you need to create a table with a very particular format that is not automated by any command you know, consider writing it manually
 (Stata's \texttt{file write}, for example, allows you to do that).
-This will allow you to write a cleaner script that focuses on the econometrics,
-and not on complicated commands to create and append intermediate matrices.
+This will allow you to write a cleaner script that focuses on the econometrics, and not on complicated commands to create and append intermediate matrices.
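As a sketch of what that looks like in practice -- with a hypothetical output path, illustrative variable names, and invented numbers -- the formatting choice is stored once and the {\TeX} table is written line by line with \texttt{file write}:

\begin{verbatim}
* Store a formatting decision once, e.g. in the master script
global fmt "%9.2f"

* Compute the number to be reported (illustrative variable names)
summarize outcome if treatment == 1
local mean_treat = r(mean)

* Write a minimal .tex table by hand
file open  mytable using "${outputs}/custom-table.tex", write replace
file write mytable "\begin{tabular}{lc}" _n
file write mytable "Group & Mean outcome \\ \hline" _n
file write mytable "Treatment & " ${fmt} (`mean_treat') " \\" _n
file write mytable "\end{tabular}" _n
file close mytable
\end{verbatim}

This keeps the analysis script focused on the estimation, while every number that appears in the table is still produced by code.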
To avoid cluttering your scripts with formatting and ensure that formatting is consistent across outputs, define formatting options in an R object or a Stata global and call them when needed. - -%------------------------------------------------ - -\section{Visualizing data} - -\textbf{Data visualization}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data_visualization}} is increasingly popular, -\index{data visualization} -but a great deal of it is very low quality.\cite{healy2018data,wilke2019fundamentals} -The default graphics settings in Stata, for example, -are pretty bad.\sidenote{Gray Kimbrough's -\textit{Uncluttered Stata Graphs} code is an excellent -default replacement that is easy to install. -\url{https://graykimbrough.github.io/uncluttered-stata-graphs/}} -Thankfully, graphics tools like Stata are highly customizable. -There is a fair amount of learning curve associated with -extremely-fine-grained adjustment, -but it is well worth reviewing the graphics manual -to understand how to apply basic colors, alignments, and labelling tools\sidenote{ -\url{https://www.stata.com/manuals/g.pdf}} -The manual is worth skimming in full, as it provides -many visual examples and corresponding code examples -that detail how to produce them. - -Graphics are hard to get right because you, as the analyst, -will already have a pretty good idea of what you are trying to convey. -Someone seeing the illustration for the first time, -by contrast, will have no idea what they are looking at. -Therefore, a visual image has to be compelling in the sense -that it shows that there is something interesting going on -and compels the reader to figure out exactly what it is. -A variety of resources can help you -figure out how best to structure a given visual. -The \textbf{Stata Visual Library}\sidenote{ -\url{https://worldbank.github.io/Stata-IE-Visual-Library/}} - has many examples of figures that communicate effectively. -The Tapestry conference focuses on ``storytelling with data''.\sidenote{ -\url{https://www.youtube.com/playlist?list=PLb0GkPPcZCVE9EAm9qhlg5eXMgLrrfMRq}} -\textit{Fundamentals of Data Visualization} provides extensive details on practical application;\sidenote{ -\url{https://serialmentor.com/dataviz}} -as does \textit{Data Visualization: A Practical Introduction}.\sidenote{ -\url{http://socvis.co}} - +% Output content +Keep in mind that final outputs should be self-standing. +This means it should be easy to read and understand them with only the information they contain. +Make sure labels and notes cover all relevant information, such as sample, unit of observation, unit of measurement and variable definition.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Checklist:\_Reviewing\_Graphs} \\ \url{https://dimewiki.worldbank.org/wiki/Checklist:\_Submit\_Table}} From 94c08af31463bdacef9a1273d02d686d7abf325a Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 14 Jan 2020 11:33:27 -0500 Subject: [PATCH 192/854] [ch6] finished #269 --- chapters/data-analysis.tex | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index ddf0f62f5..f31c88606 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -58,9 +58,16 @@ \section{Data management} So, for example, a script called \texttt{clean-section-1} would create a data set called \texttt{cleaned-section-1}. +The division of a project in stages also helps the review workflow inside your team. 
+The code, data and outputs of each of these stages should go through at least one round of code review. +During the code review process, team members should read and run each other's codes. +Doing this at the end of each stage helps prevent the amount of work to be reviewed to become too overwhelming. +Code review is a common quality assurance practice among data scientists. +It helps to keep the level of the outputs high, and is also a great way to learn and improve your code. + % Folder structure There are many schemes to organize research data. -Our preferred scheme reflects this task breakdown. +Our preferred scheme reflects the task breakdown just discussed. \index{data organization} We created the \texttt{iefolder}\sidenote{ \url{https://dimewiki.worldbank.org/wiki/iefolder}} @@ -441,7 +448,7 @@ \section{Exporting analysis outputs} No manual edits, including formatting, should be necessary after exporting final outputs. Manual edits are difficult to replicate, and you will inevitably need to make changes to the outputs. Automating them will save you time by the end of the process. -However, don't spend too much time formatting tables and graphs until you are ready to publish. +However, don't spend too much time formatting tables and graphs until you are ready to publish.\sidenote{For a more detailed discussion on this, including different ways to export tables from Stata, see \url{https://github.com/bbdaniels/stata-tables}} Polishing final outputs can be a time-consuming process, and you want to it as few times as possible. From b7e511a8f737af81f9098ad75a45d28680050d74 Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 14 Jan 2020 13:05:09 -0500 Subject: [PATCH 193/854] [ch6] Fix special characters --- chapters/data-analysis.tex | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index f31c88606..2d3514b1c 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -78,9 +78,9 @@ \section{Data management} \texttt{iefolder} is designed to standardize folder structures across teams and projects. This means that PIs and RAs face very small costs when switching between projects, because they are organized in the same way. -\sidenote{\url{https://dimewiki.worldbank.org/wiki/DataWork_Folder}} +\sidenote{\url{https://dimewiki.worldbank.org/wiki/DataWork\_Folder}} At the first level of this folder are what we call survey round folders. -\sidenote{\url{https://dimewiki.worldbank.org/wiki/DataWork_Survey_Round}} +\sidenote{\url{https://dimewiki.worldbank.org/wiki/DataWork\_Survey\_Round}} You can think of a round as one source of data, that will be cleaned in the same manner. Inside round folders, there are dedicated folders for @@ -89,7 +89,7 @@ \section{Data management} The folders that hold code are organized in parallel to these, so that the progression through the whole project can be followed by anyone new to the team. Additionally, \texttt{iefolder} creates \textbf{master do-files} -\sidenote{\url{https://dimewiki.worldbank.org/wiki/Master_Do-files}} +\sidenote{\url{https://dimewiki.worldbank.org/wiki/Master\_Do-files}} so all project code is reflected in a top-level script. % Master do file @@ -129,7 +129,7 @@ \section{Data cleaning} % intro: what is data cleaning ------------------------------------------------- Data cleaning is the first stage of transforming the data you received from the field into data that you can analyze. 
-\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data_Cleaning}} +\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning}} The cleaning process involves (1) making the data set easily usable and understandable, and (2) documenting individual data points and patterns that may bias the analysis. The underlying data structure does not change. The cleaned data set should contain only the data collected in the field. @@ -154,7 +154,7 @@ \section{Data cleaning} Therefore, raw data should never be interacted with directly. Secure storage of the raw -\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data_Security}} +\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Security}} data means access to it will be restricted even inside the research team. Loading encrypted data multiple times it can be annoying. To facilitate the handling of the data, remove any personally identifiable information from the data set. @@ -175,13 +175,13 @@ \section{Data cleaning} % Unique ID and data entry corrections --------------------------------------------- There are two main cases when the raw data will be modified during data cleaning. The first one is when there are duplicated entries in the data. -Ensuring that observations are uniquely and fully identified\sidenote{\url{https://dimewiki.worldbank.org/wiki/ID_Variable_Properties}} +Ensuring that observations are uniquely and fully identified\sidenote{\url{https://dimewiki.worldbank.org/wiki/ID\_Variable\_Properties}} is possibly the most important step in data cleaning (as anyone who ever tried to merge data sets that are not uniquely identified knows). Modern survey tools create unique observation identifiers. That, however, is not the same as having a unique ID variable for each individual in the sample. You want to make sure the data set has a unique ID variable -that can be cross-referenced with other records, such as the Master Data Set\sidenote{\url{https://dimewiki.worldbank.org/wiki/Master_Data_Set}} +that can be cross-referenced with other records, such as the Master Data Set\sidenote{\url{https://dimewiki.worldbank.org/wiki/Master\_Data\_Set}} and other rounds of data collection. \texttt{ieduplicates} and \texttt{iecompdup}, two commands included in the \texttt{ietoolkit} package, @@ -216,7 +216,7 @@ \section{Data cleaning} Applying labels makes it easier to understand what the data is showing while exploring the data. This minimizes the risk of small errors making their way through into the analysis stage. Variable and value labels should be accurate and concise.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning\#Applying\_Labels}} -Recodes should be used to turn codes for "Don't know", "Refused to answer", and +Recodes should be used to turn codes for ``Don't know'', ``Refused to answer'', and other non-responses into extended missing values.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning\#Survey\_Codes\_and\_Missing\_Values}} String variables need to be encoded, and open-ended responses, categorized or dropped\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning\#Strings}} (unless you are using qualitative or classification analyses, which are less common). @@ -250,7 +250,7 @@ \section{Data cleaning} including enumerator manuals, survey instruments, supervisor notes, and data quality monitoring reports. 
These materials are part of what we call data documentation -\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data_Documentation}} +\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Documentation}} \index{Documentation}, and should be stored in the corresponding folder, as you will probably need them during analysis and publication. @@ -279,7 +279,7 @@ \section{Indicator construction} You need to manipulate them into something that has \textit{economic} meaning. During this process, the data points will typically be reshaped and aggregated so that level of the data set goes from the unit of observation in the survey to the unit of analysis. -\sidenote{\url{https://dimewiki.worldbank.org/wiki/Unit_of_Observation}} +\sidenote{\url{https://dimewiki.worldbank.org/wiki/Unit\_of\_Observation}} To use the same example, the data on quantity consumed was collect for each item, and needs to be aggregated to the household level before analysis. % Why it is a separate process ------------------------------- @@ -366,8 +366,8 @@ \section{Writing data analysis code} \index{data analysis} Many introductions to common code skills and analytical frameworks exist, such as \textit{R for Data Science};\sidenote{\url{https://r4ds.had.co.nz/}} -\textit{A Practical Introduction to Stata};\sidenote{\url{https://scholar.harvard.edu/files/mcgovern/files/practical_introduction_to_stata.pdf}} -\textit{Mostly Harmless Econometrics};\sidenote{\url{https://www.researchgate.net/publication/51992844_Mostly_Harmless_Econometrics_An_Empiricist's_Companion}} +\textit{A Practical Introduction to Stata};\sidenote{\url{https://scholar.harvard.edu/files/mcgovern/files/practical\_introduction\_to\_stata.pdf}} +\textit{Mostly Harmless Econometrics};\sidenote{\url{https://www.researchgate.net/publication/51992844\_Mostly\_Harmless\_Econometrics\_An\_Empiricist's\_Companion}} and \textit{Causal Inference: The Mixtape}.\sidenote{\url{http://scunning.com/mixtape.html}} This section will not include instructions on how to conduct specific analyses. That is a research question, and requires expertise beyond the scope of this book. @@ -413,7 +413,7 @@ \section{Writing data analysis code} \texttt{iebaltab}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Iebaltab}} creates and exports balance tables to excel and {\TeX}. \texttt{ieddtab}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Ieddtab}} does the same for difference-in-differences regressions. The \textbf{Stata Visual Library}\sidenote{ \url{https://worldbank.github.io/Stata-IE-Visual-Library/}} has examples of graphs created in Stata and curated by us.\sidenote{A similar resource for R is \href{https://www.r-graph-gallery.com/}{The R Graph Gallery}.} -\textbf{Data visualization}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data_visualization}} \index{data visualization} +\textbf{Data visualization}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_visualization}} \index{data visualization} is increasingly popular, but a great deal lacks in quality.\cite{healy2018data,wilke2019fundamentals} We attribute some of this to the difficulty of writing code to create them. Making a visually compelling graph would already be hard enough if you didn't have to go through many rounds of googling to understand a command. 
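For tables, the export commands mentioned above already automate most of this work. As one hedged illustration -- with hypothetical variable names and an assumed output path global -- \texttt{esttab}, from the \texttt{estout} package, can write a complete {\TeX} table in a single command:

\begin{verbatim}
* Illustrative only: hypothetical variables, clustered by village
eststo clear
eststo: regress outcome treatment, vce(cluster village_id)
eststo: regress outcome treatment i.strata, vce(cluster village_id)

* Export both columns to a .tex file -- nothing is copied from the console
esttab using "${outputs}/treatment-effects.tex", ///
    replace se label b(%9.3f) ///
    title("Treatment effects (illustrative)")
\end{verbatim}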
From a05d9aba184ba851b9f25d48a0ab26d8dc8bcaa2 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 14 Jan 2020 13:50:22 -0500 Subject: [PATCH 194/854] Credibility without transparency --- chapters/handling-data.tex | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 477528b82..b338f5230 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -17,7 +17,7 @@ and that research participants are appropriately protected. What we call ethical standards in this chapter are a set of practices for research quality and data management that address these two principles. - + Neither transparency nor privacy is an ``all-or-nothing'' objective. We expect that teams will do as much as they can to make their work conform to modern practices of credibility, transparency, and reproducibility. @@ -29,8 +29,10 @@ low-quality studies from reputable sources may be used as evidence when in fact they don't warrant it, and high-quality studies from sources without an international reputation may be ignored. Both these outcomes reduce the quality of evidence overall. - Even more importantly, they usually mean that credibility in development research accumulates at international institutions - and top global universities instead of the people and places directly involved in and affected by it. + Even more importantly, the only way to determine credibility without transparency + is to judge research solely based on where it is done and by whom, + which concentrates credibility at better-known international institutions and global universities, + at the expense of the people and organization directly involved in and affected by it. Simple transparency standards mean that it is easier to judge research quality, and making high-quality research identifiable also increases its impact. This section provides some basic guidelines and resources @@ -89,7 +91,7 @@ \subsection{Research reproducibility} based on the valuable work you have already done.\sidenote{ \url{https://blogs.worldbank.org/opendata/making-analytics-reusable}} Services like \index{GitHub}GitHub\sidenote{ - \url{https://github.com}, GitHub will be discussed more in later chapters} + \url{https://github.com}, GitHub will be discussed more in later chapters} that log your research process are valuable resources here. Such services can show things like modifications made in response to referee comments, by having tagged version histories at each major revision. @@ -177,7 +179,7 @@ \subsection{Research transparency} not a one-time requirement or retrospective task. New decisions are always being made as the plan begins contact with reality, and there is nothing wrong with sensible adaptation so long as it is recorded and disclosed. -(Email is \textit{not} a note-taking service, because communications are rarely well-ordered, +(Email is \textit{not} a note-taking service, because communications are rarely well-ordered, can be easily deleted, and are not available for future team members.) There are various software solutions for building documentation over time. 
@@ -256,7 +258,7 @@ \section{Ensuring privacy and security in research data} Anytime you are collecting primary data in a development research project, you are almost certainly handling data that include \textbf{personally-identifying information (PII)}\index{personally-identifying information}\index{primary data}\sidenote{ - \textbf{Personally-identifying information:} any piece or set of information + \textbf{Personally-identifying information:} any piece or set of information that can be used to identify an individual research subject. \url{https://dimewiki.worldbank.org/wiki/De-identification\#Personally\_Identifiable\_Information}}. PII data contains information that can, without any transformation, be used to identify From 729dcff168f79bd321a1c296d338d0d659b9ee6a Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 14 Jan 2020 13:53:46 -0500 Subject: [PATCH 195/854] Research quality standards --- chapters/handling-data.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index b338f5230..afb0d4c1b 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -69,7 +69,7 @@ \section{Protecting confidence in development research} primary data that researchers use for such studies has never been reviewed by anyone else, so it is hard for others to verify that it was collected, handled, and analyzed appropriately.\sidenote{ \url{https://blogs.worldbank.org/impactevaluations/what-development-economists-talk-about-when-they-talk-about-reproducibility}} -Maintaining confidence in research via the components of credibility, transparency, and reproducibility +Maintaining research quality standards via credibility, transparency, and reproducibility tools is the most important way that researchers using primary data can avoid serious error, and therefore these are not by-products but core components of research output. From 22f5c20ee33bcaaa666ca444f39e536f3e335c38 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 14 Jan 2020 13:56:02 -0500 Subject: [PATCH 196/854] GitHub is one of many --- chapters/handling-data.tex | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index afb0d4c1b..b4dd69be3 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -90,9 +90,10 @@ \subsection{Research reproducibility} is a great way to have new questions asked and answered based on the valuable work you have already done.\sidenote{ \url{https://blogs.worldbank.org/opendata/making-analytics-reusable}} -Services like \index{GitHub}GitHub\sidenote{ - \url{https://github.com}, GitHub will be discussed more in later chapters} -that log your research process are valuable resources here. +Services that log your research process are valuable resources here -- +GitHub is one of many that can do so.\sidenote{ + \url{https://github.com}} + \index{GitHub} Such services can show things like modifications made in response to referee comments, by having tagged version histories at each major revision. 
These services can also use issue trackers and abandoned work branches From a47e7867df40bf1af9ef4f3c5be393fe7eeceea9 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 14 Jan 2020 14:02:17 -0500 Subject: [PATCH 197/854] Citations for journal policy --- bibliography.bib | 22 ++++++++++++++++++++++ chapters/handling-data.tex | 4 ++-- 2 files changed, 24 insertions(+), 2 deletions(-) diff --git a/bibliography.bib b/bibliography.bib index 823bc39a7..31ad6a424 100644 --- a/bibliography.bib +++ b/bibliography.bib @@ -1,3 +1,25 @@ +@article{nosek2015promoting, + title={Promoting an open research culture}, + author={Nosek, Brian A and Alter, George and Banks, George C and Borsboom, Denny and Bowman, Sara D and Breckler, Steven J and Buck, Stuart and Chambers, Christopher D and Chin, Gilbert and Christensen, Garret and others}, + journal={Science}, + volume={348}, + number={6242}, + pages={1422--1425}, + year={2015}, + publisher={American Association for the Advancement of Science} +} + +@article{stodden2013toward, + title={Toward reproducible computational research: an empirical analysis of data and code policy adoption by journals}, + author={Stodden, Victoria and Guo, Peixuan and Ma, Zhaokun}, + journal={PloS one}, + volume={8}, + number={6}, + pages={e67111}, + year={2013}, + publisher={Public Library of Science} +} + @article{simonsohn2015specification, title={Specification curve: Descriptive and inferential statistics on all reasonable specifications}, author={Simonsohn, Uri and Simmons, Joseph P and Nelson, Leif D}, diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index b4dd69be3..97ddb4be4 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -222,7 +222,7 @@ \subsection{Research credibility} \index{pre-registration} Garden varieties of research standards from journals, funders, and others feature both ex ante -(or ``regulation'') and ex post (or ``verification'') policies. +(or ``regulation'') and ex post (or ``verification'') policies.\cite{stodden2013toward} Ex ante policies require that authors bear the burden of ensuring they provide some set of materials before publication and their quality meet some minimum standard. @@ -230,7 +230,7 @@ \subsection{Research credibility} but their quality is not a direct condition for publication. Still, others have suggested ``guidance'' policies that would offer checklists for which practices to adopt, such as reporting on whether and how -various practices were implemented. +various practices were implemented.\cite{nosek2015promoting} With the ongoing rise of empirical research and increased public scrutiny of scientific evidence, this is no longer enough to guarantee that findings will hold their credibility. 
From 35f4599c8eafb5b0cbd507ed8054f62a2290c46f Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 14 Jan 2020 14:05:52 -0500 Subject: [PATCH 198/854] Census citation --- bibliography.bib | 9 +++++++++ chapters/handling-data.tex | 2 +- 2 files changed, 10 insertions(+), 1 deletion(-) diff --git a/bibliography.bib b/bibliography.bib index 31ad6a424..fcf0850a1 100644 --- a/bibliography.bib +++ b/bibliography.bib @@ -1,3 +1,12 @@ +@inproceedings{abowd2018us, + title={The US Census Bureau adopts differential privacy}, + author={Abowd, John M}, + booktitle={Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery \& Data Mining}, + pages={2867--2867}, + year={2018}, + organization={ACM} +} + @article{nosek2015promoting, title={Promoting an open research culture}, author={Nosek, Brian A and Alter, George and Banks, George C and Borsboom, Denny and Bowman, Sara D and Breckler, Steven J and Buck, Stuart and Chambers, Christopher D and Chin, Gilbert and Christensen, Garret and others}, diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 97ddb4be4..28133a9c7 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -468,7 +468,7 @@ \subsection{De-identifying and anonymizing information} and simple measures of the identifiability of records from that. Additional options to protect privacy in data that will become public exist, and you should expect and intend to release your datasets at some point. -One option is to add noise to data, as the US Census Bureau has proposed, +One option is to add noise to data, as the US Census Bureau has proposed,\cite{abowd2018us} as it makes the trade-off between data accuracy and privacy explicit. But there are no established norms for such ``differential privacy'' approaches: most approaches fundamentally rely on judging ``how harmful'' information disclosure would be. From 0de5565432c7b1b6d81b0d203a0dfc8f0a3adce4 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 14 Jan 2020 14:09:23 -0500 Subject: [PATCH 199/854] Remove redundant material --- chapters/handling-data.tex | 14 -------------- 1 file changed, 14 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 28133a9c7..754b777ad 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -116,20 +116,6 @@ \subsection{Research reproducibility} unless for legal or ethical reasons it cannot be.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Publishing_Data}} -Reproducibility and transparency are not binary concepts: -there's a spectrum, starting with simple materials publication. -But even getting that first stage right is a challenge. -An analysis of 203 empirical papers published in top economics journals in 2016 -showed that less than 1 in 7 provided all the data and code -needed to assess computational reproducibility.\cite{galiani2017incentives} -A scan of the 90,000 datasets on the Harvard Dataverse -found that only 10\% had the necessary files and documentation -for computational reproducibility -(and a check of 3,000 of those that met requirements -found that 85\% did not replicate). -People seem to systematically underestimate the benefits -and overestimate the costs to adopting modern research practices. 
- \subsection{Research transparency} Transparent research will expose not only the code, From c6115b276ad388484a61f618344d866021e39854 Mon Sep 17 00:00:00 2001 From: Maria Date: Tue, 14 Jan 2020 14:09:46 -0500 Subject: [PATCH 200/854] Ch 5 re-write New Intro --- chapters/data-collection.tex | 21 +++++++++------------ 1 file changed, 9 insertions(+), 12 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 8c0e0f519..bbfd1f165 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -1,12 +1,9 @@ %------------------------------------------------ \begin{fullwidth} - - %PLACEHOLDER FOR NEW INTRO - Here we focus on tools and workflows that are primarily conceptual, rather than software-specific. This chapter should provide a motivation for - planning data structure during survey design, - developing surveys that are easy to control for quality and security, - and having proper file storage ready for sensitive PII data. +High quality research begins with a thoughtfully-designed, field-tested survey instrument, and a carefully supervised survey. +Much of the recent push toward credibiity in the social sciences has focused on analytical practices. +We contest that credible research depends, first and foremost, on the quality of the raw data. This chapter focuses on data generation, from questionnaire design to field monitoring. As there are many excellent resources on questionnaire design and field supervision \sidenote{\url{Grosh, Margaret; Glewwe, Paul. 2000. Designing Household Survey Questionnaires for Developing Countries : Lessons from 15 Years of the Living Standards Measurement Study, Volume 3. Washington, DC: World Bank. © World Bank. https://openknowledge.worldbank.org/handle/10986/15195 License: CC BY 3.0 IGO.}}, we focus on the particularly challenges and opportunities presented by electronic surveys. As there are many electronic survey tools, we cover workflows and primary concepts, rather than software-specific tools. The chapter covers questionnaire design, piloting and programming; monitoring data quality during the survey; and how to ensure confidential data is handled securely from collection to storage and sharing. \end{fullwidth} @@ -58,12 +55,12 @@ \subsection{Content-focused Pilot} %------------------------------------------------ - \section{Programming CAPI questionnaires} -Most data collection is now done using software tools specifically designed for surveys. CAPI surveys \sidenote{\url{https://dimewiki.worldbank.org/wiki/Computer-Assisted_Personal_Interviews_(CAPI)}} -are typically created in a spreadsheet (e.g. Excel or Google Sheets), or software-specific form builder. \sidenote{\url{https://dimewiki.worldbank.org/wiki/Questionnaire_Programming}} -As these are typically accessible even to novice users, we will not address software-specific form design in this book. Rather, we focus coding conventions that are important to follow regardless of CAPI software choice. -\sidenote{\url{https://dimewiki.worldbank.org/wiki/SurveyCTO_Coding_Practices}} CAPI software tools provide a wide range of features designed to make implementing even highly complex surveys easy, scalable, and secure. However, these are not fully automatic: you still need to actively design and manage the survey. Each software has specific practices that you need to follow to take advantage of CAPI features and ensure that the exported data is compatible with the software that will be used for analysis. 
+Most data collection is now done using electronic survey instruments, known as Computer Assisted Personal Interviews (CAPI). \sidenote{\url{https://dimewiki.worldbank.org/wiki/Computer-Assisted_Personal_Interviews_(CAPI)}} +Electronic questionnaires are typically created in a spreadsheet (e.g. Excel or Google Sheets) or software-specific form builder, accessible even to novice users. \sidenote{\url{https://dimewiki.worldbank.org/wiki/Questionnaire_Programming}} +We will not address software-specific form design in this book; rather, we focus on coding conventions that are important to follow for CAPI regardless of software choice. +\sidenote{\url{https://dimewiki.worldbank.org/wiki/SurveyCTO_Coding_Practices}} +CAPI software tools provide a wide range of features designed to make implementing even highly complex surveys easy, scalable, and secure. However, these are not fully automatic: you still need to actively design and manage the survey. Each software has specific practices that you need to follow to take advantage of CAPI features and ensure that the exported data is compatible with the software that will be used for analysis. \subsection{CAPI workflow} The starting point for questionnaire programming is a complete paper version of the questionnaire, piloted for content, and translated where needed. Starting the programming at this point reduces version control issues that arise from making significant changes to concurrent paper and electronic survey instruments. Most importantly, it means the research, not the technology, drives the questionnaire design. When you start programming, do not start with the first question and program your way to the last question. Instead, code from high level to small detail, following the same questionnaire outline established at design phase. The outline provides the basis for pseudocode, allowing you to start with high level structure and work down to the level of individual questions. This will save time and reduce errors. @@ -117,7 +114,7 @@ \subsection{High Frequency Checks} Tracking survey progress is important for monitoring attrition, so that it is clear early on if a change in protocols or additional tracking will be needed. It is also important to check interview completion rate and sample compliance by surveyor and survey team, to identify any under-performing individuals or teams. -When all data collection is complete, the survey team should have a final field report, which should report reasons for any deviations between the original sample and the dataset collected. +When all data collection is complete, the survey team should prepare a final field report, which should report reasons for any deviations between the original sample and the dataset collected. Identification and reporting of \textbf{missing data} and \textbf{attrition} is critical to the interpretation of survey data. It is important to structure this reporting in a way that not only group broads rationales into specific categories but also collects all the detailed, open-ended responses to questions the field team can provide for any observations that they were unable to complete. 
From 52b0751a5c9274478b7a8fdfa722b89af0317260 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 14 Jan 2020 14:12:21 -0500 Subject: [PATCH 201/854] Identifying when analyzed together --- chapters/handling-data.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 754b777ad..b994055ef 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -442,7 +442,7 @@ \subsection{De-identifying and anonymizing information} Note, however, that it is in practice impossible to \textbf{anonymize} data. There is always some statistical chance that an individual's identity will be re-linked to the data collected about them -by using some other set of data that are collectively unique. +by using some other data that becomes identifying when analyzed together. There are a number of tools developed to help researchers de-identify data and which you should use as appropriate at that stage of data collection. These include \texttt{PII\_detection}\sidenote{\url{https://github.com/PovertyAction/PII\_detection}} from IPA, From 8405b1d7efcc96e7f227fcb69c6430721c788925 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Tue, 14 Jan 2020 14:23:12 -0500 Subject: [PATCH 202/854] [ch 2] - code AND data is needed to go through the list below --- chapters/planning-data-work.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index cc2a9b5c7..e69bf138a 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -522,7 +522,7 @@ \subsection{Documenting and organizing code} which serves as a table of contents for the instructions that you code. Anyone should be able to follow and reproduce all your work from raw data to all outputs by simply running this single script. -By follow, we mean someone external to the project who has the master script can +By follow, we mean someone external to the project who has the master script and all the input data can (i) run all the code and recreate all outputs, (ii) have a general understanding of what is being done at every step, and (iii) see how codes and outputs are related. From a08b769f04a727dc92eddd4518203fea85719b51 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Tue, 14 Jan 2020 14:24:11 -0500 Subject: [PATCH 203/854] [ch 2] - restructure sentence so it explains "root" --- chapters/planning-data-work.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index e69bf138a..b90ae6c02 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -540,7 +540,7 @@ \subsection{Documenting and organizing code} Because the \texttt{DataWork} folder is shared by the whole team, its structure is the same in each team member's computer. The only difference between machines should be -the path to the project or \texttt{DataWork} folder (the highest-level shared folder). +the path to the project root folder, i.e. the highest-level shared folder, which in the context of \texttt{iefolder} is the \texttt{DataWork} folder. This is reflected in the master script in such a way that the only change necessary to run the entire code from a new computer is to change the path to the project folder to reflect the filesystem and username. 
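Concretely, the top of a master script often contains nothing more than a switch on the user name, so that the same code runs unchanged on every machine. The user names and paths below are, of course, hypothetical:

\begin{verbatim}
* Set the project root once per user
if c(username) == "jsmith" {
    global projectfolder "C:/Users/jsmith/Dropbox/MyProject"
}
if c(username) == "lchen" {
    global projectfolder "/Users/lchen/Dropbox/MyProject"
}

* Every other path is built from the root, so no other line changes
global datawork "${projectfolder}/DataWork"
global outputs  "${datawork}/Output"
\end{verbatim}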
From 469e5bc43c32193bc12e56c1590a8837185c7965 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Tue, 14 Jan 2020 14:24:55 -0500 Subject: [PATCH 204/854] [ch 2] links to how reader can learn how we wrote this book --- chapters/planning-data-work.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index b90ae6c02..4c2541357 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -625,7 +625,7 @@ \subsection{Output management} Creating documents in {\LaTeX} using an integrated writing environment such as TeXstudio, TeXmaker or LyX is great for outputs that focus mainly on text, but include small chunks of code and static code outputs. -This book, for example, was written in {\LaTeX} and managed on GitHub. +This book, for example, was written in {\LaTeX}\sidenote{\url{https://www.latex-project.org} and \url{https://github.com/worldbank/DIME-LaTeX-Templates}} and managed on GitHub\sidenote{\url{https://github.com/worldbank/d4di}}. Another option is to use the statistical software's dynamic document engines. This means you can write both text (in Markdown) and code in the script, From 66296030c8eccbf3948f6ff82dc238ab2e2299f2 Mon Sep 17 00:00:00 2001 From: Maria Date: Tue, 14 Jan 2020 14:54:39 -0500 Subject: [PATCH 205/854] Ch5 re-write Edits to questionnaire design and programming --- chapters/data-collection.tex | 91 +++++++++++++++++++++++++----------- 1 file changed, 63 insertions(+), 28 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index bbfd1f165..768736807 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -2,8 +2,14 @@ \begin{fullwidth} High quality research begins with a thoughtfully-designed, field-tested survey instrument, and a carefully supervised survey. -Much of the recent push toward credibiity in the social sciences has focused on analytical practices. -We contest that credible research depends, first and foremost, on the quality of the raw data. This chapter focuses on data generation, from questionnaire design to field monitoring. As there are many excellent resources on questionnaire design and field supervision \sidenote{\url{Grosh, Margaret; Glewwe, Paul. 2000. Designing Household Survey Questionnaires for Developing Countries : Lessons from 15 Years of the Living Standards Measurement Study, Volume 3. Washington, DC: World Bank. © World Bank. https://openknowledge.worldbank.org/handle/10986/15195 License: CC BY 3.0 IGO.}}, we focus on the particularly challenges and opportunities presented by electronic surveys. As there are many electronic survey tools, we cover workflows and primary concepts, rather than software-specific tools. The chapter covers questionnaire design, piloting and programming; monitoring data quality during the survey; and how to ensure confidential data is handled securely from collection to storage and sharing. +Much of the recent push toward credibility in the social sciences has focused on analytical practices. +We contest that credible research depends, first and foremost, on the quality of the raw data. This chapter covers the data generation workflow, from questionnaire design to field monitoring, for electronic data collection. +There are many excellent resources on questionnaire design and field supervision, +\sidenote{\url{Grosh, Margaret; Glewwe, Paul. 2000. 
Designing Household Survey Questionnaires for Developing Countries : Lessons from 15 Years of the Living Standards Measurement Study, Volume 3. Washington, DC: World Bank. © World Bank. https://openknowledge.worldbank.org/handle/10986/15195 License: CC BY 3.0 IGO.}} +but few covering the particularly challenges and opportunities presented by electronic surveys (often referred to as Computer Assisted Personal Interviews (CAPI)). +\sidenote{\url{https://dimewiki.worldbank.org/wiki/Computer-Assisted_Personal_Interviews_(CAPI)}} +As there are many electronic survey tools, we focus on workflows and primary concepts, rather than software-specific tools. +The chapter covers questionnaire design, piloting and programming; monitoring data quality during the survey; and how to ensure confidential data is handled securely from collection to storage and sharing. \end{fullwidth} @@ -12,12 +18,18 @@ \section{Designing CAPI questionnaires} A well-designed questionnaire results from careful planning, consideration of analysis and indicators, close review of existing questionnaires, survey pilots, and research team and stakeholder review. -Although most surveys are now collected electronically (often referred to as Computer Assisted Personal Interviews (CAPI)) -- +Although most surveys are now collected electronically -- \textbf{Questionnaire design}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Questionnaire_Design}} \index{questionnaire design} -(content development) and questionnaire programming (functionality development) should be seen as two strictly separate tasks. By focusing on content first and programming implementation later, the survey design quality is better than when the questionnaire is set up in a way which is technically convenient to program. The research team should agree on all questionnaire content and design a paper version before programming a CAPI version. This facilitates a focus on content during the design process and ensures teams have a readable, printable paper version of their questionnaire. +(content development) and questionnaire programming (functionality development) should be seen as two strictly separate tasks. +The research team should agree on all questionnaire content and design a paper version before programming a CAPI version. +This facilitates a focus on content during the design process and ensures teams have a readable, printable paper version of their questionnaire. +Most importantly, it means the research, not the technology, drives the questionnaire design. -An easy-to-read paper version of the questionnaire is particularly critical for training enumerators, so they can get an overview of the survey content and structure before diving into the programming. It is much easier for enumerators to understand all possible response pathways from a paper version, than from swiping question by question. Finalizing the questionnaire before programming also avoids version control concerns that arise from concurrent work on paper and electronic survey instruments. In addition, a paper questionnaire is an important documentation for data publication. +An easy-to-read paper version of the questionnaire is particularly critical for training enumerators, so they can get an overview of the survey content and structure before diving into the programming. +It is much easier for enumerators to understand all possible response pathways from a paper version, than from swiping question by question. 
+Finalizing the questionnaire before programming also avoids version control concerns that arise from concurrent work on paper and electronic survey instruments. +Finally, a paper questionnaire is an important documentation for data publication. The workflow for designing a questionnaire will feel much like writing an essay, or writing pseudocode: begin from broad concepts and slowly flesh out the specifics. It is essential to start with a clear understanding of the \textbf{theory of change} \sidenote{\url{https://dimewiki.worldbank.org/wiki/Theory_of_Change}} and \textbf{experimental design} for your project. @@ -30,20 +42,21 @@ \section{Designing CAPI questionnaires} At this point, it is useful to do a \textbf{content-focused pilot} \sidenote{\url{https://dimewiki.worldbank.org/wiki/Piloting_Survey_Content}} of the questionnaire. Doing this pilot with a pen-and-paper questionnaire encourages more significant revisions, as there is no need to factor in costs of re-programming, and as a result improves the overall quality of the survey instrument. -\subsection{Questionnaire design considerations for quantitative analysis} +\subsection{Questionnaire design for quantitative analysis} This book covers surveys designed to yield datasets useful for quantitative analysis. This is a subset of surveys, and there are specific design considerations that will help to ensure the raw data outputs are ready for analysis. -\textit{Coded response options:} From a data perspective, questions with pre-coded response options are always preferable to open-ended questions (the content-based pilot is an excellent time to ask open-ended questions, and refine responses for the final version of the questionnaire). Coding responses helps to ensure that the data will be useful for quantitative analysis. Two examples help illustrate the point. First, instead of asking ``How do you feel about the proposed policy change?'', use techniques like \textbf{Likert scales}\sidenote{\textbf{Likert scale:} an ordered selection of choices indicating the respondent's level of agreement or disagreement with a proposed statement.}. Second, if collecting data on medication use or supplies, you could collect: the brand name of the product; the generic name of the product; the coded compound of the product; or the broad category to which each product belongs (antibiotic, etc.). All four may be useful for different reasons, but the latter two are likely to be the most useful for data analysis. The coded compound requires providing a translation dictionary to field staff, but enables automated rapid recoding for analysis with no loss of information. The generic class requires agreement on the broad categories of interest, but allows for much more comprehensible top-line statistics and data quality checks. -\textit{Sample tracking:} it is essential to document the reasons for \textbf{attrition}, treatment \textbf{contamination}, and \textbf{loss to follow-up} - -\index{attrition}\index{contamination} are essential data components for completing CONSORT records. -\sidenote[][-3.5cm]{\textbf{CONSORT:} a standardized system for reporting enrollment, intervention allocation, follow-up, and data analysis through the phases of a randomized trial of two groups. Begg, C., Cho, M., Eastwood, S., Horton, R., Moher, D., Olkin, I., Pitkin, R., Rennie, D., Schulz, K. F., Simel, D., et al. (1996). Improving the quality of reporting of randomized controlled trials: The CONSORT statement. 
\textit{JAMA}, 276(8):637--639} - -\textit{How to name questions:} There is debate over how individual questions should be identified: formats like \texttt{hq\_1} are hard to remember and unpleasant to reorder, but formats like \texttt{hq\_asked\_about\_loans} quickly become cumbersome. -We recommend using descriptive names, but with clear prefixes so that variables within a module stay together when sorted alphabetically. {\sidenote{\url{https://medium.com/@janschenk/variable-names-in-survey-research-a18429d2d4d8}} Variable names should never include spaces or mixed cases (all lower case is best). Take care with the length: very long names will be cut off in certain software, which could result in a loss of uniqueness. +We recommend using descriptive names with clear prefixes so that variables within a module stay together when sorted alphabetically. {\sidenote{\url{https://medium.com/@janschenk/variable-names-in-survey-research-a18429d2d4d8}} Variable names should never include spaces or mixed cases (all lower case is best). Take care with the length: very long names will be cut off in certain software, which could result in a loss of uniqueness. We discourage explicit question numbering, as it discourages re-ordering, which is a common recommended change after the pilot. In the case of follow-up surveys, numbering can quickly become convoluted, too often resulting in variables names like 'ag_15a', 'ag_15_new', ag_15_fup2', etc. + +Questionnaires must include ways to document the reasons for \textbf{attrition}, treatment \textbf{contamination}, and \textbf{loss to follow-up}. +\index{attrition}\index{contamination} +These are essential data components for completing CONSORT records, a standardized system for reporting enrollment, intervention allocation, follow-up, and data analysis through the phases of a randomized trial of two groups. +\sidenote[][-3.5cm]{Begg, C., Cho, M., Eastwood, S., Horton, R., Moher, D., Olkin, I., Pitkin, R., Rennie, D., Schulz, K. F., Simel, D., et al. (1996). Improving the quality of reporting of randomized controlled trials: The CONSORT statement. \textit{JAMA}, 276(8):637--639} + + \subsection{Content-focused Pilot} A \textbf{Survey Pilot} \sidenote{\url{https://dimewiki.worldbank.org/wiki/Survey_Pilot}} is the final step of questionnaire design. @@ -56,34 +69,56 @@ \subsection{Content-focused Pilot} %------------------------------------------------ \section{Programming CAPI questionnaires} -Most data collection is now done using electronic survey instruments, known as Computer Assisted Personal Interviews (CAPI). \sidenote{\url{https://dimewiki.worldbank.org/wiki/Computer-Assisted_Personal_Interviews_(CAPI)}} +Electronic data collection has great potential to simplify survey implementation and improve data quality. Electronic questionnaires are typically created in a spreadsheet (e.g. Excel or Google Sheets) or software-specific form builder, accessible even to novice users. \sidenote{\url{https://dimewiki.worldbank.org/wiki/Questionnaire_Programming}} We will not address software-specific form design in this book; rather, we focus on coding conventions that are important to follow for CAPI regardless of software choice. \sidenote{\url{https://dimewiki.worldbank.org/wiki/SurveyCTO_Coding_Practices}} -CAPI software tools provide a wide range of features designed to make implementing even highly complex surveys easy, scalable, and secure. However, these are not fully automatic: you still need to actively design and manage the survey. 
Each software has specific practices that you need to follow to take advantage of CAPI features and ensure that the exported data is compatible with the software that will be used for analysis. + +CAPI software tools provide a wide range of features designed to make implementing even highly complex surveys easy, scalable, and secure. +However, these are not fully automatic: you still need to actively design and manage the survey. +Here, we discuss specific practices that you need to follow to take advantage of CAPI features and ensure that the exported data is compatible with the software that will be used for analysis, and the importance of a data-focused pilot. \subsection{CAPI workflow} -The starting point for questionnaire programming is a complete paper version of the questionnaire, piloted for content, and translated where needed. Starting the programming at this point reduces version control issues that arise from making significant changes to concurrent paper and electronic survey instruments. Most importantly, it means the research, not the technology, drives the questionnaire design. When you start programming, do not start with the first question and program your way to the last question. Instead, code from high level to small detail, following the same questionnaire outline established at design phase. The outline provides the basis for pseudocode, allowing you to start with high level structure and work down to the level of individual questions. This will save time and reduce errors. +The starting point for questionnaire programming is a complete paper version of the questionnaire, piloted for content, and translated where needed. +Starting the programming at this point reduces version control issues that arise from making significant changes to concurrent paper and electronic survey instruments. +When you start programming, do not start with the first question and program your way to the last question. +Instead, code from high level to small detail, following the same questionnaire outline established at design phase. +The outline provides the basis for pseudocode, allowing you to start with high level structure and work down to the level of individual questions. This will save time and reduce errors. \subsection{CAPI features} +CAPI surveys are more than simply an electronic version of a paper questionnaire. +All common CAPI software allow you to automate survey logic and add in hard and soft constraints on survey responses. +These features make enumerators' work easier, and they create the opportunity to identify and resolve data issues in real-time, simplifying data cleaning and improving response quality. +Well-programmed questionnaires should include most or all of the following features: + \begin{itemize} - \item{Survey logic}: build all skip patterns into the survey instrument, to ensure that only relevant questions are asked. This covers simple skip codes, as well as more complex interdependencies (e.g., a child health module is only asked to households that report the presence of a child under 5) - \item{Range checks}: add range checks for all numeric variables to catch data entry mistakes (e.g. 
age must be less than 120) - \item{Confirmation of key variables}: require double entry of essential information (such as a contact phone number in a survey with planned phone follow-ups), with automatic validation that the two entries match - \item{Multimedia}: electronic questionnaires facilitate collection of images, video, and geolocation data directly during the survey, using the camera and GPS built into the tablet or phone. - \item{Preloaded data}: data from previous rounds or related surveys can be used to prepopulate certain sections of the questionnaire, and validated during the interview. - \item{Sortable response options}: filters reduce the number of response options dynamically (e.g. filtering the cities list based on the state provided). - \item{Location checks}: enumerators submit their actual location using in-built GPS, to confirm they are in the right place for the interview. - \item{Consistency checks}: check that answers to related questions align, and trigger a warning if not so that enumerators can probe further. For example, if a household reports producing 800 kg of maize, but selling 900 kg of maize from their own production. - \item{Calculations}: make the electronic survey instrument do all math, rather than relying on the enumerator or asking them to carry a calculator. + \item{\textbf{Survey logic}}: build all skip patterns into the survey instrument, to ensure that only relevant questions are asked. This covers simple skip codes, as well as more complex interdependencies (e.g., a child health module is only asked to households that report the presence of a child under 5) + \item{\textbf{Range checks}}: add range checks for all numeric variables to catch data entry mistakes (e.g. age must be less than 120) + \item{\textbf{Confirmation of key variables}}: require double entry of essential information (such as a contact phone number in a survey with planned phone follow-ups), with automatic validation that the two entries match + \item{\textbf{Multimedia}}: electronic questionnaires facilitate collection of images, video, and geolocation data directly during the survey, using the camera and GPS built into the tablet or phone. + \item{\textbf{Preloaded data}}: data from previous rounds or related surveys can be used to prepopulate certain sections of the questionnaire, and validated during the interview. + \item{\textbf{Filtered response options}}: filters reduce the number of response options dynamically (e.g. filtering the cities list based on the state provided). + \item{\textbf{Location checks}}: enumerators submit their actual location using in-built GPS, to confirm they are in the right place for the interview. + \item{\textbf{Consistency checks}}: check that answers to related questions align, and trigger a warning if not so that enumerators can probe further. For example, if a household reports producing 800 kg of maize, but selling 900 kg of maize from their own production. + \item{\textbf{Calculations}}: make the electronic survey instrument do all math, rather than relying on the enumerator or asking them to carry a calculator. \end{itemize} \subsection{Compatibility with analysis software} -The \texttt{ietestform} command,\sidenote{\url{https://dimewiki.worldbank.org/wiki/ietestform}} part of -\texttt{iefieldkit}, implements form-checking routines for \textbf{SurveyCTO}, a proprietary implementation of \textbf{Open Data Kit (ODK)}. 
+All CAPI software include debugging and test options to correct syntax errors and make sure that the survey instruments will successfully compile. +This is not sufficient, however, to ensure that the resulting dataset will load without errors in your data analysis software of choice. +We developed the \texttt{ietestform} command,\sidenote{\url{https://dimewiki.worldbank.org/wiki/ietestform}} part of +\texttt{iefieldkit}, to implement a form-checking routine for \textbf{SurveyCTO}, a proprietary implementation of \textbf{Open Data Kit (ODK)}. +Intended for use during questionnaire programming and before fieldwork, ietestform tests for best practices in coding, naming and labeling, and choice lists. +Although \texttt{ietestform} is software-specific, many of the tests it runs are general and important to consider regardless of software choice. +To give a few examples, ietestform tests that no variable names exceed 32 characters, the limit in Stata (variable names that exceed that limit will be truncated, and as a result may no longer be unique). It checks whether ranges are included for numeric variables. +\texttt{ietestform} also removes all leading and trailing blanks from response lists, which could be handled inconsistently across software. \subsection{Data-focused Pilot} -The final stage of questionnaire programming is another Survey Pilot. The objective of the Data-focused Pilot \sidenote{\url{https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Data)&printable=yes}} is to validate programming and export a sample dataset. Significant desk-testing of the instrument is required to debug the programming as fully as possible before going to the field. It is important to plan for multiple days of piloting, so that any further debugging or other revisions to the electronic survey instrument can be made at the end of each day and tested the following, until no further field errors arise. The Data-focused pilot should be done in advance of Enumerator training +The final stage of questionnaire programming is another Survey Pilot. +The objective of the Data-focused Pilot \sidenote{\url{https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Data)&printable=yes}} is to validate programming and export a sample dataset. +Significant desk-testing of the instrument is required to debug the programming as fully as possible before going to the field. +It is important to plan for multiple days of piloting, so that any further debugging or other revisions to the electronic survey instrument can be made at the end of each day and tested the following, until no further field errors arise. +The Data-focused pilot should be done in advance of Enumerator training From e82baf6b6f8cb175037a4217b4e3f3e4233e4043 Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 14 Jan 2020 15:09:09 -0500 Subject: [PATCH 206/854] [ch6] Small notes from Ben --- chapters/data-analysis.tex | 112 ++++++++++++++++++++----------------- 1 file changed, 62 insertions(+), 50 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 2d3514b1c..9cb38b835 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -1,7 +1,6 @@ %------------------------------------------------ \begin{fullwidth} -Data analysis is hard. Transforming raw data into a substantial contribution to scientific knowledge requires a mix of subject expertise, programming skills, and statistical and econometric knowledge. 
@@ -15,8 +14,9 @@
 When it comes to code, though, analysis is the easy part, as long as you have organized your data well.
-Of course, the econometrics behind data analysis is complex,
-but this is not a book on econometrics.
+Of course, there is plenty of complexity behind it:
+the econometrics, the theory of change, the measurement methods, and so much more.
+But none of those are the subject of this book.
 Instead, this chapter will focus on how to organize your data work.
 Most of a Research Assistant's time is spent cleaning data and getting it into the right format.
 When the practices recommended here are adopted,
@@ -44,7 +44,7 @@ \section{Data management}
 how each edit affects other files in the project.
 % Task breakdown
-We divide the process of turning raw data into analysis data in three stages:
+We divide the process of turning raw data into analysis data into three stages:
 data cleaning, variable construction, and data analysis.
 Though they are frequently implemented at the same time,
 we find that creating separate scripts and data sets prevents mistakes.
@@ -77,30 +77,27 @@ \section{Data management}
 but it can be used for different types of data.
 \texttt{iefolder} is designed to standardize folder structures across teams and projects.
 This means that PIs and RAs face very small costs when switching between projects,
-because they are organized in the same way.
-\sidenote{\url{https://dimewiki.worldbank.org/wiki/DataWork\_Folder}}
-At the first level of this folder are what we call survey round folders.
-\sidenote{\url{https://dimewiki.worldbank.org/wiki/DataWork\_Survey\_Round}}
-You can think of a round as one source of data,
-that will be cleaned in the same manner.
+because they are organized in the same way.\sidenote{\url{https://dimewiki.worldbank.org/wiki/DataWork\_Folder}}
+At the first level of this folder are what we call survey round folders.\sidenote{\url{https://dimewiki.worldbank.org/wiki/DataWork\_Survey\_Round}}
+You can think of a ``round'' as one source of data,
+that will be cleaned in the same script.
 Inside round folders, there are dedicated folders for raw (encrypted) data; de-identified data; cleaned data; and final (constructed) data.
 There is a folder for raw results, as well as for final outputs.
 The folders that hold code are organized in parallel to these,
 so that the progression through the whole project can be followed by anyone new to the team.
-Additionally, \texttt{iefolder} creates \textbf{master do-files}
-\sidenote{\url{https://dimewiki.worldbank.org/wiki/Master\_Do-files}}
+Additionally, \texttt{iefolder} creates \textbf{master do-files}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Master\_Do-files}}
 so all project code is reflected in a top-level script.
 % Master do file
 Master scripts allow users to execute all the project code from a single file.
-It briefly describes what each code,
+They briefly describe what each script does,
 and map the files they require and create.
-It also connects code and folder structure through globals or objects.
+They also connect code and folder structure through globals or objects.
 In short, a master script is a human-readable map to the tasks, files and folder structure that comprise a project.
 Having a master script eliminates the need for complex instructions to replicate results.
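As a minimal sketch (with purely hypothetical folder paths and script names), a master do-file that plays this role might look like the following:

\begin{verbatim}
* Master do-file: a single entry point for all project code (hypothetical names)
  clear all
  set more off

* Globals connect the code to the folder structure
  global projectFolder "C:/Users/yourname/Dropbox/project-name"
  global dataWork      "${projectFolder}/DataWork"

* Run each stage of the data work in order
  do "${dataWork}/baseline/dofiles/cleaning.do"      // raw data to de-identified, clean data
  do "${dataWork}/baseline/dofiles/construction.do"  // clean data to constructed indicators
  do "${dataWork}/baseline/dofiles/analysis.do"      // constructed data to tables and figures
\end{verbatim}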
-Reading the master do-file should be enough for anyone who's unfamiliar with the project
+Reading the master do-file should be enough for anyone unfamiliar with the project
 to understand what the main tasks are, which scripts execute them,
 and where different files can be found in the project folder.
 That is, it should contain all the information needed to interact with a project's data work.
@@ -114,8 +111,8 @@ \section{Data management}
 Both analysis results and data sets will change with the code.
 You should have each of them stored with the code that created it.
 If you are writing code in Git/GitHub,
-you can output plain text files such as \texttt{.tex} tables,
-and meta data saved in \texttt{.txt} or \texttt{.csv} to that directory.
+you can output plain text files such as \texttt{.tex} tables
+and metadata saved in \texttt{.txt} or \texttt{.csv} to that directory.
 Binary files that compile the tables, as well as the complete data sets,
 on the other hand, should be stored in your team's shared folder.
@@ -128,11 +125,11 @@ \section{Data management}
 \section{Data cleaning}
 % intro: what is data cleaning -------------------------------------------------
-Data cleaning is the first stage of transforming the data you received from the field into data that you can analyze.
-\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning}}
-The cleaning process involves (1) making the data set easily usable and understandable, and (2) documenting individual data points and patterns that may bias the analysis.
+Data cleaning is the first stage of transforming the data you received from the field into data that you can analyze.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning}}
+The cleaning process involves (1) making the data set easily usable and understandable,
+and (2) documenting individual data points and patterns that may bias the analysis.
 The underlying data structure does not change.
-The cleaned data set should contain only the data collected in the field.
+The cleaned data set should contain only the variables collected in the field.
 No modifications to data points are made at this stage,
 except for corrections of mistaken entries.
 Cleaning is probably the most time consuming of the stages discussed in this chapter.
@@ -147,22 +144,18 @@ \section{Data cleaning}
 It should contain only materials that are received directly from the field.
 They will invariably come in a host of file formats and nearly always contain personally-identifying information.\index{personally-identifying information}
 These files should be retained in the raw data folder \textit{exactly as they were received}.
-The folder must be encrypted if it is shared in an insecure fashion,
-\sidenote{\url{https://dimewiki.worldbank.org/wiki/Encryption}}.
+The folder must be encrypted if it is shared in an insecure fashion,\sidenote{\url{https://dimewiki.worldbank.org/wiki/Encryption}}
 and it must be backed up in a secure offsite location.
 Everything else can be replaced, but raw data cannot.
 Therefore, raw data should never be interacted with directly.
-Secure storage of the raw
-\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Security}}
+Secure storage of the raw\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Security}}
 data means access to it will be restricted even inside the research team.
 Loading encrypted data multiple times can be annoying.
 To facilitate the handling of the data, remove any personally identifiable information from the data set.
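A hedged sketch of this step, with purely hypothetical folder globals, file names, and identifier variables, might be:

\begin{verbatim}
* De-identification sketch (all names below are hypothetical)
  use "${dataWork}/baseline/raw/survey_raw.dta", clear

* Strip direct identifiers before saving outside the encrypted folder
  drop respondent_name phone_number address gps_latitude gps_longitude

  save "${dataWork}/baseline/deidentified/survey_deid.dta", replace
\end{verbatim}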
The result is a de-identified data set that can be saved in a non-encrypted folder.
-De-identification,
-\sidenote{\url{https://dimewiki.worldbank.org/wiki/De-identification}}
-at this stage, means stripping the data set of direct identifiers such as names, phone numbers, addresses, and geolocations.
-\sidenote{\url{ https://www.povertyactionlab.org/sites/default/files/resources/J-PAL-guide-to-deidentifying-data.pdf }}
+De-identification,\sidenote{\url{https://dimewiki.worldbank.org/wiki/De-identification}}
+at this stage, means stripping the data set of direct identifiers such as names, phone numbers, addresses, and geolocations.\sidenote{\url{ https://www.povertyactionlab.org/sites/default/files/resources/J-PAL-guide-to-deidentifying-data.pdf}}
 The resulting de-identified data will be the underlying source for all cleaned and constructed data.
 Because identifying information is typically only used during data collection,
 to find and confirm the identity of interviewees,
@@ -184,7 +177,7 @@ \section{Data cleaning}
 that can be cross-referenced with other records,
 such as the Master Data Set\sidenote{\url{https://dimewiki.worldbank.org/wiki/Master\_Data\_Set}} and other rounds of data collection.
 \texttt{ieduplicates} and \texttt{iecompdup},
-two commands included in the \texttt{ietoolkit} package,
+two commands included in the \texttt{iefieldkit} package,\index{iefieldkit}
 create an automated workflow to identify, correct and document occurrences of duplicate entries.
@@ -212,7 +205,7 @@ \section{Data cleaning}
 We have a few recommendations on how to use this command for data cleaning.
 First, we suggest keeping the same variable names as in the survey instrument, so it's easy to connect the two files.
-Don't skip the labelling!
+Don't skip the labelling.
 Applying labels makes it easier to understand what the data is showing as you explore it.
 This minimizes the risk of small errors making their way through into the analysis stage.
 Variable and value labels should be accurate and concise.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning\#Applying\_Labels}}
@@ -221,7 +214,7 @@ \section{Data cleaning}
 String variables need to be encoded, and open-ended responses, categorized or dropped\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning\#Strings}} (unless you are using qualitative or classification analyses, which are less common).
 Finally, any additional information collected only for quality monitoring purposes,
-such as notes and duration field, can also be dropped.
+such as notes and duration fields, can also be dropped.
 % Outputs -----------------------------------------------------------------
@@ -231,9 +224,9 @@ \section{Data cleaning}
 with no changes to data points.
 It should also be easily traced back to the survey instrument,
 and be accompanied by a dictionary or codebook.
-Typically, one cleaned data set will be created for each data source.
-Each row in the cleaned data set represents one survey entry or unit of observation.
-\sidenote{\cite{tidy-data}}
+Typically, one cleaned data set will be created for each data source,
+i.e., per survey instrument.
+Each row in the cleaned data set represents one survey entry or unit of observation.\sidenote{\cite{tidy-data}}
 If the raw data set is very large, or the survey instrument is very complex,
 you may want to break the data cleaning into sub-steps,
 and create intermediate cleaned data sets
@@ -271,15 +264,14 @@ \section{Indicator construction}
 This is done by creating derived variables (binaries, indices, and interactions, to name a few).
 To understand why construction is necessary,
-let's take the example of a survey's consumption module.
+let's take the example of a household survey's consumption module.
 It will result in separate variables indicating the amount of each item in the bundle that was consumed.
 There may be variables indicating the cost of these items.
 You cannot run a meaningful regression on these variables.
 You need to manipulate them into something that has \textit{economic} meaning.
 During this process, the data points will typically be reshaped and aggregated
-so that level of the data set goes from the unit of observation in the survey to the unit of analysis.
-\sidenote{\url{https://dimewiki.worldbank.org/wiki/Unit\_of\_Observation}}
+so that the level of the data set goes from the unit of observation in the survey to the unit of analysis.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Unit\_of\_Observation}}
 To use the same example, the data on quantity consumed was collected for each item, and needs to be aggregated to the household level before analysis.
 % Why it is a separate process -------------------------------
@@ -294,6 +286,7 @@ \section{Indicator construction}
 the data cleaning will differ between the two,
 but you want to make sure that variable definition is consistent across sources.
 So you want to first merge the data sets and then create the variables only once.
+Therefore, unlike cleaning, construction can create many outputs from many inputs.
 % From analysis
 Data construction is never a finished process.
@@ -313,10 +306,11 @@ \section{Indicator construction}
 Are all variables you are combining into an index or average using the same scale?
 Are yes or no questions coded as 0 and 1, or 1 and 2?
 This is when you will use the knowledge of the data you acquired and the documentation you created during the cleaning step the most.
+It is often useful to start by looking at comparisons and other documentation outside the code editor.
 Adding comments to the code explaining what you are doing and why is crucial here.
 There are always ways for things to go wrong that you never anticipated,
 but two issues to pay extra attention to are missing values and dropped observations.
-If you are subsetting a data set, drop observations explicitly, indicating why you are doing that and how the data set changed.
+If you are subsetting a data set, you should drop observations explicitly, indicating why you are doing that and how the data set changed.
 Merging, reshaping and aggregating data sets can change both the total number of observations and the number of observations with missing values.
 Make sure to read about how each command treats missing observations and, whenever possible,
 add automated checks in the script that throw an error message if the result is changing.
@@ -324,6 +318,7 @@ \section{Indicator construction}
 The most common of them is the presence of outliers.
 How to treat outliers is a research question, but make sure to note the decision made by the research team, and how it was reached.
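For example, one common approach, sketched below with hypothetical variable names, is to store the winsorized version in a new variable rather than overwriting the original:

\begin{verbatim}
* Winsorize a constructed outcome at the 99th percentile (hypothetical names)
  summarize total_consumption, detail
  local p99 = r(p99)

  generate total_consumption_w99 = total_consumption   // the original is kept as is
  replace  total_consumption_w99 = `p99' ///
      if total_consumption > `p99' & !missing(total_consumption)
  label variable total_consumption_w99 "Total consumption, winsorized at the 99th percentile"
\end{verbatim}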
Results can be sensitive to the treatment of outliers,
 so keeping the original variable in the data set will allow you to test how much it affects the estimates.
+More generally, create derived measures in new variables instead of overwriting the original information.
 % Outputs -----------------------------------------------------------------
@@ -344,6 +339,7 @@ \section{Indicator construction}
 are functionally-named variables.
 As you no longer need to worry about keeping variable names consistent with the survey,
 they should be as intuitive as possible.
+Remember to consider keeping related variables together and adding notes to each as necessary.
 % Documentation
 It is wise to start an explanatory guide as soon as you start making changes to the data.
@@ -379,18 +375,17 @@ \section{Writing data analysis code}
 This is when you are trying different things and looking for patterns in your data.
 It progresses into final analysis when your team starts to decide what the main results are, those that will make it into the research output.
 The way you deal with code and outputs for exploratory and final analysis is different, and this section will discuss how.
-
 % Organizing scripts ---------------------------------------------------------
 During exploratory data analysis, you will be tempted to write lots of analysis into one big, impressive, start-to-finish script.
 Though it's fine to write such a script during a long analysis meeting, this practice is error-prone.
-It subtly encourages poor practices such as not clearing the workspace or loading fresh data.
+It subtly encourages poor practices such as not clearing the workspace and not loading fresh data.
 It's important to take the time to organize scripts in a clean manner and to avoid mistakes.
 A well-organized analysis script starts with a completely fresh workspace and explicitly loads data before analyzing it.
 This encourages data manipulation to be done earlier in the workflow (that is, during construction).
-It also and prevents you from accidentally writing pieces of code that depend on one another, leading to the too-familiar ``run this part, then that part, then this part'' process.
+It also prevents you from accidentally writing pieces of analysis code that depend on one another, leading to the too-familiar ``run this part, then that part, then this part'' process.
 Each script should run completely independently of all other code.
-You can go as far as coding every output in a separate file.
+You can go as far as coding every output in a separate script.
 There is nothing wrong with code files being short and simple --
 as long as they directly correspond to specific pieces of analysis.
 Analysis files should be as simple as possible, so you can focus on the econometrics.
@@ -406,13 +401,16 @@ \section{Writing data analysis code}
 Just like you did with each of the analysis datasets,
 name each of the individual analysis files descriptively.
 Code files such as \path{spatial-diff-in-diff.do}, \path{matching-villages.do},
 and \path{summary-statistics.do} are clear indicators of what each file is doing,
 and allow you to find code quickly.
+If you intend to number the code files in the order they appear in a paper or report,
+leave this until near publication time.
 % Self-promotion ------------------------------------------------
 Our team has created a few products to automate common outputs and save you precious research time.
 The \texttt{ietoolkit} package includes two commands to export nicely formatted tables.
-\texttt{iebaltab}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Iebaltab}} creates and exports balance tables to excel and {\TeX}. \texttt{ieddtab}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Ieddtab}} does the same for difference-in-differences regressions.
+\texttt{iebaltab}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Iebaltab}} creates and exports balance tables to Excel and {\TeX}.
+\texttt{ieddtab}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Ieddtab}} does the same for difference-in-differences regressions.
 The \textbf{Stata Visual Library}\sidenote{ \url{https://worldbank.github.io/Stata-IE-Visual-Library/}}
-has examples of graphs created in Stata and curated by us.\sidenote{A similar resource for R is \href{https://www.r-graph-gallery.com/}{The R Graph Gallery}.}
+has examples of graphs created in Stata and curated by us.\sidenote{A similar resource for R is \textit{The R Graph Gallery}. \\\url{https://www.r-graph-gallery.com/}}
 \textbf{Data visualization}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_visualization}} \index{data visualization}
 is increasingly popular, but a great deal of it is lacking in quality.\cite{healy2018data,wilke2019fundamentals}
 We attribute some of this to the difficulty of writing code to create them.
@@ -434,11 +432,18 @@ \section{Writing data analysis code}
 but it is well worth reviewing the graphics manual.\sidenote{\url{https://www.stata.com/manuals/g.pdf}}
 For an easier way around it, Gray Kimbrough's \textit{Uncluttered Stata Graphs} code is an excellent default replacement for Stata graphics that is easy to install. \sidenote{\url{https://graykimbrough.github.io/uncluttered-stata-graphs/}}
-If you are a R user, the \textit{R Graphics Cookbook}\sidenote{\url{https://r-graphics.org/}} is a great resource for the most popular visualization package\texttt{ggplot}\sidenote{\url{https://ggplot2.tidyverse.org/}}. But there are a variety of other visualization packages, such as \href{http://jkunst.com/highcharter/}{\texttt{highcharter}}, \href{https://rstudio.github.io/r2d3/}{\texttt{r2d3}}, \href{https://rstudio.github.io/leaflet/}{leaflet}, and \href{https://plot.ly/r/}{plotly}, to name a few.
+If you are an R user, the \textit{R Graphics Cookbook}\sidenote{\url{https://r-graphics.org/}}
+is a great resource for the most popular visualization package \texttt{ggplot}\sidenote{\url{https://ggplot2.tidyverse.org/}}.
+But there are a variety of other visualization packages,
+such as \texttt{highcharter}\sidenote{\url{http://jkunst.com/highcharter/}},
+\texttt{r2d3}\sidenote{\url{https://rstudio.github.io/r2d3/}},
+\texttt{leaflet}\sidenote{\url{https://rstudio.github.io/leaflet/}},
+and \texttt{plotly}\sidenote{\url{https://plot.ly/r/}}, to name a few.
 We have no intention of creating an exhaustive list, and this one is certainly missing very good references.
 But at least it is a place to start.
 \section{Exporting analysis outputs}
+
 % Exploratory analysis
 It's OK not to export each and every table and graph created during exploratory analysis.
 Instead, we suggest writing them into markdown files using RMarkdown or the different dynamic document options available in Stata.
@@ -453,12 +458,19 @@
 and you want to do it as few times as possible.
 We cannot stress this enough: don't ever set a workflow that requires copying and pasting results from the console.
-There are numerous commands to export outputs from both R and Stata.\sidenote{Some examples are \href{ http://repec.sowi.unibe.ch/stata/estout/}{\texttt{estout}}, \href{https://www.princeton.edu/~otorres/Outreg2.pdf}{\texttt{outreg2}}, and \href{https://www.benjaminbdaniels.com/stata-code/outwrite/}{\texttt{outwrite}} in Stata, and \href{https://cran.r-project.org/web/packages/stargazer/vignettes/stargazer.pdf}{\texttt{stargazer}} and \href{https://ggplot2.tidyverse.org/reference/ggsave.html}{\texttt{ggsave}} in R.}
+There are numerous commands to export outputs from both R and Stata.\sidenote{Some examples are \href{ http://repec.sowi.unibe.ch/stata/estout/}{\texttt{estout}}, \href{https://www.princeton.edu/~otorres/Outreg2.pdf}{\texttt{outreg2}},
+and \href{https://www.benjaminbdaniels.com/stata-code/outwrite/}{\texttt{outwrite}} in Stata,
+and \href{https://cran.r-project.org/web/packages/stargazer/vignettes/stargazer.pdf}{\texttt{stargazer}}
+and \href{https://ggplot2.tidyverse.org/reference/ggsave.html}{\texttt{ggsave}} in R.}
 Save outputs in accessible and, whenever possible, lightweight formats.
 Accessible means that it's easy for other people to open them.
-In Stata, that would mean always using \texttt{graph export} to save images as \texttt{.jpg}, \texttt{.png}, \texttt{.pdf}, etc., instead of \texttt{graph save}, which creates a \texttt{.gph} file that can only be opened through a Stata installation.
+In Stata, that would mean always using \texttt{graph export} to save images as \texttt{.jpg}, \texttt{.png}, \texttt{.pdf}, etc.,
+instead of \texttt{graph save}, which creates a \texttt{.gph} file that can only be opened through a Stata installation.
+Some publications require ``lossless'' TIFF or EPS files, which are created by specifying the desired extension.
 For tables, \texttt{.tex} is preferred.
-Excel \texttt{.xlsx} and \texttt{.csv} files are also acceptable, although if you are working on a large report they will become cumbersome to update after revisions.
+Excel \texttt{.xlsx} and \texttt{.csv} files are also acceptable,
+although if you are working on a large report they will become cumbersome to update after revisions.
+Whichever format you decide to use, remember to always specify the file extension explicitly.
 % Formatting
 If you need to create a table with a very particular format, that is not automated by any command you know, consider writing it manually
@@ -469,7 +481,7 @@ \section{Exporting analysis outputs}
 % Output content
 Keep in mind that final outputs should be self-standing.
 This means they should be easy to read and understand with only the information they contain.
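As a small illustration of these points (the folder global, file names, and variables below are hypothetical), both the explicit file format and a self-standing note can be set at export time:

\begin{verbatim}
* Figure: accessible format, with the extension stated explicitly
  graph export "${outputs}/fig_consumption.png", replace width(2000)

* Table: regression results exported to .tex with a descriptive note
* (esttab is part of the estout package mentioned above)
  eststo clear
  eststo: regress total_consumption treatment, vce(cluster village_id)
  esttab using "${outputs}/tab_main_results.tex", replace se label ///
      addnotes("Sample: baseline households. Standard errors clustered by village.")
\end{verbatim}

Whatever commands you choose, the point is that the file on disk is always regenerated by the script, never edited by hand.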
-Make sure labels and notes cover all relevant information, such as sample, unit of observation, unit of measurement and variable definition.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Checklist:\_Reviewing\_Graphs} \\ \url{https://dimewiki.worldbank.org/wiki/Checklist:\_Submit\_Table}}
+Make sure labels and notes cover all relevant information, such as sample, unit of observation, unit of measurement and variable definition.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Checklist:\_Reviewing\_Graphs} \\ \url{https://dimewiki.worldbank.org/wiki/Checklist:\_Submit\_Table}}

From 6a7f30f4602f0527084ad157a6ea89a98cd9e1ee Mon Sep 17 00:00:00 2001
From: Luiza
Date: Tue, 14 Jan 2020 15:10:56 -0500
Subject: [PATCH 207/854] [ch6] Last note from Ben that was missing

---
 chapters/data-analysis.tex | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex
index 9cb38b835..aa07d6486 100644
--- a/chapters/data-analysis.tex
+++ b/chapters/data-analysis.tex
@@ -20,8 +20,7 @@
 Instead, this chapter will focus on how to organize your data work.
 Most of a Research Assistant's time is spent cleaning data and getting it into the right format.
 When the practices recommended here are adopted,
-it becomes much easier to analyze the data
-using commands that are already implemented in any statistical software.
+analyzing the data is as simple as using a command that is already implemented in statistical software.
 \end{fullwidth}

From 04783c81b8208799b031048df9c762d71b68ccaf Mon Sep 17 00:00:00 2001
From: Luiza
Date: Tue, 14 Jan 2020 15:29:22 -0500
Subject: [PATCH 208/854] [ch6] One more note from Ben

It said "clean up wording"
---
 chapters/data-analysis.tex | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex
index aa07d6486..c4ea28922 100644
--- a/chapters/data-analysis.tex
+++ b/chapters/data-analysis.tex
@@ -277,15 +277,13 @@ \section{Indicator construction}
 % From cleaning
 Construction is done separately from data cleaning for two reasons.
-First, so you have a clear cut of what was the data originally received,
-and what is the result of data processing decisions.
-Second, because if you have different data sources,
-say a baseline and an endline survey,
-unless the two instruments were exactly the same,
-the data cleaning will differ between the two,
-but you want to make sure that variable definition is consistent across sources.
-So you want to first merge the data sets and then create the variables only once.
-Therefore, unlike cleaning, construction can create many outputs from many inputs.
+The first is to clearly differentiate between the data originally collected and the results of data processing decisions.
+The second is to ensure that variable definitions are consistent across data sources.
+Unlike cleaning, construction can create many outputs from many inputs.
+Let's take the example of a project that has a baseline and an endline survey.
+Unless the two instruments are exactly the same, which is unlikely, the data cleaning for them will require different steps, and therefore will be done separately.
+However, you still want the constructed variables to be calculated in the same way, so they are comparable.
+So you want to construct indicators for both rounds in the same script, after merging them.
 % From analysis
 Data construction is never a finished process.
From 0fd7b5f129c0dacdfbe11d66fedf2a643894cb9b Mon Sep 17 00:00:00 2001
From: Benjamin Daniels
Date: Tue, 14 Jan 2020 16:07:05 -0500
Subject: [PATCH 209/854] Little fixes to compile

---
 chapters/data-collection.tex | 233 +++++++++++++++++------------------
 1 file changed, 116 insertions(+), 117 deletions(-)

diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex
index 768736807..898af6c5a 100644
--- a/chapters/data-collection.tex
+++ b/chapters/data-collection.tex
@@ -1,67 +1,67 @@
 %------------------------------------------------
 \begin{fullwidth}
-High quality research begins with a thoughtfully-designed, field-tested survey instrument, and a carefully supervised survey.
-Much of the recent push toward credibility in the social sciences has focused on analytical practices.
-We contest that credible research depends, first and foremost, on the quality of the raw data. This chapter covers the data generation workflow, from questionnaire design to field monitoring, for electronic data collection.
-There are many excellent resources on questionnaire design and field supervision,
-\sidenote{\url{Grosh, Margaret; Glewwe, Paul. 2000. Designing Household Survey Questionnaires for Developing Countries : Lessons from 15 Years of the Living Standards Measurement Study, Volume 3. Washington, DC: World Bank. © World Bank. https://openknowledge.worldbank.org/handle/10986/15195 License: CC BY 3.0 IGO.}}
+High quality research begins with a thoughtfully-designed, field-tested survey instrument, and a carefully supervised survey.
+Much of the recent push toward credibility in the social sciences has focused on analytical practices.
+We contend that credible research depends, first and foremost, on the quality of the raw data. This chapter covers the data generation workflow, from questionnaire design to field monitoring, for electronic data collection.
+There are many excellent resources on questionnaire design and field supervision,
+\sidenote{Grosh, Margaret; Glewwe, Paul. 2000. Designing Household Survey Questionnaires for Developing Countries : Lessons from 15 Years of the Living Standards Measurement Study, Volume 3. Washington, DC: World Bank. © World Bank. \url{https://openknowledge.worldbank.org/handle/10986/15195 License: CC BY 3.0 IGO.}}
 but few covering the particular challenges and opportunities presented by electronic surveys (often referred to as Computer Assisted Personal Interviews (CAPI)).
-\sidenote{\url{https://dimewiki.worldbank.org/wiki/Computer-Assisted_Personal_Interviews_(CAPI)}}
-As there are many electronic survey tools, we focus on workflows and primary concepts, rather than software-specific tools.
-The chapter covers questionnaire design, piloting and programming; monitoring data quality during the survey; and how to ensure confidential data is handled securely from collection to storage and sharing.
-
+\sidenote{\url{https://dimewiki.worldbank.org/wiki/Computer-Assisted\_Personal\_Interviews\_(CAPI)}}
+As there are many electronic survey tools, we focus on workflows and primary concepts, rather than software-specific tools.
+The chapter covers questionnaire design, piloting and programming; monitoring data quality during the survey; and how to ensure confidential data is handled securely from collection to storage and sharing.
+
 \end{fullwidth}

%------------------------------------------------
\section{Designing CAPI questionnaires}

A well-designed questionnaire results from careful planning, consideration of analysis and indicators, close review of existing questionnaires, survey pilots, and research team and stakeholder review.
Although most surveys are now collected electronically,
\textbf{questionnaire design}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Questionnaire_Design}} \index{questionnaire design} (content development) and questionnaire programming (functionality development) should be seen as two strictly separate tasks.
The research team should agree on all questionnaire content and design a paper version before programming a CAPI version.
This facilitates a focus on content during the design process and ensures teams have a readable, printable paper version of their questionnaire.
Most importantly, it means the research, not the technology, drives the questionnaire design.

An easy-to-read paper version of the questionnaire is particularly critical for training enumerators, so they can get an overview of the survey content and structure before diving into the programming.
It is much easier for enumerators to understand all possible response pathways from a paper version than from swiping question by question.
Finalizing the questionnaire before programming also avoids version control concerns that arise from concurrent work on paper and electronic survey instruments.
Finally, a paper questionnaire is important documentation for data publication.

The workflow for designing a questionnaire will feel much like writing an essay, or writing pseudocode: begin from broad concepts and slowly flesh out the specifics.
It is essential to start with a clear understanding of the
\textbf{theory of change} \sidenote{\url{https://dimewiki.worldbank.org/wiki/Theory_of_Change}} and \textbf{experimental design} for your project.
The first step of questionnaire design is to list key outcomes of interest, as well as the main covariates and variables needed for experimental design.
The ideal starting point for this is a \textbf{pre-analysis plan}. \sidenote{\url{https://dimewiki.worldbank.org/wiki/Pre-Analysis_Plan}}

Use the list of key outcomes to create an outline of questionnaire \textit{modules} (do not number the modules yet; instead use a short prefix so they can be easily reordered). For each module, determine whether it applies to the full sample, who the appropriate respondent is, and whether (or how often) the module should be repeated. A few examples: a module on maternal health only applies to households with a woman who has children, a household income module should be answered by the person responsible for household finances, and a module on agricultural production might be repeated for each crop the household cultivated.
Each module should then be expanded into specific indicators to observe in the field. \sidenote{\url{https://dimewiki.worldbank.org/wiki/Literature_Review_for_Questionnaire}}
At this point, it is useful to do a \textbf{content-focused pilot} \sidenote{\url{https://dimewiki.worldbank.org/wiki/Piloting_Survey_Content}} of the questionnaire.
Doing this pilot with a pen-and-paper questionnaire encourages more significant revisions, as there is no need to factor in costs of re-programming, and as a result improves the overall quality of the survey instrument.

\subsection{Questionnaire design for quantitative analysis}

This book covers surveys designed to yield datasets useful for quantitative analysis. This is a subset of surveys, and there are specific design considerations that will help to ensure the raw data outputs are ready for analysis.
-From a data perspective, questions with pre-coded response options are always preferable to open-ended questions (the content-based pilot is an excellent time to ask open-ended questions, and refine responses for the final version of the questionnaire). Coding responses helps to ensure that the data will be useful for quantitative analysis. Two examples help illustrate the point. First, instead of asking ``How do you feel about the proposed policy change?'', use techniques like +From a data perspective, questions with pre-coded response options are always preferable to open-ended questions (the content-based pilot is an excellent time to ask open-ended questions, and refine responses for the final version of the questionnaire). Coding responses helps to ensure that the data will be useful for quantitative analysis. Two examples help illustrate the point. First, instead of asking ``How do you feel about the proposed policy change?'', use techniques like \textbf{Likert scales}\sidenote{\textbf{Likert scale:} an ordered selection of choices indicating the respondent's level of agreement or disagreement with a proposed statement.}. Second, if collecting data on medication use or supplies, you could collect: the brand name of the product; the generic name of the product; the coded compound of the product; or the broad category to which each product belongs (antibiotic, etc.). All four may be useful for different reasons, but the latter two are likely to be the most useful for data analysis. The coded compound requires providing a translation dictionary to field staff, but enables automated rapid recoding for analysis with no loss of information. The generic class requires agreement on the broad categories of interest, but allows for much more comprehensible top-line statistics and data quality checks. There is debate over how individual questions should be identified: formats like \texttt{hq\_1} are hard to remember and unpleasant to reorder, but formats like \texttt{hq\_asked\_about\_loans} quickly become cumbersome. -We recommend using descriptive names with clear prefixes so that variables within a module stay together when sorted alphabetically. {\sidenote{\url{https://medium.com/@janschenk/variable-names-in-survey-research-a18429d2d4d8}} Variable names should never include spaces or mixed cases (all lower case is best). Take care with the length: very long names will be cut off in certain software, which could result in a loss of uniqueness. We discourage explicit question numbering, as it discourages re-ordering, which is a common recommended change after the pilot. In the case of follow-up surveys, numbering can quickly become convoluted, too often resulting in variables names like 'ag_15a', 'ag_15_new', ag_15_fup2', etc. +We recommend using descriptive names with clear prefixes so that variables within a module stay together when sorted alphabetically.\sidenote{\url{https://medium.com/@janschenk/variable-names-in-survey-research-a18429d2d4d8}} Variable names should never include spaces or mixed cases (all lower case is best). Take care with the length: very long names will be cut off in certain software, which could result in a loss of uniqueness. We discourage explicit question numbering, as it discourages re-ordering, which is a common recommended change after the pilot. In the case of follow-up surveys, numbering can quickly become convoluted, too often resulting in variables names like 'ag_15a', 'ag_15_new', ag_15_fup2', etc. 
-Questionnaires must include ways to document the reasons for \textbf{attrition}, treatment \textbf{contamination}, and \textbf{loss to follow-up}. -\index{attrition}\index{contamination} +Questionnaires must include ways to document the reasons for \textbf{attrition}, treatment \textbf{contamination}, and \textbf{loss to follow-up}. +\index{attrition}\index{contamination} These are essential data components for completing CONSORT records, a standardized system for reporting enrollment, intervention allocation, follow-up, and data analysis through the phases of a randomized trial of two groups. \sidenote[][-3.5cm]{Begg, C., Cho, M., Eastwood, S., Horton, R., Moher, D., Olkin, I., Pitkin, R., Rennie, D., Schulz, K. F., Simel, D., et al. (1996). Improving the quality of reporting of randomized controlled trials: The CONSORT statement. \textit{JAMA}, 276(8):637--639} \subsection{Content-focused Pilot} -A \textbf{Survey Pilot} \sidenote{\url{https://dimewiki.worldbank.org/wiki/Survey_Pilot}} is the final step of questionnaire design. -A Content-focused Pilot \sidenote{\url{https://dimewiki.worldbank.org/wiki/Piloting_Survey_Content}} is best done on pen-and-paper, before the questionnaire is programmed. -The objective is to improve the structure and length of the questionnaire, refine the phrasing and translation of specific questions, and confirm coded response options are exhaustive.\sidenote{\url{https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Content)&printable=yes}} +A \textbf{Survey Pilot} \sidenote{\url{https://dimewiki.worldbank.org/wiki/Survey_Pilot}} is the final step of questionnaire design. +A Content-focused Pilot \sidenote{\url{https://dimewiki.worldbank.org/wiki/Piloting_Survey_Content}} is best done on pen-and-paper, before the questionnaire is programmed. +The objective is to improve the structure and length of the questionnaire, refine the phrasing and translation of specific questions, and confirm coded response options are exhaustive.\sidenote{\url{https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Content)&printable=yes}} In addition, it is an opportunity to test and refine all survey protocols, such as how units will be sampled or pre-selected units identified. The pilot must be done out-of-sample, but in a context as similar as possible to the study sample. Once the content of the questionnaire is finalized and translated, it is time to proceed with programming the electronic survey instrument. @@ -69,55 +69,55 @@ \subsection{Content-focused Pilot} %------------------------------------------------ \section{Programming CAPI questionnaires} -Electronic data collection has great potential to simplify survey implementation and improve data quality. +Electronic data collection has great potential to simplify survey implementation and improve data quality. Electronic questionnaires are typically created in a spreadsheet (e.g. Excel or Google Sheets) or software-specific form builder, accessible even to novice users. \sidenote{\url{https://dimewiki.worldbank.org/wiki/Questionnaire_Programming}} -We will not address software-specific form design in this book; rather, we focus on coding conventions that are important to follow for CAPI regardless of software choice. 
-\sidenote{\url{https://dimewiki.worldbank.org/wiki/SurveyCTO_Coding_Practices}} +We will not address software-specific form design in this book; rather, we focus on coding conventions that are important to follow for CAPI regardless of software choice. +\sidenote{\url{https://dimewiki.worldbank.org/wiki/SurveyCTO_Coding_Practices}} -CAPI software tools provide a wide range of features designed to make implementing even highly complex surveys easy, scalable, and secure. -However, these are not fully automatic: you still need to actively design and manage the survey. -Here, we discuss specific practices that you need to follow to take advantage of CAPI features and ensure that the exported data is compatible with the software that will be used for analysis, and the importance of a data-focused pilot. +CAPI software tools provide a wide range of features designed to make implementing even highly complex surveys easy, scalable, and secure. +However, these are not fully automatic: you still need to actively design and manage the survey. +Here, we discuss specific practices that you need to follow to take advantage of CAPI features and ensure that the exported data is compatible with the software that will be used for analysis, and the importance of a data-focused pilot. \subsection{CAPI workflow} -The starting point for questionnaire programming is a complete paper version of the questionnaire, piloted for content, and translated where needed. -Starting the programming at this point reduces version control issues that arise from making significant changes to concurrent paper and electronic survey instruments. -When you start programming, do not start with the first question and program your way to the last question. -Instead, code from high level to small detail, following the same questionnaire outline established at design phase. -The outline provides the basis for pseudocode, allowing you to start with high level structure and work down to the level of individual questions. This will save time and reduce errors. +The starting point for questionnaire programming is a complete paper version of the questionnaire, piloted for content, and translated where needed. +Starting the programming at this point reduces version control issues that arise from making significant changes to concurrent paper and electronic survey instruments. +When you start programming, do not start with the first question and program your way to the last question. +Instead, code from high level to small detail, following the same questionnaire outline established at design phase. +The outline provides the basis for pseudocode, allowing you to start with high level structure and work down to the level of individual questions. This will save time and reduce errors. \subsection{CAPI features} -CAPI surveys are more than simply an electronic version of a paper questionnaire. -All common CAPI software allow you to automate survey logic and add in hard and soft constraints on survey responses. -These features make enumerators' work easier, and they create the opportunity to identify and resolve data issues in real-time, simplifying data cleaning and improving response quality. -Well-programmed questionnaires should include most or all of the following features: +CAPI surveys are more than simply an electronic version of a paper questionnaire. +All common CAPI software allow you to automate survey logic and add in hard and soft constraints on survey responses. 
+These features make enumerators' work easier, and they create the opportunity to identify and resolve data issues in real-time, simplifying data cleaning and improving response quality. +Well-programmed questionnaires should include most or all of the following features: \begin{itemize} \item{\textbf{Survey logic}}: build all skip patterns into the survey instrument, to ensure that only relevant questions are asked. This covers simple skip codes, as well as more complex interdependencies (e.g., a child health module is only asked to households that report the presence of a child under 5) \item{\textbf{Range checks}}: add range checks for all numeric variables to catch data entry mistakes (e.g. age must be less than 120) \item{\textbf{Confirmation of key variables}}: require double entry of essential information (such as a contact phone number in a survey with planned phone follow-ups), with automatic validation that the two entries match \item{\textbf{Multimedia}}: electronic questionnaires facilitate collection of images, video, and geolocation data directly during the survey, using the camera and GPS built into the tablet or phone. - \item{\textbf{Preloaded data}}: data from previous rounds or related surveys can be used to prepopulate certain sections of the questionnaire, and validated during the interview. - \item{\textbf{Filtered response options}}: filters reduce the number of response options dynamically (e.g. filtering the cities list based on the state provided). - \item{\textbf{Location checks}}: enumerators submit their actual location using in-built GPS, to confirm they are in the right place for the interview. - \item{\textbf{Consistency checks}}: check that answers to related questions align, and trigger a warning if not so that enumerators can probe further. For example, if a household reports producing 800 kg of maize, but selling 900 kg of maize from their own production. - \item{\textbf{Calculations}}: make the electronic survey instrument do all math, rather than relying on the enumerator or asking them to carry a calculator. + \item{\textbf{Preloaded data}}: data from previous rounds or related surveys can be used to prepopulate certain sections of the questionnaire, and validated during the interview. + \item{\textbf{Filtered response options}}: filters reduce the number of response options dynamically (e.g. filtering the cities list based on the state provided). + \item{\textbf{Location checks}}: enumerators submit their actual location using in-built GPS, to confirm they are in the right place for the interview. + \item{\textbf{Consistency checks}}: check that answers to related questions align, and trigger a warning if not so that enumerators can probe further. For example, if a household reports producing 800 kg of maize, but selling 900 kg of maize from their own production. + \item{\textbf{Calculations}}: make the electronic survey instrument do all math, rather than relying on the enumerator or asking them to carry a calculator. \end{itemize} \subsection{Compatibility with analysis software} -All CAPI software include debugging and test options to correct syntax errors and make sure that the survey instruments will successfully compile. +All CAPI software include debugging and test options to correct syntax errors and make sure that the survey instruments will successfully compile. This is not sufficient, however, to ensure that the resulting dataset will load without errors in your data analysis software of choice. 
We developed the \texttt{ietestform} command,\sidenote{\url{https://dimewiki.worldbank.org/wiki/ietestform}} part of \texttt{iefieldkit}, to implement a form-checking routine for \textbf{SurveyCTO}, a proprietary implementation of \textbf{Open Data Kit (ODK)}. Intended for use during questionnaire programming and before fieldwork, ietestform tests for best practices in coding, naming and labeling, and choice lists. -Although \texttt{ietestform} is software-specific, many of the tests it runs are general and important to consider regardless of software choice. -To give a few examples, ietestform tests that no variable names exceed 32 characters, the limit in Stata (variable names that exceed that limit will be truncated, and as a result may no longer be unique). It checks whether ranges are included for numeric variables. -\texttt{ietestform} also removes all leading and trailing blanks from response lists, which could be handled inconsistently across software. +Although \texttt{ietestform} is software-specific, many of the tests it runs are general and important to consider regardless of software choice. +To give a few examples, ietestform tests that no variable names exceed 32 characters, the limit in Stata (variable names that exceed that limit will be truncated, and as a result may no longer be unique). It checks whether ranges are included for numeric variables. +\texttt{ietestform} also removes all leading and trailing blanks from response lists, which could be handled inconsistently across software. \subsection{Data-focused Pilot} -The final stage of questionnaire programming is another Survey Pilot. -The objective of the Data-focused Pilot \sidenote{\url{https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Data)&printable=yes}} is to validate programming and export a sample dataset. -Significant desk-testing of the instrument is required to debug the programming as fully as possible before going to the field. -It is important to plan for multiple days of piloting, so that any further debugging or other revisions to the electronic survey instrument can be made at the end of each day and tested the following, until no further field errors arise. +The final stage of questionnaire programming is another Survey Pilot. +The objective of the Data-focused Pilot \sidenote{\url{https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Data)&printable=yes}} is to validate programming and export a sample dataset. +Significant desk-testing of the instrument is required to debug the programming as fully as possible before going to the field. +It is important to plan for multiple days of piloting, so that any further debugging or other revisions to the electronic survey instrument can be made at the end of each day and tested the following, until no further field errors arise. The Data-focused pilot should be done in advance of Enumerator training @@ -125,29 +125,29 @@ \subsection{Data-focused Pilot} %------------------------------------------------ \section{Data quality assurance} A huge advantage of CAPI surveys, compared to traditional paper surveys, is the ability to access and analyze the data while the survey is ongoing. -Data issues can be identified and resolved in real-time. Designing systematic data checks, and running them routinely throughout data collection, simplifies monitoring and improves data quality. 
-As part of survey preparation, the research team should develop a \textbf{data quality assurance plan} \sidenote{\url{https://dimewiki.worldbank.org/wiki/Data_Quality_Assurance_Plan}}. -While data collection is ongoing, a research assistant or data analyst should work closely with the field team to ensure that the survey is progressing correctly, and perform \textbf{high-frequency checks (HFCs)} of the incoming data. +Data issues can be identified and resolved in real-time. Designing systematic data checks, and running them routinely throughout data collection, simplifies monitoring and improves data quality. +As part of survey preparation, the research team should develop a \textbf{data quality assurance plan} \sidenote{\url{https://dimewiki.worldbank.org/wiki/Data_Quality_Assurance_Plan}}. +While data collection is ongoing, a research assistant or data analyst should work closely with the field team to ensure that the survey is progressing correctly, and perform \textbf{high-frequency checks (HFCs)} of the incoming data. \sidenote{\url{https://github.com/PovertyAction/high-frequency-checks/wiki}} -Data quality assurance requires a combination of both real-time data checks, survey audits, and field monitoring. Although field monitoring is critical for a successful survey, we focus on the first two in this chapter, as they are the most directly data related. +Data quality assurance requires a combination of real-time data checks, survey audits, and field monitoring. Although field monitoring is critical for a successful survey, we focus on the first two in this chapter, as they are the most directly data related. \subsection{High Frequency Checks} -High-frequency checks should carefully inspect key treatment and outcome variables so that the data quality of core experimental variables is uniformly high, and that additional field effort is centered where it is most important. -Data quality checks should be run on the data every time it is downloaded (ideally on a daily basis), to flag irregularities in survey progress, sample completeness or response quality. \texttt{ipacheck}\sidenote{\url{https://github.com/PovertyAction/high-frequency-checks}} +High-frequency checks should carefully inspect key treatment and outcome variables so that the data quality of core experimental variables is uniformly high, and so that additional field effort is centered where it is most important. +Data quality checks should be run on the data every time it is downloaded (ideally on a daily basis), to flag irregularities in survey progress, sample completeness, or response quality. \texttt{ipacheck}\sidenote{\url{https://github.com/PovertyAction/high-frequency-checks}} is a very useful command that automates some of these tasks. -It is important to check every day that the households interviewed match the survey sample. -Many CAPI software programs include case management features, through which sampled units are directly assigned to individual enumerators. -Even with careful management, it is often the case that raw data includes duplicate entries, which may occur due to field errors or duplicated submissions to the server. +It is important to check every day that the households interviewed match the survey sample. +Many CAPI software programs include case management features, through which sampled units are directly assigned to individual enumerators.
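As an illustration of this kind of daily check, a minimal Stata sketch follows; the file paths and the variables \texttt{hhid} and \texttt{submissiondate} are hypothetical placeholders, and the exact workflow will depend on your survey software and folder structure.
\begin{verbatim}
* Load the most recent download of the raw survey data
use "DataWork/raw/survey_data.dta", clear

* Flag duplicate submissions of the same household ID
duplicates tag hhid, generate(dup)
list hhid submissiondate if dup > 0   // resolve these with the field team

* Once IDs are unique, compare submissions against the master sample list
merge 1:1 hhid using "DataWork/master/sample_list.dta"
tab _merge   // 1 = interview not in sample; 2 = sampled unit not yet interviewed
\end{verbatim}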
+Even with careful management, it is often the case that raw data includes duplicate entries, which may occur due to field errors or duplicated submissions to the server. \sidenote{\url{https://dimewiki.worldbank.org/wiki/Duplicates_and_Survey_Logs}} \texttt{ieduplicates}\sidenote{\url{https://dimewiki.worldbank.org/wiki/ieduplicates}} provides a workflow for collaborating on the resolution of duplicate entries between you and the field team. Next, observed units in the data must be validated against the expected sample: this is as straightforward as \texttt{merging} the sample list with the survey data and checking for mismatches. Reporting errors and duplicate observations in real-time allows the field team to make corrections efficiently. -Tracking survey progress is important for monitoring attrition, so that it is clear early on if a change in protocols or additional tracking will be needed. -It is also important to check interview completion rate and sample compliance by surveyor and survey team, to identify any under-performing individuals or teams. +Tracking survey progress is important for monitoring attrition, so that it is clear early on if a change in protocols or additional tracking will be needed. +It is also important to check interview completion rate and sample compliance by surveyor and survey team, to identify any under-performing individuals or teams. When all data collection is complete, the survey team should prepare a final field report, which should report reasons for any deviations between the original sample and the dataset collected. Identification and reporting of \textbf{missing data} and \textbf{attrition} is critical to the interpretation of survey data. @@ -157,23 +157,23 @@ \subsection{High Frequency Checks} This information should be stored as a dataset in its own right -- a \textbf{tracking dataset} -- that records all events in which survey substitutions and loss to follow-up occurred in the field and how they were implemented and resolved. -High frequency checks should also include survey-specific data checks. As CAPI software incorporates many data control features, discussed above, these checks should focus on issues CAPI software cannot check automatically. As most of these checks are survey specific, it is difficult to provide general guidance. An in-depth knowledge of the questionnaire, and a careful examination of the pre-analysis plan, is the best preparation.Examples include consistency across multiple responses, complex calculations suspicious patterns in survey timing or atypical response patters from specific enumerators. \sidenote{\url{https://dimewiki.worldbank.org/wiki/Monitoring_Data_Quality}} CAPI software typically provides rich metadata, which can be useful in assessing interview quality. For example, automatically collected time stamps show how long enumerators spent per question, and trace histories show how many times answers were changed before the survey was submitted. +High frequency checks should also include survey-specific data checks. As CAPI software incorporates many data control features, discussed above, these checks should focus on issues CAPI software cannot check automatically. As most of these checks are survey specific, it is difficult to provide general guidance. 
An in-depth knowledge of the questionnaire, and a careful examination of the pre-analysis plan, is the best preparation. Examples include consistency across multiple responses, complex calculations, suspicious patterns in survey timing, or atypical response patterns from specific enumerators. \sidenote{\url{https://dimewiki.worldbank.org/wiki/Monitoring_Data_Quality}} CAPI software typically provides rich metadata, which can be useful in assessing interview quality. For example, automatically collected time stamps show how long enumerators spent per question, and trace histories show how many times answers were changed before the survey was submitted. \subsection{Data considerations for field monitoring} -Careful monitoring of field work is essential for high quality data. -\textbf{Back-checks}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Back_Checks}} and other survey audits help ensure that enumerators are following established protocols, and are not falsifying data. -For back-checks, a random subset of the field sample is chosen and a subset of information from the full survey is verified through a brief interview with the original respondent. -Design of the backcheck questionnaire follows the same survey design principles discussed above, in particular you should use the pre-analysis plan or list of key outcomes to establish which subset of variables to prioritize. - -Real-time access to the survey data increases the potential utility of backchecks dramatically, and both simplifies and improves the rigor of related workflows. -You can use the raw data to draw the backcheck sample; assuring it is appropriately apportioned across interviews and survey teams. -As soon as backchecks are complete, the backcheck data can be tested against the original data to identify areas of concern in real-time. +Careful monitoring of field work is essential for high quality data. +\textbf{Back-checks}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Back_Checks}} and other survey audits help ensure that enumerators are following established protocols, and are not falsifying data. +For back-checks, a random subset of the field sample is chosen and a subset of information from the full survey is verified through a brief interview with the original respondent. +Design of the backcheck questionnaire follows the same survey design principles discussed above; in particular, you should use the pre-analysis plan or list of key outcomes to establish which subset of variables to prioritize. + +Real-time access to the survey data increases the potential utility of backchecks dramatically, and both simplifies and improves the rigor of related workflows. +You can use the raw data to draw the backcheck sample, ensuring it is appropriately apportioned across interviews and survey teams. +As soon as backchecks are complete, the backcheck data can be tested against the original data to identify areas of concern in real-time. The \texttt{bcstats} Stata module is a useful tool for analyzing back-check data. \sidenote{\url{https://ideas.repec.org/c/boc/bocode/s458173.html}} -CAPI surveys also provide a unique opportunity to do audits through audio recordings of the interview, typically short recordings triggered at random throughout the questionnaire. -\textbf{Audio audits} are a useful means to assess whether the enumerator is conducting the interview as expected (and not sitting under a tree making up data). -Do note, however, that audio audits must be included in the Informed Consent.
+CAPI surveys also provide a unique opportunity to do audits through audio recordings of the interview, typically short recordings triggered at random throughout the questionnaire. +\textbf{Audio audits} are a useful means to assess whether the enumerator is conducting the interview as expected (and not sitting under a tree making up data). +Do note, however, that audio audits must be included in the Informed Consent. \textcolor{red}{ \subsection{Dashboard} @@ -182,71 +182,70 @@ \subsection{Dashboard} %------------------------------------------------ \section{Collecting Data Securely} -Primary data collection almost always includes \textbf{personally-identifiable information (PII)} -\sidenote{\url{https://dimewiki.worldbank.org/wiki/Personally_Identifiable_Information_(PII)}}. -PII must be handled with great care at all points in the data collection and management process, to comply with ethical requirements and avoid breaches of confidentiality. Access to PII must be restricted to team members granted that permission by the applicable Institutional Review Board or a data licensing agreement with a partner agency. Research teams must maintain strict protocols for data security at each stage of the process: data collection, storage, and sharing. +Primary data collection almost always includes \textbf{personally-identifiable information (PII)} +\sidenote{\url{https://dimewiki.worldbank.org/wiki/Personally_Identifiable_Information_(PII)}}. +PII must be handled with great care at all points in the data collection and management process, to comply with ethical requirements and avoid breaches of confidentiality. Access to PII must be restricted to team members granted that permission by the applicable Institutional Review Board or a data licensing agreement with a partner agency. Research teams must maintain strict protocols for data security at each stage of the process: data collection, storage, and sharing. \subsection{Secure data in the field} All mainstream data collection software automatically \textbf{encrypt} -\sidenote{\textbf{Encryption:} the process of making information unreadable to anyone without access to a specific deciphering key.} +\sidenote{\textbf{Encryption:} the process of making information unreadable to anyone without access to a specific deciphering key.} \sidenote{\url{https://dimewiki.worldbank.org/wiki/Encryption}} -all data submitted from the field while in transit (i.e., upload or download). Your data will be encrypted from the time it leaves the device (in tablet-assisted data collation) or your browser (in web data collection), until it reaches the server. Therefore, as long as you are using an established CAPI software, this step is largely taken care of. However, the research team must ensure that all computers, tablets, and accounts used have a logon password and are never left unlocked. +all data submitted from the field while in transit (i.e., upload or download). Your data will be encrypted from the time it leaves the device (in tablet-assisted data collection) or your browser (in web data collection), until it reaches the server. Therefore, as long as you are using established CAPI software, this step is largely taken care of. However, the research team must ensure that all computers, tablets, and accounts used have a logon password and are never left unlocked. \subsection{Secure data storage} \textbf{Encryption at rest} is the only way to ensure that PII data remains private when it is stored on someone else's server on the open internet.
You must keep your data encrypted on the server whenever PII data is collected. -Encryption makes data files completely unusable without access to a security key specific to that data -- a higher level of security than password-protection. +Encryption makes data files completely unusable without access to a security key specific to that data -- a higher level of security than password-protection. Encryption at rest requires active participation from the user, and you should be fully aware that if your private key is lost, there is absolutely no way to recover your data. -You should not assume that your data is encrypted by default: indeed, for most CAPI software platforms, encryption needs to be enabled by the user. +You should not assume that your data is encrypted by default: indeed, for most CAPI software platforms, encryption needs to be enabled by the user. To enable it, you must confirm you know how to operate the encryption system and understand the consequences if basic protocols are not followed. When you enable encryption, the service will allow you to download -- once -- the keyfile pair needed to decrypt the data. You must download and store this in a secure location, such as a password manager. Make sure you store keyfiles with descriptive names to match the survey to which they correspond. -Any time anyone accesses the data - either when viewing it in the browser or downloading it to your computer - they will be asked to provide the keyfile. -Only project teams members names in the IRB are allowed access to the private keyfile. +Any time anyone accesses the data -- either when viewing it in the browser or downloading it to your computer -- they will be asked to provide the keyfile. +Only project team members named in the IRB are allowed access to the private keyfile. -To proceed with data analysis, you typically need a working copy of the data accessible from a personal computer. The following workflow allows you to receive data from the server and store it securely, without compromising data security. +To proceed with data analysis, you typically need a working copy of the data accessible from a personal computer. The following workflow allows you to receive data from the server and store it securely, without compromising data security. \begin{enumerate} \item Download data \item Store a ``master'' copy of the data into an encrypted location that will remain accessible on disk and be regularly backed up \item Secure a ``gold master'' copy of the raw data in a secure location, such as a long-term cloud storage service or an encrypted physical hard drive stored in a separate location. If you remain lucky, you will never have to access this copy -- you just want to know it is out there, safe, if you need it. - + \end{enumerate} -This handling satisfies the rule of three: there are two on-site copies of the data and one off-site copy, so the data can never be lost in case of hardware failure. +This handling satisfies the rule of three: there are two on-site copies of the data and one off-site copy, so the data can never be lost in case of hardware failure. In addition, you should ensure that all teams take basic precautions to protect the security of data, as most problems are due to human error. -Ideally, the machine hard drives themselves should also be encrypted, as well as any external hard drives or flash drives used. +Ideally, the machine hard drives themselves should also be encrypted, as well as any external hard drives or flash drives used.
All files sent to the field containing PII data, such as sampling lists, should at least be password protected. This can be done using a zip-file creator. -You must never share passwords by email; rather, use a secure password manager. +You must never share passwords by email; rather, use a secure password manager. This significantly mitigates the risk in case there is a security breach such as loss, theft, hacking, or a virus, with little impact on day-to-day utilization. \subsection{Secure data sharing} -To simplify workflow, it is best to remove PII variables from your data at the earliest possible opportunity, and save a de-identified copy of the data. -Once the data is de-identified, it no longer needs to be encrypted - therefore you can interact with it directly, without having to provide the keyfile. +To simplify workflow, it is best to remove PII variables from your data at the earliest possible opportunity, and save a de-identified copy of the data. +Once the data is de-identified, it no longer needs to be encrypted -- therefore you can interact with it directly, without having to provide the keyfile. -We recommend de-identification in two stages: an initial process to remove direct identifiers to create a working de-identified dataset, and a final process to remove all possible identifiers to create a publishable dataset. +We recommend de-identification in two stages: an initial process to remove direct identifiers to create a working de-identified dataset, and a final process to remove all possible identifiers to create a publishable dataset. The \textbf{initial de-identification} should happen directly after the encrypted data is downloaded to disk. At this time, for each variable that contains PII, ask: will this variable be needed for analysis? -If not, the variable should be dropped. Examples include respondent names, enumerator names, interview date, respondent phone number. -If the variable is needed for analysis, ask: can I encode or otherwise construct a variable to use for the analysis that masks the PII, and drop the original variable? -Examples include: geocoordinates - after construction measures of distance or area, the specific location is often not necessary; and names for social network analysis, which can be encoded to unique numeric IDs. -If PII variables are directly required for the analysis itself, it will be necessary to keep at least a subset of the data encrypted through the data analysis process. +If not, the variable should be dropped. Examples include respondent names, enumerator names, interview dates, and respondent phone numbers. +If the variable is needed for analysis, ask: can I encode or otherwise construct a variable to use for the analysis that masks the PII, and drop the original variable? +Examples include geocoordinates -- after constructing measures of distance or area, the specific location is often not necessary -- and names for social network analysis, which can be encoded to unique numeric IDs. +If PII variables are directly required for the analysis itself, it will be necessary to keep at least a subset of the data encrypted through the data analysis process. -Flagging all potentially identifying variables in the questionnaire design stage, as recommended above, simplifies the initial de-identification. -You already have the list of variables to assess, and ideally have already assessed those against the pre-analysis plan. +Flagging all potentially identifying variables in the questionnaire design stage, as recommended above, simplifies the initial de-identification. +You already have the list of variables to assess, and ideally have already assessed those against the pre-analysis plan.
-If so, all you need to do is write a script to drop the variables that are not required for analysis, encode or otherwise mask those that are required, and save a working version of the data. +Flagging all potentially identifying variables in the questionnaire design stage, as recommended above, simplifies the initial de-identification. +You already have the list of variables to assess, and ideally have already assessed those against the pre-analysis plan. +If so, all you need to do is write a script to drop the variables that are not required for analysis, encode or otherwise mask those that are required, and save a working version of the data. -The \textbf{final de-identification} is a more involved process, with the objective of creating a dataset for publication that cannot be manipulated or linked to identify any individual research participant. +The \textbf{final de-identification} is a more involved process, with the objective of creating a dataset for publication that cannot be manipulated or linked to identify any individual research participant. You must remove all direct and indirect identifiers, and assess the risk of statistical disclosure. \sidenote{Disclosure risk: the likelihood that a released data record can be associated with an individual or organization}. \index{statistical disclosure} -There will almost always be a trade-off between accuracy and privacy. For publicly disclosed data, you should always favor privacy. -There are a number of useful tools for de-identification: PII scanners for Stata \sidenote{\url{https://github.com/J-PAL/stata_PII_scan}} or R \sidenote{\url{https://github.com/J-PAL/PII-Scan}}, -and tools for statistical disclosure control. \sidenote{\url{https://sdcpractice.readthedocs.io/en/latest/#}} -In cases where PII data is required for analysis, we recommend embargoing the sensitive variables when publishing the data. -Access to the embargoed data could be granted for the purposes of study replication, if approved by an IRB. +There will almost always be a trade-off between accuracy and privacy. For publicly disclosed data, you should always favor privacy. +There are a number of useful tools for de-identification: PII scanners for Stata \sidenote{\url{https://github.com/J-PAL/stata_PII_scan}} or R \sidenote{\url{https://github.com/J-PAL/PII-Scan}}, +and tools for statistical disclosure control. \sidenote{\url{https://sdcpractice.readthedocs.io/en/latest/}} +In cases where PII data is required for analysis, we recommend embargoing the sensitive variables when publishing the data. +Access to the embargoed data could be granted for the purposes of study replication, if approved by an IRB. With the raw data securely stored and backed up, and a de-identified dataset to work with, you are ready to move to data cleaning, and analysis. 
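To illustrate the initial de-identification script described above, here is a minimal Stata sketch; the file paths and variable names (\texttt{respondent\_name}, \texttt{phone\_number}, \texttt{village\_name}, and so on) are hypothetical placeholders rather than part of any particular survey.
\begin{verbatim}
* Load the raw data from the encrypted folder
use "DataWork/EncryptedData/raw_survey.dta", clear

* Drop direct identifiers that are not needed for analysis
drop respondent_name enumerator_name phone_number gps_latitude gps_longitude

* Mask identifiers that are needed for analysis:
* assign arbitrary numeric codes to villages without storing the names
egen village_id = group(village_name)
drop village_name

* Save a working, de-identified copy outside the encrypted folder
save "DataWork/DeidentifiedData/survey_deidentified.dta", replace
\end{verbatim}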
%------------------------------------------------ - From 3f9cfe05b868d63e26c9575df680fa5e80bdb8f8 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Tue, 14 Jan 2020 18:15:46 -0500 Subject: [PATCH 210/854] [ch 5] escape underscores --- chapters/data-collection.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 898af6c5a..915ce9287 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -49,7 +49,7 @@ \subsection{Questionnaire design for quantitative analysis} \textbf{Likert scales}\sidenote{\textbf{Likert scale:} an ordered selection of choices indicating the respondent's level of agreement or disagreement with a proposed statement.}. Second, if collecting data on medication use or supplies, you could collect: the brand name of the product; the generic name of the product; the coded compound of the product; or the broad category to which each product belongs (antibiotic, etc.). All four may be useful for different reasons, but the latter two are likely to be the most useful for data analysis. The coded compound requires providing a translation dictionary to field staff, but enables automated rapid recoding for analysis with no loss of information. The generic class requires agreement on the broad categories of interest, but allows for much more comprehensible top-line statistics and data quality checks. There is debate over how individual questions should be identified: formats like \texttt{hq\_1} are hard to remember and unpleasant to reorder, but formats like \texttt{hq\_asked\_about\_loans} quickly become cumbersome. -We recommend using descriptive names with clear prefixes so that variables within a module stay together when sorted alphabetically.\sidenote{\url{https://medium.com/@janschenk/variable-names-in-survey-research-a18429d2d4d8}} Variable names should never include spaces or mixed cases (all lower case is best). Take care with the length: very long names will be cut off in certain software, which could result in a loss of uniqueness. We discourage explicit question numbering, as it discourages re-ordering, which is a common recommended change after the pilot. In the case of follow-up surveys, numbering can quickly become convoluted, too often resulting in variables names like 'ag_15a', 'ag_15_new', ag_15_fup2', etc. +We recommend using descriptive names with clear prefixes so that variables within a module stay together when sorted alphabetically.\sidenote{\url{https://medium.com/@janschenk/variable-names-in-survey-research-a18429d2d4d8}} Variable names should never include spaces or mixed cases (all lower case is best). Take care with the length: very long names will be cut off in certain software, which could result in a loss of uniqueness. We discourage explicit question numbering, as it complicates re-ordering, which is a commonly recommended change after the pilot. In the case of follow-up surveys, numbering can quickly become convoluted, too often resulting in variable names like 'ag\_15a', 'ag\_15\_new', 'ag\_15\_fup2', etc. Questionnaires must include ways to document the reasons for \textbf{attrition}, treatment \textbf{contamination}, and \textbf{loss to follow-up}.
\index{attrition}\index{contamination} From 4d3e84a3eb323dfa0c0e9e25cbdae7f7cec8f4f0 Mon Sep 17 00:00:00 2001 From: Luiza Date: Wed, 15 Jan 2020 09:49:51 -0500 Subject: [PATCH 211/854] [ch4] adding changes from #316 --- chapters/sampling-randomization-power.tex | 26 +++++++++++------------ 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index d4c861e6e..de449e8ec 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -7,14 +7,14 @@ and what their status will be for the purpose of effect estimation. Since we only get one chance to implement a given experiment, we need to have a detailed understanding of how these processes work -and how to implment them properly. +and how to implement them properly. This allows us to ensure the field reality corresponds well to our experimental design. In quasi-experimental methods, sampling determines what populations the study will be able to make meaningful inferences about, and randomization analyses simulate counterfactual possibilities if the events being studied had happened differently. -These needs are particularly important in the intial phases of development studies -- +These needs are particularly important in the initial phases of development studies -- typically conducted well before any actual fieldwork occurs -- and often have implications for planning and budgeting. @@ -173,15 +173,15 @@ \section{Sampling and randomization} \subsection{Sampling} \textbf{Sampling} is the process of randomly selecting units of observation -from a master list for data collection.\sidenote{ +from a master list of individuals to be surveyed for data collection.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Sampling_\%26_Power_Calculations}} \index{sampling} That master list may be called a \textbf{sampling universe}, a \textbf{listing frame}, or something similar. -We refer to it as a \textbf{master data set}\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Master_Data_Set}} -because it is the authoritative source for the existence and fixed characteristics -of each of the units that may be surveyed.\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Unit_of_Observation}} +We recommend that this list be organized in a \textbf{master data set}\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Master_Data_Set}}, +creating an authoritative source for the existence and fixed +characteristics of each of the units that may be surveyed.\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Unit_of_Observation}} The master data set indicates how many individuals are eligible for data collection, and therefore contains statistical information about the likelihood that each will be chosen. @@ -424,10 +424,10 @@ \subsection{Power calculations} Furthermore, you should use real data whenever it is available, or you will have to make assumptions about the distribution of outcomes. -Using the concepts of minimum detectable effect -and minimum sample size in tandem can help answer a key question +Together, the concepts of minimum detectable effect +and minimum sample size can help answer a key question that typical approaches to power often do not. -Namely, they can help you determine what tradeoffs to make +Namely, they can help you determine what trade-offs to make in the design of your experiment. Can you support another treatment arm? 
Is it better to add another cluster, @@ -440,7 +440,7 @@ \subsection{Power calculations} than an output for reporting requirements. At the end of the day, you will probably have reduced the complexity of your experiment significantly. -For reporting purposes, such as grantwriting and registered reports, +For reporting purposes, such as grant writing and registered reports, simulation ensures you will have understood the key questions well enough to report standard measures of power once your design is decided. @@ -487,7 +487,7 @@ \subsection{Randomization inference} that your exact design is likely to produce. The range of these effects, again, may be very different from those predicted by standard approaches to power calculation, -and randomization inference futher allows visual inspection of results. +and randomization inference further allows visual inspection of results. If there is significant heaping at particular result levels, or if results seem to depend dramatically on the placement of a small number of individuals, randomization inference will flag those issues before the experiment is fielded From 5884fb92e57da908b1e2739ca1991c3876d3a984 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Wed, 15 Jan 2020 17:19:16 -0500 Subject: [PATCH 212/854] [ch 2] explain when \ instead of / cause errors No reason to not educate the reader in the specifics here --- chapters/planning-data-work.tex | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 4c2541357..941f53d23 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -115,8 +115,9 @@ \subsection{Setting up your computer} and typically use only A-Z (the 26 English characters), dashes (\texttt{-}), and underscores (\texttt{\_}) in folder names and filenames. You should \textit{always} use forward slashes (\texttt{/}) in file paths in code, -just like an internet address, and no matter how your computer provides them, -because the other type will cause your code to break on many systems. +just like an internet address, even if you are using a Windows machine where +both forward and backward slashes are allowed, as your code will otherwise break +if anyone tries to run it on a Mac or Linux machine. Making the structure of your directories a core part of your workflow is very important, since otherwise you will not be able to reliably transfer the instructions for replicating or carrying out your analytical work. From fbec440a9424fdec0f9969c7ab54de777ec717b0 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Thu, 16 Jan 2020 12:01:32 -0500 Subject: [PATCH 213/854] unless they have sound practices --- chapters/handling-data.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index b994055ef..118286ec9 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -32,7 +32,7 @@ Even more importantly, the only way to determine credibility without transparency is to judge research solely based on where it is done and by whom, which concentrates credibility at better-known international institutions and global universities, - at the expense of the people and organization directly involved in and affected by it. + at the expense of quality research done by people and organizations directly involved in and affected by it. 
Simple transparency standards mean that it is easier to judge research quality, and making high-quality research identifiable also increases its impact. This section provides some basic guidelines and resources From d550377e3f00a37709052939a8ca336c25723f40 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Thu, 16 Jan 2020 12:15:38 -0500 Subject: [PATCH 214/854] simply making analysis code and data available --- chapters/handling-data.tex | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 118286ec9..ba6a26175 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -219,7 +219,8 @@ \subsection{Research credibility} various practices were implemented.\cite{nosek2015promoting} With the ongoing rise of empirical research and increased public scrutiny of scientific evidence, -this is no longer enough to guarantee that findings will hold their credibility. +simply making analysis code and data available +is no longer sufficient on its own to guarantee that findings will hold their credibility. Even if your methods are highly precise, your evidence is only as good as your data -- and there are plenty of mistakes that can be made between From 8c98d2fc04dab58301511b60b721ec2f7e683b64 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Thu, 16 Jan 2020 12:18:48 -0500 Subject: [PATCH 215/854] Maintaining credibility --- chapters/handling-data.tex | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index ba6a26175..6c4eb420e 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -69,9 +69,10 @@ \section{Protecting confidence in development research} primary data that researchers use for such studies has never been reviewed by anyone else, so it is hard for others to verify that it was collected, handled, and analyzed appropriately.\sidenote{ \url{https://blogs.worldbank.org/impactevaluations/what-development-economists-talk-about-when-they-talk-about-reproducibility}} -Maintaining research quality standards via credibility, transparency, and reproducibility tools -is the most important way that researchers using primary data can avoid serious error, -and therefore these are not by-products but core components of research output. +Maintaining credibility in research via transparent and reproducible methods +is key for researchers to avoid serious errors. +This is even more important in research using primary data, +and therefore these are not byproducts but core components of research output. \subsection{Research reproducibility} From e0df2958287ef6fb0cef7e6f28e036401a936c91 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Thu, 16 Jan 2020 13:35:41 -0500 Subject: [PATCH 216/854] Re-identification --- chapters/handling-data.tex | 1 + 1 file changed, 1 insertion(+) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 6c4eb420e..cf83cea46 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -444,6 +444,7 @@ \subsection{De-identifying and anonymizing information} Note, however, that it is in practice impossible to \textbf{anonymize} data. There is always some statistical chance that an individual's identity will be re-linked to the data collected about them +-- even if that data has had all directly identifying information removed -- by using some other data that becomes identifying when analyzed together.
There are a number of tools developed to help researchers de-identify data and which you should use as appropriate at that stage of data collection. From e7cc091110e8b93cdc1b4f0dab384beca32cb537 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Thu, 16 Jan 2020 13:40:23 -0500 Subject: [PATCH 217/854] Split sentence for clarity --- chapters/planning-data-work.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 941f53d23..9dd2e3e79 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -115,7 +115,7 @@ \subsection{Setting up your computer} and typically use only A-Z (the 26 English characters), dashes (\texttt{-}), and underscores (\texttt{\_}) in folder names and filenames. You should \textit{always} use forward slashes (\texttt{/}) in file paths in code, -just like an internet address, even if you are using a Windows machine where +just like an internet address. Do this even if you are using a Windows machine where both forward and backward slashes are allowed, as your code will otherwise break if anyone tries to run it on a Mac or Linux machine. Making the structure of your directories a core part of your workflow is very important, @@ -559,7 +559,7 @@ \subsection{Documenting and organizing code} If you wait for a long time to have your code reviewed, and it gets too complex, preparation and code review will require more time and work, and that is usually the reason why this step is skipped. -One other important advantage of code review if that +One other important advantage of code review if that making sure that the code is running properly on other machines, and that other people can read and understand the code easily, is the easiest way to be prepared in advance for a smooth project handover. From d32ed2428fa30a7e707e43b455d2eb0dc9de9ecc Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Thu, 16 Jan 2020 13:43:15 -0500 Subject: [PATCH 218/854] File paths --- chapters/planning-data-work.tex | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 9dd2e3e79..36eda9ac8 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -109,13 +109,11 @@ \subsection{Setting up your computer} \index{file paths} On MacOS this will be something like \path{/users/username/dropbox/project/...}, and on Windows, \path{C:/users/username/Github/project/...}. -We will write file paths such as \path{/Dropbox/project-title/DataWork/EncryptedData/}, -assuming the ``Dropbox'' folder lives inside your home folder. -File paths will use forward slashes (\texttt{/}) to indicate folders, -and typically use only A-Z (the 26 English characters), +Use forward slashes (\texttt{/}) in filepaths for folders, +and whenever possible use only A-Z (the 26 English characters), dashes (\texttt{-}), and underscores (\texttt{\_}) in folder names and filenames. -You should \textit{always} use forward slashes (\texttt{/}) in file paths in code, -just like an internet address. Do this even if you are using a Windows machine where +For emphasis: \textit{always} use forward slashes (\texttt{/}) in file paths in code, +just like in internet addresses. Do this even if you are using a Windows machine where both forward and backward slashes are allowed, as your code will otherwise break if anyone tries to run it on a Mac or Linux machine. 
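To illustrate the point about forward slashes, here is a minimal Stata sketch; the folder location and file name are hypothetical placeholders you would replace with your own project root.
\begin{verbatim}
* Define the project root once, using forward slashes even on Windows
global projectfolder "C:/Users/username/GitHub/project"

* Build every other path from that root, so the code runs on any machine
* once this single line is updated
use "${projectfolder}/DataWork/MasterData/master_sample.dta", clear
\end{verbatim}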
Making the structure of your directories a core part of your workflow is very important, since otherwise you will not be able to reliably transfer the instructions for replicating or carrying out your analytical work. @@ -559,7 +559,7 @@ \subsection{Documenting and organizing code} If you wait for a long time to have your code reviewed, and it gets too complex, preparation and code review will require more time and work, and that is usually the reason why this step is skipped. -One other important advantage of code review if that +One other important advantage of code review is that making sure that the code is running properly on other machines, and that other people can read and understand the code easily, is the easiest way to be prepared in advance for a smooth project handover. From d32ed2428fa30a7e707e43b455d2eb0dc9de9ecc Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Thu, 16 Jan 2020 13:43:15 -0500 Subject: [PATCH 218/854] File paths --- chapters/planning-data-work.tex | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 9dd2e3e79..36eda9ac8 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -109,13 +109,11 @@ \subsection{Setting up your computer} \index{file paths} On MacOS this will be something like \path{/users/username/dropbox/project/...}, and on Windows, \path{C:/users/username/Github/project/...}. -We will write file paths such as \path{/Dropbox/project-title/DataWork/EncryptedData/}, -assuming the ``Dropbox'' folder lives inside your home folder. -File paths will use forward slashes (\texttt{/}) to indicate folders, -and typically use only A-Z (the 26 English characters), +Use forward slashes (\texttt{/}) in filepaths for folders, +and whenever possible use only A-Z (the 26 English characters), dashes (\texttt{-}), and underscores (\texttt{\_}) in folder names and filenames. -You should \textit{always} use forward slashes (\texttt{/}) in file paths in code, -just like an internet address. Do this even if you are using a Windows machine where +For emphasis: \textit{always} use forward slashes (\texttt{/}) in file paths in code, +just like in internet addresses. Do this even if you are using a Windows machine where both forward and backward slashes are allowed, as your code will otherwise break if anyone tries to run it on a Mac or Linux machine. From abdb5e30f6d5d4f48b9b0872f043216b7fa3a135 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Thu, 16 Jan 2020 13:48:35 -0500 Subject: [PATCH 219/854] Management tools --- chapters/planning-data-work.tex | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 36eda9ac8..1fd2a501a 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -185,9 +185,12 @@ \subsection{Documenting decisions and tasks} the records related to decision making on those tasks is permanently recorded and easy to find in the future when questions about that task come up. One popular and free implementation of this system is found in GitHub project boards. -Other systems which offer similar features (but are not explicitly Kanban-based) -are GitHub Issues and Dropbox Paper, which has a more chronological structure. -What is important is that your team chooses its system and sticks to it, +Other tools which currently offer similar features (but are not explicitly Kanban-based) +are GitHub Issues and Dropbox Paper. +Any specific list of software will quickly be outdated; +we mention these two as an example of one that is technically-organized and one that is chronological. +Choosing the right tool for the right needs is essential to being satisfied with the workflow. +What is important is that your team chooses its systems and sticks to those choices, so that decisions, discussions, and tasks are easily reviewable long after they are completed. Just like we use different file sharing tools for different types of files, From 178f473ac62ab2cd7cd14a2257290c1136c33cd2 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Thu, 16 Jan 2020 13:50:39 -0500 Subject: [PATCH 220/854] Stata language --- chapters/planning-data-work.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 1fd2a501a..ef829c43d 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -271,8 +271,8 @@ \subsection{Choosing software} and simultaneous work with other types of files, without leaving the editor. In our field of development economics, -Stata is by far the most commonly used programming language, -and the Stata do-file editor the most common editor. +Stata is by far the most commonly used statistical software, +and the built-in do-file editor the most common editor for programming Stata. We focus on Stata-specific tools and instructions in this book. Hence, we will use the terms `script' and `do-file' interchangeably to refer to Stata code throughout. From 04dab1b1313e85c42ea7307cbb2853b320dcce80 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Thu, 16 Jan 2020 13:53:22 -0500 Subject: [PATCH 221/854] Workflow thinking --- chapters/planning-data-work.tex | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index ef829c43d..4102765b7 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -282,7 +282,8 @@ \subsection{Choosing software} and understand Stata to be encoding a set of tasks as a record for the future.
We believe that this must change somewhat: in particular, we think that practitioners of Stata -must begin to think about their workflows more as programmers do, +must begin to think about their code and programming workflows +just as methodologically as they think about their research workflows, and that people who adopt this approach will be dramatically more capable in their analytical ability. This means that they will be more productive when managing teams, From 4f7ce7d80ad3c516a61ea5a620cd25e8fec8c5fa Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Thu, 16 Jan 2020 13:55:11 -0500 Subject: [PATCH 222/854] Code interacts well --- chapters/planning-data-work.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 4102765b7..9938e90f8 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -321,7 +321,7 @@ \section{Organizing code and data} are able to interact well with your work, whether they are yours or those of others. File organization makes your own work easier as well as more transparent, -and interacts well with tools like version control systems +and will make your code easier to combine with tools like version control systems
-One organizational practice that takes some getting used to +The main point to be considered is that files accessed by code face more restrictions\sidenote{ + \url{http://www2.stat.duke.edu/~rcs46/lectures_2015/01-markdown-Git/slides/naming-slides/naming-slides.pdf}}, +since different software and operating systems read file names in different ways. +Some of the differences between the two naming approaches are major and may be new to you, +so below are a few examples. +Introducing spaces between words in a file name (including the folder path) +can break a file's path when it's read by code, +so while a Word document may be called \texttt{2019-10-30 Sampling Procedure Description.docx}, +a related do file would have a name like \texttt{sampling-endline.do}. +Adding timestamps to binary files as in the example above can be useful, +as it is not straightforward to track changes using version control software. +However, for plaintext files tracked using Git, timestamps are an unnecessary distraction. +Similarly, technical files should never include capital letters, +as strings and file paths are case-sensitive in some software. +Finally, one organizational practice that takes some getting used to is the fact that the best names from a coding perspective are usually the opposite of those from an English perspective. For example, for a deidentified household dataset from the baseline round, From aab065bb7d0259b21614ce8565f411c36fbebcbc Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Thu, 16 Jan 2020 14:11:17 -0500 Subject: [PATCH 224/854] Stata is the most popular and shall always be --- chapters/planning-data-work.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index da53c2b37..9b7b31e6d 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -271,7 +271,7 @@ \subsection{Choosing software} and simultaneous work with other types of files, without leaving the editor. In our field of development economics, -Stata is by far the most commonly used statistical software, +Stata is currently the most commonly used statistical software, and the built-in do-file editor the most common editor for programming Stata. We focus on Stata-specific tools and instructions in this book. Hence, we will use the terms `script' and `do-file' From c17d4b1e4ffbf81f2492900a274face4fda34d68 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Thu, 16 Jan 2020 14:13:27 -0500 Subject: [PATCH 225/854] ET no phone home folder --- chapters/planning-data-work.tex | 3 --- 1 file changed, 3 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 9b7b31e6d..8599bd84f 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -100,9 +100,6 @@ \subsection{Setting up your computer} because other users can alter or delete them. \index{Dropbox} -Find your \textbf{home folder}. It is never your desktop. -On MacOS, this will be a folder with your username. -On Windows, this will be something like ``This PC''. Ensure you know how to get the \textbf{absolute file path} for any given file. Using the absolute file path, starting from the filesystem root, means that the computer will never accidentally load the wrong file. 
From 106a3c90ebd7c609e1fe91762173cb4463c46301 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Thu, 16 Jan 2020 14:14:21 -0500 Subject: [PATCH 226/854] Data map removed --- chapters/planning-data-work.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 8599bd84f..c0d295037 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -9,7 +9,7 @@ This means knowing which data sets and output you need at the end of the process, how they will stay organized, what types of data you'll handle, and whether the data will require special handling due to size or privacy considerations. -Identifying these details creates a \textbf{data map} for your project, +Identifying these details should help you map out the data needs for your project, giving you and your team a sense of how information resources should be organized. It's okay to update this map once the project is underway -- the point is that everyone knows what the plan is. From 329275d4496f10e26fe2e2a3044ca213d58b3089 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Thu, 16 Jan 2020 14:16:40 -0500 Subject: [PATCH 227/854] $tata --- chapters/planning-data-work.tex | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index c0d295037..b48a5fa7f 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -24,8 +24,9 @@ makes working together on outputs much easier from the very first discussion. This chapter will discuss some tools and processes that will help prepare you for collaboration and replication. -We will try to provide free, open-source, and platform-agnostic tools wherever possible, +We will provide free, open-source, and platform-agnostic tools wherever possible, and point to more detailed instructions when relevant. +(Stata is the notable exception here due to its current popularity in the field.) Most have a learning and adaptation process, meaning you will become most comfortable with each tool only by using it in real-world work. From 6510324e00564293b22c9292f3a377cc62531590 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Thu, 16 Jan 2020 14:19:56 -0500 Subject: [PATCH 228/854] Original or irreplaceable --- chapters/planning-data-work.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index b48a5fa7f..1d26da5fa 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -88,7 +88,7 @@ \subsection{Setting up your computer} Make sure your computer is backed up to prevent information loss. 
\index{backup} -Follow the \textbf{3-2-1 rule}: maintain 3 copies of all critical data, +Follow the \textbf{3-2-1 rule}: maintain 3 copies of all original or irreplaceable data, on at least 2 different hardware devices you have access to, with 1 offsite storage method.\sidenote{ \url{https://www.backblaze.com/blog/the-3-2-1-backup-strategy/}} From 7595fad5aa947bf20ac7ba38cd78f515d9f3ecb2 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Thu, 16 Jan 2020 14:21:18 -0500 Subject: [PATCH 229/854] Cloud copies --- chapters/planning-data-work.tex | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 1d26da5fa..9a6a84fbb 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -92,13 +92,12 @@ \subsection{Setting up your computer} on at least 2 different hardware devices you have access to, with 1 offsite storage method.\sidenote{ \url{https://www.backblaze.com/blog/the-3-2-1-backup-strategy/}} -One reasonable setup is having your primary computer, -a local hard drive managed with a tool like Time Machine -(alternatively, a fully synced secondary computer), -and either a remote copy maintained by a cloud backup service -or all original files stored on a remote server. -Dropbox and other synced files count only as local copies and never as remote backups, -because other users can alter or delete them. +One example of this setup is having one copy on your primary computer, +one copy on an external hard drive stored in a safe place, +and one copy in the cloud. +In this case, Dropbox and other automatic file sync services do not count as a cloud copy, +since other users can alter or delete them +unless you create a specific folder for this purpose that is not shared with anyone else. \index{Dropbox} Ensure you know how to get the \textbf{absolute file path} for any given file. From 717a0b861969d31f621b925db0406408c57ae781 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Thu, 16 Jan 2020 14:23:03 -0500 Subject: [PATCH 230/854] git in both cases Changed to generic "git" --- chapters/planning-data-work.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 9a6a84fbb..8f694f873 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -104,8 +104,8 @@ \subsection{Setting up your computer} Using the absolute file path, starting from the filesystem root, means that the computer will never accidentally load the wrong file. \index{file paths} -On MacOS this will be something like \path{/users/username/dropbox/project/...}, -and on Windows, \path{C:/users/username/Github/project/...}. +On MacOS this will be something like \path{/users/username/git/project/...}, +and on Windows, \path{C:/users/username/git/project/...}. Use forward slashes (\texttt{/}) in filepaths for folders, and whenever possible use only A-Z (the 26 English characters), dashes (\texttt{-}), and underscores (\texttt{\_}) in folder names and filenames. 
From 5719c82dbeb25511551a1a2d465f32055a095d11 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Thu, 16 Jan 2020 14:25:12 -0500 Subject: [PATCH 231/854] Future relevance --- chapters/planning-data-work.tex | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 8f694f873..c9a01ed6f 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -164,8 +164,9 @@ \subsection{Documenting decisions and tasks} across a group of people, or to remind you when old information becomes relevant. They are not structured to allow people to collaborate over a long time or to review old discussions. It is therefore easy to miss or lose communications from the past when they have relevance in the present. -Everything that is communicated over e-mail or any other instant medium should -immediately be transferred into a system that is designed to keep records. +Everything with future relevance that is communicated over e-mail or any other instant medium +-- such as, for example, decisions about sampling -- +should immediately be recorded in a system that is designed to keep permanent records. We call these systems collaboration tools, and there are several that are very useful.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Collaboration_Tools}} \index{collaboration tools} From b77f61e14cc0d8d9e42143b5c1621a4991c6c47e Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Thu, 16 Jan 2020 14:26:16 -0500 Subject: [PATCH 232/854] I am root --- chapters/planning-data-work.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index c9a01ed6f..5606d9aec 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -331,7 +331,7 @@ \subsection{Organizing files and folder structures} Agree with your team on a specific directory structure, and set it up at the beginning of the research project -in your top-level shared folder (the one over which you can control access permissions). +in your root folder (the one over which you can control access permissions). This will prevent future folder reorganizations that may slow down your workflow and, more importantly, ensure that your code files are always able to run on any machine. To support consistent folder organization, DIME Analytics maintains \texttt{iefolder}\sidenote{ From 2fc2df63799ebd1d33dbd3ce669040edfd91a08e Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 16 Jan 2020 15:49:01 -0500 Subject: [PATCH 233/854] [ch2] better way to introduce iegitaddm --- chapters/planning-data-work.tex | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 5606d9aec..307859661 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -365,11 +365,12 @@ \subsection{Organizing files and folder structures} and for the files that manage final analytical work. The command also has some flexibility for the addition of folders for non-primary data sources, although this is less well developed. -The package also includes the \texttt{iegitaddmd} command, -which can place a \texttt{README.md} file in each of these folders. 
-These \textbf{Markdown} files provide an easy and Git-compatible way +The \texttt{ietoolkit} package also includes the \texttt{iegitaddmd} command, +which can place \texttt{README.md} placeholder files in your folders so that +your folder structure can be shared using git. Since these placeholder files are in +\textbf{Markdown} \index{Markdown} they also provide an easy way to document the contents of every folder in the structure. - \index{Markdown} + The \texttt{DataWork} folder may be created either inside an existing project-based folder structure, or it may be created separately. From ba52829bdc58b23e3b05ccd74cc94e8ba8af8a8f Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 16 Jan 2020 16:06:57 -0500 Subject: [PATCH 234/854] [ch2] move Git sidenote to first time we mention Git --- chapters/planning-data-work.tex | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 5606d9aec..6879a8401 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -129,7 +129,9 @@ \subsection{Setting up your computer} which makes simultaneous editing difficult but other tasks easier. They also have some security concerns which we will address later. \textbf{Version control} is another method, -commonly implemented by tools like Git and GitHub. +commonly implemented by tools like Git\sidenote{ + \textbf{Git:} a multi-user version control system for collaborating on and tracking changes to code as it is written.} and GitHub\sidenote{ + \textbf{GitHub:} the biggest publicly available platform for hosting Git projects.}. \index{version control} Version control allows everyone to access different versions of files at the same time, making simultaneous editing easier but some other tasks harder. @@ -412,8 +414,6 @@ \subsection{Organizing files and folder structures} without needing to rely on dreaded filename-based versioning conventions. For code files, however, a more detailed version control system is usually desirable. We recommend using Git\sidenote{ - \textbf{Git:} a multi-user version control system for collaborating on and tracking changes to code as it is written.} -for all plaintext files. Git tracks all the changes you make to your code, and allows you to go back to previous versions without losing the information on changes made. It also makes it possible to work on multiple parallel versions of the code, From 6ced5209baf2e2b235eacd4de7a4a4e41d435e07 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 16 Jan 2020 16:07:20 -0500 Subject: [PATCH 235/854] [ch2] help the reader what plaintext is --- chapters/planning-data-work.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 6879a8401..34b4968d4 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -413,7 +413,7 @@ \subsection{Organizing files and folder structures} these are usually enough to manage changes to binary files (such as Word and PowerPoint documents) without needing to rely on dreaded filename-based versioning conventions. For code files, however, a more detailed version control system is usually desirable. -We recommend using Git\sidenote{ +We recommend using Git for all code and all other plaintext files ({\LaTeX} files, .csv/.txt tables etc.). Git tracks all the changes you make to your code, and allows you to go back to previous versions without losing the information on changes made. 
It also makes it possible to work on multiple parallel versions of the code, From eebcae4cc3a8664d9a0b02263b255359c0015577 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Thu, 16 Jan 2020 16:10:25 -0500 Subject: [PATCH 236/854] \index{} commands should come only after full stops. --- chapters/planning-data-work.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 307859661..22fbbd82e 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -368,9 +368,9 @@ \subsection{Organizing files and folder structures} The \texttt{ietoolkit} package also includes the \texttt{iegitaddmd} command, which can place \texttt{README.md} placeholder files in your folders so that your folder structure can be shared using git. Since these placeholder files are in -\textbf{Markdown} \index{Markdown} they also provide an easy way +\textbf{Markdown} they also provide an easy way to document the contents of every folder in the structure. - + \index{Markdown} The \texttt{DataWork} folder may be created either inside an existing project-based folder structure, or it may be created separately. From 17c33b3d217a7578265787d374fee783d896f51f Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Thu, 16 Jan 2020 16:11:56 -0500 Subject: [PATCH 237/854] Capitalize "Git" --- chapters/planning-data-work.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 22fbbd82e..18a4e7658 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -367,7 +367,7 @@ \subsection{Organizing files and folder structures} folders for non-primary data sources, although this is less well developed. The \texttt{ietoolkit} package also includes the \texttt{iegitaddmd} command, which can place \texttt{README.md} placeholder files in your folders so that -your folder structure can be shared using git. Since these placeholder files are in +your folder structure can be shared using Git. Since these placeholder files are in \textbf{Markdown} they also provide an easy way to document the contents of every folder in the structure. \index{Markdown} From 89a538ab1139f5c4871fca0d6bdb87cb34a3ab95 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Thu, 16 Jan 2020 16:19:29 -0500 Subject: [PATCH 238/854] Sync & share --- chapters/planning-data-work.tex | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 53e182461..b67ff2aae 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -383,9 +383,10 @@ \subsection{Organizing files and folder structures} \index{project folder} This is so the project folder can be maintained in a synced location like Dropbox, while the code folder can be maintained in a version-controlled location like GitHub. -(Remember, a version-controlled folder can \textit{never} be stored inside a synced folder, -because the versioning features are extremely disruptive to others -when the syncing utility operates on them, and vice versa.) +(Remember, a version-controlled folder \textit{should not} +be stored in a synced folder that is shared with other people. +Those two types of collaboration tools function very differently +and will almost always create undesired functionality if combined.) Nearly all code files and raw outputs (not datasets) are best managed this way. 
This is because code files are usually \textbf{plaintext} files, and non-technical files are usually \textbf{binary} files.\index{plaintext}\index{binary files} From 2ceeddf6e90367e9d2d9c0a0c75a9c15c37463be Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Thu, 16 Jan 2020 16:26:18 -0500 Subject: [PATCH 239/854] Principles --- chapters/handling-data.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index cf83cea46..fc9529d22 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -267,8 +267,8 @@ \section{Ensuring privacy and security in research data} \url{https://sdcpractice.readthedocs.io/en/latest/}} In all cases where this type of information is involved, -you must make sure that you adhere to several core processes, -including approval, consent, security, and privacy. +you must make sure that you adhere to several core principles. +These include ethical approval, participant consent, data security, and participant privacy. If you are a US-based researcher, you will become familiar with a set of governance standards known as ``The Common Rule''.\sidenote{ \url{https://www.hhs.gov/ohrp/regulations-and-policy/regulations/common-rule/index.html}} From 17f55ca77a6e5b8ae594ea00c6fe1ad3926a861b Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 16 Jan 2020 16:50:13 -0500 Subject: [PATCH 240/854] [ch2] duplicated side note --- chapters/planning-data-work.tex | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index b67ff2aae..e34738edc 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -621,8 +621,8 @@ \subsection{Output management} is the final step in producing research outputs for public consumption. Though formatted text software such as Word and PowerPoint are still prevalent, researchers are increasingly choosing to prepare final outputs -like documents and presentations using {\LaTeX}.\sidenote{ - \url{https://www.latex-project.org}} \index{{\LaTeX}.} +like documents and presentations using {\LaTeX}\index{{\LaTeX}}.\sidenote{ + \url{https://www.latex-project.org} and \url{https://github.com/worldbank/DIME-LaTeX-Templates}.} {\LaTeX} is a document preparation system that can create both text documents and presentations. The main advantage is that {\LaTeX} uses plaintext for all formatting, and it is necessary to learn its specific markup convention to use it. @@ -633,7 +633,7 @@ \subsection{Output management} Creating documents in {\LaTeX} using an integrated writing environment such as TeXstudio, TeXmaker or LyX is great for outputs that focus mainly on text, but include small chunks of code and static code outputs. -This book, for example, was written in {\LaTeX}\sidenote{\url{https://www.latex-project.org} and \url{https://github.com/worldbank/DIME-LaTeX-Templates}} and managed on GitHub\sidenote{\url{https://github.com/worldbank/d4di}}. +This book, for example, was written in {\LaTeX} and managed on GitHub\sidenote{\url{https://github.com/worldbank/d4di}}. Another option is to use the statistical software's dynamic document engines. 
This means you can write both text (in Markdown) and code in the script, From ec0c7866386cb2ef4b2b1f6c5c8d64d99ee18c6c Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 16 Jan 2020 16:59:56 -0500 Subject: [PATCH 241/854] [ch3] typo --- chapters/research-design.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 1d92e4ba6..99a87ab5a 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -67,7 +67,7 @@ \section{Causality, inference, and identification} identifies its estimate of treatment effects, so you can calculate and interpret those estimates appropriately. All the study designs we discuss here use the \textbf{potential outcomes} framework -to compare the a group that recieved some treatment to another, counterfactual group. +to compare a group that received some treatment to another, counterfactual group. Each of these types of approaches can be used in two contexts: \textbf{experimental} designs, in which the research team is directly responsible for creating the variation in treatment, From f3663661eef9124a3a3f4e5641ae275895d03b47 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 16 Jan 2020 17:03:03 -0500 Subject: [PATCH 242/854] [ch3] all four words hyphenated at a different location --- chapters/research-design.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 99a87ab5a..4b9bb4938 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -252,7 +252,7 @@ \subsection{Cross-sectional designs} What needs to be carefully maintainted in data for cross-sectional RCTs is the treatment randomization process itself, as well as detailed field data about differences -in data quality and loss-to-follow up across groups.\cite{athey2017econometrics} +in data quality and loss-to-follow-up across groups.\cite{athey2017econometrics} Only these details are needed to construct the appropriate estimator: clustering of the estimate is required at the level at which the treatment is assigned to observations, From 496b84f8ee43a8cd2785b417ce45880591b14fb9 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 16 Jan 2020 17:07:17 -0500 Subject: [PATCH 243/854] [ch3] typos - receive is your Achilles heal... :) --- chapters/research-design.tex | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 4b9bb4938..c521eda41 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -165,7 +165,7 @@ \subsection{Experimental and quasi-experimental research designs} if they had not been treated, and it is particularly effective at doing so as evidenced by its broad credibility in fields ranging from clinical medicine to development. Therefore RCTs are very popular tools for determining the causal impact -of specific prorgrams or policy interventions. +of specific programs or policy interventions. 
However, there are many types of treatments that are impractical or unethical to effectively approach using an experimental strategy, and therefore many limitations to accessing ``big questions'' @@ -180,7 +180,7 @@ \subsection{Experimental and quasi-experimental research designs} Second, takeup and implementation fidelity are extremely important, since programs will by definition have no effect if they are not in fact accepted by or delivered to -the people who are supposed to recieve them. +the people who are supposed to receive them. Unfortunately, these effects kick in very quickly and are highly nonlinear: 70\% takeup or efficacy doubles the required sample, and 50\% quadruples it.\sidenote{ \url{https://blogs.worldbank.org/impactevaluations/power-calculations-101-dealing-with-incomplete-take-up}} @@ -234,7 +234,7 @@ \subsection{Cross-sectional designs} \textbf{Cross-sectional} surveys are the simplest possible study design: a program is implemented, surveys are conducted, and data is analyzed. When it is an RCT, a randomization process constructs the control group at random -from the population that is eligible to recieve each treatment. +from the population that is eligible to receive each treatment. When it is observational, we present other evidence that a similar equivalence holds. Therefore, by construction, each unit's receipt of the treatment is unrelated to any of its other characteristics @@ -249,7 +249,7 @@ \subsection{Cross-sectional designs} then the outcome values at that point in time already reflect the effect of the treatment. -What needs to be carefully maintainted in data for cross-sectional RCTs +What needs to be carefully maintained in data for cross-sectional RCTs is the treatment randomization process itself, as well as detailed field data about differences in data quality and loss-to-follow-up across groups.\cite{athey2017econometrics} @@ -332,7 +332,7 @@ \subsection{Differences-in-differences} are critically important to maintain alongside the survey results. In panel data structures, we attempt to observe the exact same units in the repeated rounds, so that we see the same individuals -both before and after they have recieved treatment (or not).\sidenote{ +both before and after they have received treatment (or not).\sidenote{ \url{https://blogs.worldbank.org/impactevaluations/what-are-we-estimating-when-we-estimate-difference-differences}} This allows each unit's baseline outcome to be used as an additional control for its endline outcome, @@ -376,11 +376,11 @@ \subsection{Regression discontinuity} into comparable gorups of individuals who do and do not recieve a treatment.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Regression_Discontinuity}} These types of designs differ from cross-sectional and diff-in-diff designs -in that the group eligible to recieve treatment is not defined directly, +in that the group eligible to receive treatment is not defined directly, but instead created during the process of the treatment implementation. \index{regression discontinuity} In an RD design, there is typically some program or event -which has limited availability due to practical considerations or poicy choices +which has limited availability due to practical considerations or policy choices and is therefore made available only to individuals who meet a certain threshold requirement. 
The intuition of this design is that there is an underlying \textbf{running variable} which serves as the sole determinant of access to the program, @@ -421,7 +421,7 @@ \subsection{Regression discontinuity} \url{http://faculty.smu.edu/kyler/courses/7312/presentations/baumer/Baumer\_RD.pdf}} These presentations help to suggest both the functional form of the underlying relationship and the type of change observed at the discontinuity, -and help to avoid pitfalls in modelling that are difficult to detect with hypothesis tests.\sidenote{ +and help to avoid pitfalls in modeling that are difficult to detect with hypothesis tests.\sidenote{ \url{http://econ.lse.ac.uk/staff/spischke/ec533/RD.pdf}} Because these designs are so flexible compared to others, there is an extensive set of commands that help assess @@ -458,7 +458,7 @@ \subsection{Instrumental variables} However, instead of controlling for the running variable directly, the IV approach typically uses the \textbf{two-stage-least-squares (2SLS)} estimator.\sidenote{ \url{http://www.nuff.ox.ac.uk/teaching/economics/bond/IV\%20Estimation\%20Using\%20Stata.pdf}} -This estimator forms a prediction of the probability that the unit recieves treatment +This estimator forms a prediction of the probability that the unit receives treatment based on a regression against the instrumental variable. That prediction will, by assumption, be the portion of the actual treatment that is due to the instrument and not any other source, @@ -466,7 +466,7 @@ \subsection{Instrumental variables} this portion of the treatment can be used to assess its effects. Unfortunately, these estimators are known to have very high variances relative other methods, -particularly when the relationship between the intrument and the treatment is small.\cite{young2017consistency} +particularly when the relationship between the instrument and the treatment is small.\cite{young2017consistency} IV designs furthermore rely on strong but untestable assumptions about the relationship between the instrument and the outcome.\cite{bound1995problems} Therefore IV designs face intense scrutiny on the strength and exogeneity of the instrument, From 64d57a99785aa4e333f1a8bbf61cdf1b4bfa0cc6 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 16 Jan 2020 17:09:34 -0500 Subject: [PATCH 244/854] [ch3] Stata and not stata --- chapters/research-design.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index c521eda41..6ee1d31e7 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -543,7 +543,7 @@ \subsection{Matching} the simplest design only requires controls (indicator variables) for each group or, in the case of propensity scoring and similar approaches, weighting the data appropriately in order to balance the analytical samples on the selected variables. -The \texttt{teffects} suite in stata provides a wide variety +The \texttt{teffects} suite in Stata provides a wide variety of estimators and analytical tools for various designs.\sidenote{ \url{https://ssc.wisc.edu/sscc/pubs/stata_psmatch.htm}} The coarsened exact matching (\texttt{cem}) package applies the nonparametric approach.\sidenote{ From f94cd054ca70014363ef415fb2403268e1b2a971 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 16 Jan 2020 17:10:27 -0500 Subject: [PATCH 245/854] [ch3] grammar typo, right? 
--- chapters/research-design.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 6ee1d31e7..329e2f42b 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -559,7 +559,7 @@ \subsection{Matching} %----------------------------------------------------------------------------------------------- \subsection{Synthetic controls} -\textbf{Synthetic control} is a relative newer method +\textbf{Synthetic control} is a relatively newer method for the case when appropriate counterfactual individuals do not exist in reality and there are very few (often only one) treatment unit.\cite{abadie2015comparative} \index{synthetic controls} From 6b182f110546b4f30f3204014c842fbbd54370a2 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 16 Jan 2020 17:16:15 -0500 Subject: [PATCH 246/854] [ch3] when Stata specific software, say so --- chapters/research-design.tex | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 329e2f42b..eb41d8703 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -354,7 +354,7 @@ \subsection{Differences-in-differences} As with cross-sectional designs, this set of study designs is widespread. Therefore there exist a large number of standardized tools for analysis. -Our \texttt{ietoolkit} package includes the \texttt{ieddtab} command +Our \texttt{ietoolkit} Stata package includes the \texttt{ieddtab} command which produces standardized tables for reporting results.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/ieddtab}} For more complicated versions of the model @@ -481,7 +481,7 @@ \subsection{Instrumental variables} In practice, there are a variety of packages that can be used to analyse data and report results from instrumental variables designs. -While the built-in command \texttt{ivregress} will often be used +While the built-in Stata command \texttt{ivregress} will often be used to create the final results, these are not sufficient on their own. 
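As a rough sketch of that workflow -- the outcome, endogenous regressor, instrument, and controls below are hypothetical placeholders rather than a recommended specification -- the estimation and its basic diagnostics might look like:

    * Sketch only: y, x_endog, z_instr, and the controls are placeholder names.
    * Two-stage least squares with one endogenous regressor and one instrument
    ivregress 2sls y control1 control2 (x_endog = z_instr), vce(cluster cluster_id)

    * Report first-stage statistics to assess the strength of the instrument
    estat firststage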
The \textbf{first stage} of the design should be extensively tested, to demonstrate the strength of the relationship between @@ -548,7 +548,7 @@ \subsection{Matching} \url{https://ssc.wisc.edu/sscc/pubs/stata_psmatch.htm}} The coarsened exact matching (\texttt{cem}) package applies the nonparametric approach.\sidenote{ \url{https://gking.harvard.edu/files/gking/files/cem-stata.pdf}} -DIME's \texttt{iematch} package produces matchings based on a single continuous matching variable.\sidenote{ +DIME's \texttt{iematch} command in the \texttt{ietoolkit} package produces matchings based on a single continuous matching variable.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/iematch}} In any of these cases, detailed reporting of the matching model is required, including the resulting effective weights of observations, From 7f842e62c725e64adc82d870515e2820bc6bb3e6 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Fri, 17 Jan 2020 14:51:55 -0500 Subject: [PATCH 247/854] [ch1] fix issue #304 --- chapters/handling-data.tex | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index fc9529d22..afa83cf10 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -208,7 +208,8 @@ \subsection{Research credibility} or the \textbf{OSF} registry\sidenote{\url{https://osf.io/registries}} as appropriate.\sidenote{\url{http://datacolada.org/12}} \index{pre-registration} -Garden varieties of research standards from journals, funders, and others feature both ex ante +Common research standards from journals, funders, and others feature both ex +ante (or ``regulation'') and ex post (or ``verification'') policies.\cite{stodden2013toward} Ex ante policies require that authors bear the burden of ensuring they provide some set of materials before publication @@ -366,8 +367,8 @@ \subsection{Transmitting and storing data securely} Proper encryption means that, even if the information were to be intercepted or made public, the files that would be obtained would be useless to the recipient. -In security parlance this person is often referred to as an ``intruder'' -but it is rare that data breaches are nefarious or even intentional. +In security language this person is often referred to as an ``intruder'' +but it is rare that data breaches are malicious or even intentional. The easiest way to protect personal information is not to use it. It is often very simple to conduct planning and analytical work From 255159ed226148b86c2844e692803dd80e4b3942 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Fri, 17 Jan 2020 15:16:05 -0500 Subject: [PATCH 248/854] [ch4] new, and not the same seed, on each reboot of Stata --- chapters/sampling-randomization-power.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index de449e8ec..323f97407 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -90,7 +90,7 @@ \subsection{Reproducibility in random Stata processes} Basically, it has a really long ordered list of numbers with the property that knowing the previous one gives you precisely zero information about the next one. Stata uses one of these numbers every time it has a task that is non-deterministic. 
-In ordinary use, it will cycle through these numbers starting from a fixed point +In ordinary use, it will cycle through these numbers starting from a new point every time you restart Stata, and by the time you get to any given script, the current state and the subsequent states will be as good as random.\sidenote{ \url{https://www.stata.com/manuals14/rsetseed.pdf}} From d743a62a11577b91d6911cf1fbf581f081657414 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Fri, 17 Jan 2020 15:17:02 -0500 Subject: [PATCH 249/854] [ch4] scope of version is somewhere between local and global --- chapters/sampling-randomization-power.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 323f97407..160b5df3c 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -113,7 +113,7 @@ \subsection{Reproducibility in random Stata processes} \url{https://dimewiki.worldbank.org/wiki/Master_Do-files}} However, note that testing your do-files without running them via the master do-file may produce different results, -since Stata's \texttt{version} expires after execution just like a \texttt{local}. +since Stata's \texttt{version} expires after each time you run your do-files. \textbf{Sorting} means that the actual data that the random process is run on is fixed. Because numbers are assigned to each observation in sequence, From ea1996d5893d6f907496c59db70c69b7f72ae35a Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Fri, 17 Jan 2020 15:17:50 -0500 Subject: [PATCH 250/854] [ch4] adding row-by-row, easier to visualize than just "sequence" --- chapters/sampling-randomization-power.tex | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 160b5df3c..37e7dda04 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -116,7 +116,8 @@ \subsection{Reproducibility in random Stata processes} since Stata's \texttt{version} expires after each time you run your do-files. \textbf{Sorting} means that the actual data that the random process is run on is fixed. -Because numbers are assigned to each observation in sequence, +Because numbers are assigned to each observation in row-by-row starting from +the top row, changing their order will change the result of the process. A corollary is that the underlying data must be unchanged between runs: you must make a fixed final copy of the data when you run a randomization for fieldwork. From 0c9bb6883c4754a85459265212c208bfc32eb20b Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Fri, 17 Jan 2020 15:18:29 -0500 Subject: [PATCH 251/854] [ch4] recycled seed, this is more on point and more intuitive --- chapters/sampling-randomization-power.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 37e7dda04..392980945 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -132,7 +132,7 @@ \subsection{Reproducibility in random Stata processes} In Stata, \texttt{set seed [seed]} will set the generator to that state. You should use exactly one seed per randomization process. The most important thing is that each of these seeds is truly random, -so do not use shortcuts such as the current date or a fixed seed. 
+so do not use shortcuts such as the current date or a seed you have used before. You will see in the code below that we include the source and timestamp for verification. Any process that includes a random component is a random process, including sampling, randomization, power calculation, and algorithms like bootstrapping. From 9046fbd444c6b53c5591571d7755df239fde86bb Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Fri, 17 Jan 2020 15:18:57 -0500 Subject: [PATCH 252/854] [ch4] new paragraph as this summary is not specific to seed --- chapters/sampling-randomization-power.tex | 1 + 1 file changed, 1 insertion(+) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 392980945..4f1c03d2b 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -139,6 +139,7 @@ \subsection{Reproducibility in random Stata processes} Other commands may induce randomness in the data or alter the seed without you realizing it, so carefully confirm exactly how your code runs before finalizing it.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Randomization_in_Stata}} + To confirm that a randomization has worked well before finalizing its results, save the outputs of the process in a temporary location, re-run the code, and use \texttt{cf} or \texttt{datasignature} to ensure nothing has changed. From 78518eadc470ff74744d99002ffcbece8cf46148 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Fri, 17 Jan 2020 15:19:19 -0500 Subject: [PATCH 253/854] [ch4] gold standard, let someone else reproduce --- chapters/sampling-randomization-power.tex | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 4f1c03d2b..a35be13f8 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -142,7 +142,10 @@ \subsection{Reproducibility in random Stata processes} To confirm that a randomization has worked well before finalizing its results, save the outputs of the process in a temporary location, -re-run the code, and use \texttt{cf} or \texttt{datasignature} to ensure nothing has changed. +re-run the code, and use \texttt{cf} or \texttt{datasignature} to ensure +nothing has changed. It is also advisable to let someone else re-produce your +randomization results on their machine to remove any doubt that your results +are reproducable. %----------------------------------------------------------------------------------------------- From 7e2e193e36f812e9068c128923c1be7431535da6 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Fri, 17 Jan 2020 15:23:05 -0500 Subject: [PATCH 254/854] [ch4] typo --- chapters/sampling-randomization-power.tex | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index de449e8ec..d4b3bcabd 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -122,7 +122,8 @@ \subsection{Reproducibility in random Stata processes} you must make a fixed final copy of the data when you run a randomization for fieldwork. In Stata, the only way to guarantee a unique sorting order is to use \texttt{isid [id\_variable], sort}. (The \texttt{sort, stable} command is insufficient.) -You can additionally use the \texttt{datasignature} commannd to make sure the data is unchanged. 
+You can additionally use the \texttt{datasignature} command to make sure the +data is unchanged. \textbf{Seeding} means manually setting the start-point of the randomization algorithm. You can draw a six-digit seed randomly by visiting \url{http://bit.ly/stata-random}. From 669314e58a0fde6c6d409b81b3966be550ce2b2a Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Fri, 17 Jan 2020 15:42:44 -0500 Subject: [PATCH 255/854] [ch 4] typo --- chapters/sampling-randomization-power.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index d4b3bcabd..fb0a0f6ad 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -391,7 +391,7 @@ \subsection{Power calculations} of your design are located, so you know the relative tradeoffs you will face by changing your randomization scheme for the final design. They also allow realistic interpretations of evidence: -results low-power studies can be very interesting, +results of low-power studies can be very interesting, but they have a correspondingly higher likelihood of reporting false positive results. From bf7d6ac9c0cfc0a82eb9678d6be9d2d690a39202 Mon Sep 17 00:00:00 2001 From: Luiza Date: Fri, 17 Jan 2020 16:17:39 -0500 Subject: [PATCH 256/854] [ch4] Solve #149 --- chapters/handling-data.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index a11958b77..8e3bb52c6 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -13,7 +13,7 @@ and that consumers of research can have confidence in its conclusions. What we call ethical standards in this chapter is a set of practices for research transparency and data privacy that address these two components. -Their adoption is an objective measure of to judge a research product's performance in both. +Their adoption is an objective measure to judge a research product's performance in both. Without these transparent measures of credibility, reputation is the primary signal for the quality of evidence, and two failures may occur: low-quality studies from reputable sources may be used as evidence when in fact they don't warrant it, and high-quality studies from sources without an international reputation may be ignored. From c20476cabcb6b2e765fc3372679f8c2cde6c59f3 Mon Sep 17 00:00:00 2001 From: Luiza Date: Fri, 17 Jan 2020 16:18:24 -0500 Subject: [PATCH 257/854] Revert "[ch4] Solve #149" This reverts commit bf7d6ac9c0cfc0a82eb9678d6be9d2d690a39202. --- chapters/handling-data.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 8e3bb52c6..a11958b77 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -13,7 +13,7 @@ and that consumers of research can have confidence in its conclusions. What we call ethical standards in this chapter is a set of practices for research transparency and data privacy that address these two components. -Their adoption is an objective measure to judge a research product's performance in both. +Their adoption is an objective measure of to judge a research product's performance in both. 
Without these transparent measures of credibility, reputation is the primary signal for the quality of evidence, and two failures may occur: low-quality studies from reputable sources may be used as evidence when in fact they don't warrant it, and high-quality studies from sources without an international reputation may be ignored. From 166fdac112bafdbba669c3ce7d311e03263deac7 Mon Sep 17 00:00:00 2001 From: ankritisingh <54277703+ankritisingh@users.noreply.github.com> Date: Fri, 17 Jan 2020 16:37:49 -0500 Subject: [PATCH 258/854] Removing the quote before r(table) --- code/code.do | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/code/code.do b/code/code.do index 96a58123e..4c217674f 100644 --- a/code/code.do +++ b/code/code.do @@ -5,7 +5,7 @@ reg price mpg rep78 headroom , coefl * Transpose and store the output - matrix results = `r(table)' + matrix results = r(table)' * Load the results into memory clear From bc06355bc9bc02fb943de69b9d282754e77658fb Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Kristoffer=20Bj=C3=A4rkefur?= Date: Fri, 17 Jan 2020 19:09:12 -0500 Subject: [PATCH 259/854] Update chapters/sampling-randomization-power.tex Co-Authored-By: Benjamin Daniels --- chapters/sampling-randomization-power.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index a35be13f8..490322d08 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -113,7 +113,7 @@ \subsection{Reproducibility in random Stata processes} \url{https://dimewiki.worldbank.org/wiki/Master_Do-files}} However, note that testing your do-files without running them via the master do-file may produce different results, -since Stata's \texttt{version} expires after each time you run your do-files. +since Stata's \texttt{version} setting expires after each time you run your do-files. \textbf{Sorting} means that the actual data that the random process is run on is fixed. Because numbers are assigned to each observation in row-by-row starting from From 47937388d9b9c8c5f12e9527d56c1f161efec15e Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Kristoffer=20Bj=C3=A4rkefur?= Date: Fri, 17 Jan 2020 19:10:04 -0500 Subject: [PATCH 260/854] Update chapters/sampling-randomization-power.tex Co-Authored-By: Benjamin Daniels --- chapters/sampling-randomization-power.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 490322d08..03b4ff6b9 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -143,7 +143,7 @@ \subsection{Reproducibility in random Stata processes} To confirm that a randomization has worked well before finalizing its results, save the outputs of the process in a temporary location, re-run the code, and use \texttt{cf} or \texttt{datasignature} to ensure -nothing has changed. It is also advisable to let someone else re-produce your +nothing has changed. It is also advisable to let someone else reproduce your randomization results on their machine to remove any doubt that your results are reproducable. 
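A minimal sketch of that verification loop in practice -- the file name, ID variable, and seed value below are placeholders, and the seed should be drawn freshly at random as described above -- could be:

    * Sketch only: "baseline.dta", "hhid", and the seed value are placeholders.
    version 13.1                 // VERSION: fix the random-number generator

    use "baseline.dta", clear
    isid hhid, sort              // SORT: guarantee a unique, stable sort order
    set seed 287608              // SEED: example value only; draw your own at random

    gen  rand      = runiform()
    sort rand
    gen  treatment = (_n <= _N/2)

    * Save the assignment, then re-run the code above and confirm nothing changed
    tempfile assignment
    save    "`assignment'"
    * ... after re-running the randomization ...
    cf _all using "`assignment'"

Running the same do-file on a second machine and repeating the \texttt{cf} check is the simplest way to confirm the results are fully reproducible.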
From 64a6b2886a54867b3bb8bf67d56a781eb12d7259 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Kristoffer=20Bj=C3=A4rkefur?= Date: Fri, 17 Jan 2020 19:10:28 -0500 Subject: [PATCH 261/854] Update chapters/sampling-randomization-power.tex Co-Authored-By: Benjamin Daniels --- chapters/sampling-randomization-power.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 03b4ff6b9..e15984392 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -90,7 +90,7 @@ \subsection{Reproducibility in random Stata processes} Basically, it has a really long ordered list of numbers with the property that knowing the previous one gives you precisely zero information about the next one. Stata uses one of these numbers every time it has a task that is non-deterministic. -In ordinary use, it will cycle through these numbers starting from a new point +In ordinary use, it will cycle through these numbers starting from a fixed point every time you restart Stata, and by the time you get to any given script, the current state and the subsequent states will be as good as random.\sidenote{ \url{https://www.stata.com/manuals14/rsetseed.pdf}} From fed10601d7b41a92071fbc50d09b7d6023f669c0 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Fri, 17 Jan 2020 23:06:46 -0500 Subject: [PATCH 262/854] [ch 5] break up very long line and missing ' --- chapters/data-collection.tex | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 915ce9287..44699d504 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -49,7 +49,16 @@ \subsection{Questionnaire design for quantitative analysis} \textbf{Likert scales}\sidenote{\textbf{Likert scale:} an ordered selection of choices indicating the respondent's level of agreement or disagreement with a proposed statement.}. Second, if collecting data on medication use or supplies, you could collect: the brand name of the product; the generic name of the product; the coded compound of the product; or the broad category to which each product belongs (antibiotic, etc.). All four may be useful for different reasons, but the latter two are likely to be the most useful for data analysis. The coded compound requires providing a translation dictionary to field staff, but enables automated rapid recoding for analysis with no loss of information. The generic class requires agreement on the broad categories of interest, but allows for much more comprehensible top-line statistics and data quality checks. There is debate over how individual questions should be identified: formats like \texttt{hq\_1} are hard to remember and unpleasant to reorder, but formats like \texttt{hq\_asked\_about\_loans} quickly become cumbersome. -We recommend using descriptive names with clear prefixes so that variables within a module stay together when sorted alphabetically.\sidenote{\url{https://medium.com/@janschenk/variable-names-in-survey-research-a18429d2d4d8}} Variable names should never include spaces or mixed cases (all lower case is best). Take care with the length: very long names will be cut off in certain software, which could result in a loss of uniqueness. We discourage explicit question numbering, as it discourages re-ordering, which is a common recommended change after the pilot. 
In the case of follow-up surveys, numbering can quickly become convoluted, too often resulting in variables names like 'ag\_15a', 'ag\_15\_new', ag\_15\_fup2', etc. +We recommend using descriptive names with clear prefixes so that variables +within a module stay together when sorted +alphabetically.\sidenote{\url{https://medium.com/@janschenk/variable-names-in-survey-research-a18429d2d4d8}} + Variable names should never include spaces or mixed cases (all lower case is +best). Take care with the length: very long names will be cut off in certain +software, which could result in a loss of uniqueness. We discourage explicit +question numbering, as it discourages re-ordering, which is a common +recommended change after the pilot. In the case of follow-up surveys, numbering +can quickly become convoluted, too often resulting in variables names like +'ag\_15a', 'ag\_15\_new', 'ag\_15\_fup2', etc. Questionnaires must include ways to document the reasons for \textbf{attrition}, treatment \textbf{contamination}, and \textbf{loss to follow-up}. \index{attrition}\index{contamination} From 0e2e43b5ee80a25ab6552fe50f94ad0bc2ab5598 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Fri, 17 Jan 2020 23:07:15 -0500 Subject: [PATCH 263/854] [ch5] always specify that this is a Stata package --- chapters/data-collection.tex | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 44699d504..273d40e48 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -115,7 +115,9 @@ \subsection{CAPI features} \subsection{Compatibility with analysis software} All CAPI software include debugging and test options to correct syntax errors and make sure that the survey instruments will successfully compile. This is not sufficient, however, to ensure that the resulting dataset will load without errors in your data analysis software of choice. -We developed the \texttt{ietestform} command,\sidenote{\url{https://dimewiki.worldbank.org/wiki/ietestform}} part of +We developed the \texttt{ietestform} +command,\sidenote{\url{https://dimewiki.worldbank.org/wiki/ietestform}} part of +the Stata package \texttt{iefieldkit}, to implement a form-checking routine for \textbf{SurveyCTO}, a proprietary implementation of \textbf{Open Data Kit (ODK)}. Intended for use during questionnaire programming and before fieldwork, ietestform tests for best practices in coding, naming and labeling, and choice lists. Although \texttt{ietestform} is software-specific, many of the tests it runs are general and important to consider regardless of software choice. From 27e748f90b4705dd4d82a536a0f006fb51e3121a Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Fri, 17 Jan 2020 23:08:12 -0500 Subject: [PATCH 264/854] [ch5] break up lines, always write itestform with code format --- chapters/data-collection.tex | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 273d40e48..cd4120d58 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -119,9 +119,14 @@ \subsection{Compatibility with analysis software} command,\sidenote{\url{https://dimewiki.worldbank.org/wiki/ietestform}} part of the Stata package \texttt{iefieldkit}, to implement a form-checking routine for \textbf{SurveyCTO}, a proprietary implementation of \textbf{Open Data Kit (ODK)}. 
-Intended for use during questionnaire programming and before fieldwork, ietestform tests for best practices in coding, naming and labeling, and choice lists.
+Intended for use during questionnaire programming and before fieldwork,
+\texttt{ietestform} tests for best practices in coding, naming and labeling,
+and choice lists.
 Although \texttt{ietestform} is software-specific, many of the tests it runs are general and important to consider regardless of software choice.
-To give a few examples, ietestform tests that no variable names exceed 32 characters, the limit in Stata (variable names that exceed that limit will be truncated, and as a result may no longer be unique). It checks whether ranges are included for numeric variables.
+To give a few examples, \texttt{ietestform} tests that no variable names exceed
+32 characters, the limit in Stata (variable names that exceed that limit will
+be truncated, and as a result may no longer be unique). It checks whether
+ranges are included for numeric variables.
 \texttt{ietestform} also removes all leading and trailing blanks from response lists, which could be handled inconsistently across software.

 \subsection{Data-focused Pilot}

From 94812ed02de0a12f7f9a69834aa46d99e5bafda9 Mon Sep 17 00:00:00 2001
From: kbjarkefur
Date: Fri, 17 Jan 2020 23:14:14 -0500
Subject: [PATCH 265/854] [ch5] break up lines and missing space

---
 chapters/data-collection.tex | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex
index cd4120d58..4a1613fbc 100644
--- a/chapters/data-collection.tex
+++ b/chapters/data-collection.tex
@@ -173,7 +173,19 @@ \subsection{High Frequency Checks}
 This information should be stored as a dataset in its own right --
 a \textbf{tracking dataset} -- that records all events in which
 survey substitutions and loss to follow-up occurred in the field
 and how they were implemented and resolved.

-High frequency checks should also include survey-specific data checks. As CAPI software incorporates many data control features, discussed above, these checks should focus on issues CAPI software cannot check automatically. As most of these checks are survey specific, it is difficult to provide general guidance. An in-depth knowledge of the questionnaire, and a careful examination of the pre-analysis plan, is the best preparation.Examples include consistency across multiple responses, complex calculations suspicious patterns in survey timing or atypical response patters from specific enumerators. \sidenote{\url{https://dimewiki.worldbank.org/wiki/Monitoring_Data_Quality}} CAPI software typically provides rich metadata, which can be useful in assessing interview quality. For example, automatically collected time stamps show how long enumerators spent per question, and trace histories show how many times answers were changed before the survey was submitted.
+High frequency checks should also include survey-specific data checks. As CAPI
+software incorporates many data control features, discussed above, these checks
+should focus on issues CAPI software cannot check automatically. As most of
+these checks are survey specific, it is difficult to provide general guidance.
+An in-depth knowledge of the questionnaire, and a careful examination of the
+pre-analysis plan, is the best preparation. Examples include consistency
+across multiple responses, complex calculations, suspicious patterns in survey
+timing, or atypical response patterns from specific enumerators.
+\sidenote{\url{https://dimewiki.worldbank.org/wiki/Monitoring_Data_Quality}} +CAPI software typically provides rich metadata, which can be useful in +assessing interview quality. For example, automatically collected time stamps +show how long enumerators spent per question, and trace histories show how many +times answers were changed before the survey was submitted. \subsection{Data considerations for field monitoring} From 94812ed02de0a12f7f9a69834aa46d99e5bafda9 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Fri, 17 Jan 2020 23:17:42 -0500 Subject: [PATCH 266/854] [ch5] break up lines and write backcheck back-check consistently --- chapters/data-collection.tex | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 4a1613fbc..a36ca2ed1 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -192,11 +192,17 @@ \subsection{Data considerations for field monitoring} Careful monitoring of field work is essential for high quality data. \textbf{Back-checks}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Back_Checks}} and other survey audits help ensure that enumerators are following established protocols, and are not falsifying data. For back-checks, a random subset of the field sample is chosen and a subset of information from the full survey is verified through a brief interview with the original respondent. -Design of the backcheck questionnaire follows the same survey design principles discussed above, in particular you should use the pre-analysis plan or list of key outcomes to establish which subset of variables to prioritize. - -Real-time access to the survey data increases the potential utility of backchecks dramatically, and both simplifies and improves the rigor of related workflows. -You can use the raw data to draw the backcheck sample; assuring it is appropriately apportioned across interviews and survey teams. -As soon as backchecks are complete, the backcheck data can be tested against the original data to identify areas of concern in real-time. +Design of the back-check questionnaire follows the same survey design +principles discussed above, in particular you should use the pre-analysis plan +or list of key outcomes to establish which subset of variables to prioritize. + +Real-time access to the survey data increases the potential utility of +back-checks dramatically, and both simplifies and improves the rigor of related +workflows. +You can use the raw data to draw the back-check sample; assuring it is +appropriately apportioned across interviews and survey teams. +As soon as back-checks are complete, the back-check data can be tested against +the original data to identify areas of concern in real-time. \texttt{bcstats} is a useful tool for analyzing back-check data in Stata module. \sidenote{\url{https://ideas.repec.org/c/boc/bocode/s458173.html}} CAPI surveys also provide a unique opportunity to do audits through audio recordings of the interview, typically short recordings triggered at random throughout the questionnaire. 
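As a minimal sketch of that comparison using only built-in Stata commands -- the file names, the ID variable \texttt{hhid}, and the compared variables are placeholders -- one might flag disagreements and summarize them by enumerator before turning to a dedicated command:

    * Sketch only: file names, "hhid", and the compared variables are placeholders.
    use "backcheck.dta", clear
    rename (num_members farm_size) (bc_num_members bc_farm_size)
    merge 1:1 hhid using "survey.dta", keep(match) nogenerate

    * Flag disagreements between the original interview and the back-check
    gen flag_members = (bc_num_members != num_members)
    gen flag_farm    = (bc_farm_size   != farm_size)

    * Summarize error rates by enumerator to spot problematic patterns
    tabstat flag_members flag_farm, by(enumerator) statistics(mean)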
From dc8f8d621de9c0bdcce8d8c6e1387049a9c9cce4 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Fri, 17 Jan 2020 23:21:27 -0500 Subject: [PATCH 267/854] [ch5] combine very similar side notes --- chapters/data-collection.tex | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index a36ca2ed1..a4f5aeea1 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -222,8 +222,9 @@ \section{Collecting Data Securely} \subsection{Secure data in the field} All mainstream data collection software automatically \textbf{encrypt} -\sidenote{\textbf{Encryption:} the process of making information unreadable to anyone without access to a specific deciphering key.} -\sidenote{\url{https://dimewiki.worldbank.org/wiki/Encryption}} +\sidenote{\textbf{Encryption:} the process of making information unreadable to +anyone without access to a specific deciphering +key. \url{https://dimewiki.worldbank.org/wiki/Encryption}} all data submitted from the field while in transit (i.e., upload or download). Your data will be encrypted from the time it leaves the device (in tablet-assisted data collation) or your browser (in web data collection), until it reaches the server. Therefore, as long as you are using an established CAPI software, this step is largely taken care of. However, the research team must ensure that all computers, tablets, and accounts used have a logon password and are never left unlocked. \subsection{Secure data storage} From 9a3c0d3293da3420937b81481b72c62071965886 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 20 Jan 2020 12:18:34 -0500 Subject: [PATCH 268/854] More abbreviations --- appendix/stata-guide.tex | 64 +++++++++++++++++++++------------------- 1 file changed, 33 insertions(+), 31 deletions(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index 941471fa8..831fd66b6 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -10,7 +10,7 @@ Recent Masters' program graduates that have joined our team tended to have very good knowledge in the theory of our trade, but also to require a lot of training in its practical skills. -To us, this is like graduating in architecture having learned +To us, this is like graduating in architecture having learned how to sketch, describe, and discuss the concepts and requirements of a new building very well, but without having the technical skill-set @@ -61,13 +61,13 @@ \section{Using the code examples in this book} \subsection{Understanding Stata code} Regardless of being new to Stata or having used it for decades, you will always run into commands that -you have not seen before or whose purpose you do not remember. -Every time that happens, you should always look that command up in the help file. -For some reason, we often encounter the conception that help files are only for beginners. -We could not disagree with that conception more, -as the only way to get better at Stata is to constantly read help files. -So if there is a command that you do not understand in any of our code examples, -for example \texttt{isid}, then write \texttt{help isid}, +you have not seen before or whose purpose you do not remember. +Every time that happens, you should always look that command up in the help file. +For some reason, we often encounter the conception that help files are only for beginners. +We could not disagree with that conception more, +as the only way to get better at Stata is to constantly read help files. 
+So if there is a command that you do not understand in any of our code examples, +for example \texttt{isid}, then write \texttt{help isid}, and the help file for the command \texttt{isid} will open. We cannot emphasize enough how important we think it is that you get into the habit of reading help files. @@ -76,24 +76,24 @@ \subsection{Understanding Stata code} and you will not be able to read their help files until you have installed the commands. Two examples of these in our code are \texttt{randtreat} or \texttt{ieboilstart}. The most common place to distribute user-written commands for Stata -is the Boston College Statistical Software Components (SSC) archive. +is the Boston College Statistical Software Components (SSC) archive. In our code examples, we only use either Stata's built-in commands or commands available from the -SSC archive. +SSC archive. So, if your installation of Stata does not recognize a command in our code, for example \texttt{randtreat}, then type \texttt{ssc install randtreat} in Stata. -Some commands on SSC are distributed in packages. -This is the case, for example, of \texttt{ieboilstart}. -That means that you will not be able to install it using \texttt{ssc install ieboilstart}. +Some commands on SSC are distributed in packages. +This is the case, for example, of \texttt{ieboilstart}. +That means that you will not be able to install it using \texttt{ssc install ieboilstart}. If you do, Stata will suggest that you instead use \texttt{findit ieboilstart}, which will search SSC (among other places) and see if there is a -package that contains a command called \texttt{ieboilstart}. -Stata will find \texttt{ieboilstart} in the package \texttt{ietoolkit}, +package that contains a command called \texttt{ieboilstart}. +Stata will find \texttt{ieboilstart} in the package \texttt{ietoolkit}, so to use this command you will type \texttt{ssc install ietoolkit} in Stata instead. -We understand that it can be confusing to work with packages for first time, -but this is the best way to set up your Stata installation to benefit from other -people's work that has been made publicly available, +We understand that it can be confusing to work with packages for first time, +but this is the best way to set up your Stata installation to benefit from other +people's work that has been made publicly available, and once you get used to installing commands like this it will not be confusing at all. All code with user-written commands, furthermore, is best written when it installs such commands at the beginning of the master do-file, so that the user does not have to search for packages manually. @@ -106,22 +106,22 @@ \subsection{Why we use a Stata style guide} non-official style guides like the JavaScript Standard Style\sidenote{\url{https://standardjs.com/\#the-rules}} for JavaScript or Hadley Wickham's\sidenote{\url{http://adv-r.had.co.nz/Style.html}} style guide for R. -Aesthetics is an important part of style guides, but not the main point. +Aesthetics is an important part of style guides, but not the main point. The existence of style guides improves the quality of the code in that language that is produced by all programmers in the community. It is through a style guide that unexperienced programmers can learn from more experienced programmers -how certain coding practices are more or less error-prone. -Broadly-accepted style guides make it easier to borrow solutions from each other and from examples online without causing bugs that might only be found too late. 
+how certain coding practices are more or less error-prone. +Broadly-accepted style guides make it easier to borrow solutions from each other and from examples online without causing bugs that might only be found too late. Similarly, globally standardized style guides make it easier to solve each others' problems and to collaborate or move from project to project, and from team to team. -There is room for personal preference in style guides, -but style guides are first and foremost about quality and standardization -- -especially when collaborating on code. +There is room for personal preference in style guides, +but style guides are first and foremost about quality and standardization -- +especially when collaborating on code. We believe that a commonly used Stata style guide would improve the quality of all code written in Stata, -which is why we have begun the one included here. -You do not necessarily need to follow our style guide precisely. -We encourage you to write your own style guide if you disagree with us. -The best style guide would be the one adopted the most widely. +which is why we have begun the one included here. +You do not necessarily need to follow our style guide precisely. +We encourage you to write your own style guide if you disagree with us. +The best style guide would be the one adopted the most widely. What is important is that you adopt a style guide and follow it consistently across your projects. \newpage @@ -198,9 +198,11 @@ \subsection{Abbreviating commands} \texttt{tab} & \texttt{tabulate} \\ \texttt{bys} & \texttt{bysort} \\ \texttt{qui} & \texttt{quietly} \\ + \texttt{noi} & \texttt{noisilt} \\ \texttt{cap} & \texttt{capture} \\ \texttt{forv} & \texttt{forvalues} \\ \texttt{prog} & \texttt{program} \\ + \texttt{hist} & \texttt{histogram} \\ \hline \end{tabular} \end{center} @@ -223,8 +225,8 @@ \subsection{Writing loops} and for looping across matrices with \texttt{i}, \texttt{j}. Other typical index names are \texttt{obs} or \texttt{var} when looping over observations or variables, respectively. But since Stata does not have arrays, such abstract syntax should not be used in Stata code otherwise. -Instead, index names should describe what the code is looping over -- -for example household members, crops, or medicines. +Instead, index names should describe what the code is looping over -- +for example household members, crops, or medicines. This makes code much more readable, particularly in nested loops. \codeexample{stata-loops.do}{./code/stata-loops.do} @@ -240,7 +242,7 @@ \subsection{Using whitespace} In the example below the exact same code is written twice, but in the better example whitespace is used to signal to the reader that the central object of this segment of code is the variable \texttt{employed}. -Organizing the code like this makes it much quicker to read, +Organizing the code like this makes it much quicker to read, and small typos stand out more, making them easier to spot. 
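In addition to the \texttt{stata-whitespace-columns.do} example referenced just below, a minimal sketch may help illustrate the alignment idea; the variable and value names here are hypothetical and are not taken from the book's example files.

    * Hypothetical cleaning snippet: whitespace is aligned so that the
    * variable "employed" clearly stands out as the focus of the block
    gen     employed = .
    replace employed = 0 if wage_last_month == 0
    replace employed = 1 if wage_last_month >  0 & !missing(wage_last_month)
    label variable employed "Worked for a wage in the last month"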
\codeexample{stata-whitespace-columns.do}{./code/stata-whitespace-columns.do} From 694a377e4801cb495997122a6fb6a799e6b424e7 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 20 Jan 2020 12:19:04 -0500 Subject: [PATCH 269/854] Typo --- appendix/stata-guide.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index 831fd66b6..bcd0586ac 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -198,7 +198,7 @@ \subsection{Abbreviating commands} \texttt{tab} & \texttt{tabulate} \\ \texttt{bys} & \texttt{bysort} \\ \texttt{qui} & \texttt{quietly} \\ - \texttt{noi} & \texttt{noisilt} \\ + \texttt{noi} & \texttt{noisily} \\ \texttt{cap} & \texttt{capture} \\ \texttt{forv} & \texttt{forvalues} \\ \texttt{prog} & \texttt{program} \\ From be32646d6cde18709a618066d30c6f5aed93afa7 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Mon, 20 Jan 2020 17:34:21 -0500 Subject: [PATCH 270/854] [ch6] comma typo --- chapters/data-analysis.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index c4ea28922..d78bec82b 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -224,7 +224,7 @@ \section{Data cleaning} It should also be easily traced back to the survey instrument, and be accompanied by a dictionary or codebook. Typically, one cleaned data set will be created for each data source, -i.e., per survey instrument. +i.e. per survey instrument. Each row in the cleaned data set represents one survey entry or unit of observation.\sidenote{\cite{tidy-data}} If the raw data set is very large, or the survey instrument is very complex, you may want to break the data cleaning into sub-steps, From 0544750cfeff80ef38f7ada96aa317a4ba599157 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Mon, 20 Jan 2020 17:57:45 -0500 Subject: [PATCH 271/854] [ch6] typos --- chapters/data-analysis.tex | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index d78bec82b..dfe850fb9 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -303,7 +303,8 @@ \section{Indicator construction} Are all variables you are combining into an index or average using the same scale? Are yes or no questions coded as 0 and 1, or 1 and 2? This is when you will use the knowledge of the data you acquired and the documentation you created during the cleaning step the most. -It is often useful to start looking at comparisons and other documentation outside the code editpr. +It is often useful to start looking at comparisons and other documentation +outside the code editor. Adding comments to the code explaining what you are doing and why is crucial here. There are always ways for things to go wrong that you never anticipated, but two issues to pay extra attention to are missing values and dropped observations. @@ -402,7 +403,8 @@ \section{Writing data analysis code} leave this to near publication time. % Self-promotion ------------------------------------------------ -Out team has created a few products to automate common outputs and save you precious research time. +Our team has created a few products to automate common outputs and save you +precious research time. The \texttt{ietoolkit} package includes two commands to export nicely formatted tables. \texttt{iebaltab}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Iebaltab}} creates and exports balance tables to excel and {\TeX}. 
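As a rough illustration of \texttt{iebaltab} -- the outcome variables, the \texttt{treatment} group variable, and the output file name below are hypothetical, and the exact option names should be double-checked against \texttt{help iebaltab} after installing \texttt{ietoolkit}:

    * Balance table for two hypothetical outcomes, by treatment arm
    iebaltab income_pc assets_index, grpvar(treatment) ///
        savetex("baltab.tex") replace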
\texttt{ieddtab}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Ieddtab}} does the same for difference-in-differences regressions.

From a99ff89679539be82c21bac19996281e01b85a97 Mon Sep 17 00:00:00 2001
From: kbjarkefur
Date: Mon, 20 Jan 2020 17:58:22 -0500
Subject: [PATCH 272/854] [ch6] do-file not do file, and master script not master do-file

---
 chapters/data-analysis.tex | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex
index dfe850fb9..e31869b28 100644
--- a/chapters/data-analysis.tex
+++ b/chapters/data-analysis.tex
@@ -88,7 +88,7 @@ \section{Data management}
Additionally, \texttt{iefolder} creates \textbf{master do-files}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Master\_Do-files}}
so all project code is reflected in a top-level script.

-% Master do file
+% Master scripts
Master scripts allow users to execute all the project code from a single file.
They briefly describe what each code file does,
and map the files they require and create.
@@ -415,7 +415,8 @@ \section{Writing data analysis code}
We attribute some of this to the difficulty of writing code to create them.
Making a visually compelling graph would already be hard enough if you didn't have to go through many rounds of googling to understand a command.
The trickiest part of using plot commands is to get the data in the right format.
-This is why the \textbf{Stata Visual Library} includes example data sets to use with each do file.
+This is why the \textbf{Stata Visual Library} includes example data sets to use
+with each do-file.
Whole books have been written on how to create good data visualizations,
so we will not attempt to give you advice on it.

From c89751b0daa3db20bf96f9164e07016067027c02 Mon Sep 17 00:00:00 2001
From: kbjarkefur
Date: Mon, 20 Jan 2020 18:02:24 -0500
Subject: [PATCH 273/854] [ch7] this type of apostrophe did not display correctly

---
 bibliography.bib | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/bibliography.bib b/bibliography.bib
index 3a2272318..325a89e44 100644
--- a/bibliography.bib
+++ b/bibliography.bib
@@ -31,7 +31,7 @@ @article{dafoe2014science
}

@article{flom2005latex,
-  title={{LaTeX} for academics and researchers who (think they) don’t need it},
+  title={{LaTeX} for academics and researchers who (think they) don't need it},
  author={Flom, Peter},
  journal={The PracTEX Journal},
  volume={4},

From 64d5053d0b082a6c7420e6cf4b76e22c25c6fde4 Mon Sep 17 00:00:00 2001
From: kbjarkefur
Date: Mon, 20 Jan 2020 18:04:10 -0500
Subject: [PATCH 274/854] [ch7] remove extra space

---
 chapters/publication.tex | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/chapters/publication.tex b/chapters/publication.tex
index 563c9ff4e..a34fc35fa 100644
--- a/chapters/publication.tex
+++ b/chapters/publication.tex
@@ -216,7 +216,7 @@ \subsection{Getting started with \LaTeX\ via Overleaf}
so that different writers do not create conflicted or out-of-sync copies,
and allows inviting collaborators to edit in a fashion similar to Google Docs.
Overleaf also offers a basic version history tool that avoids having to use separate software.
-Most importantly, it provides a `` rich text'' editor
+Most importantly, it provides a ``rich text'' editor
that behaves pretty similarly to familiar tools like Word,
so that people can write into the document without worrying too much about
the underlying \LaTeX\ coding.
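Since master scripts come up in the data-analysis patch above, and the Stata guide recommends installing user-written commands at the top of the master do-file, a minimal sketch of what such a file might look like follows; every folder, global, and do-file name here is hypothetical.

    * Master do-file sketch -- folder and file names are hypothetical
    * Install the user-written commands the project relies on
    ssc install ietoolkit , replace
    ssc install iefieldkit, replace

    * Define the project folder structure once, so every script uses the same paths
    global project "C:/Users/username/Dropbox/project"
    global code    "${project}/code"

    * Run every stage of the project in order
    do "${code}/1-cleaning.do"
    do "${code}/2-construction.do"
    do "${code}/3-analysis.do"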
@@ -226,7 +226,8 @@ \subsection{Getting started with \LaTeX\ via Overleaf}
On the downside, there is a small amount of up-front learning required,
continuous access to the Internet is necessary,
and updating figures and tables requires a bulk file upload that is tough to automate.
-One of the most common issues you will face using Overleaf's `` rich text'' editor will be special characters
+One of the most common issues you will face using Overleaf's ``rich text''
+editor will be special characters
which, because of code functions, need to be handled differently than in Word.
Most critically, the ampersand (\texttt{\&}), percent (\texttt{\%}), and underscore (\texttt{\_})
need to be ``escaped'' (interpreted as text and not code) in order to render.

From 27209f9962e7967ecf46a3ad89acb9d1168c6097 Mon Sep 17 00:00:00 2001
From: kbjarkefur
Date: Mon, 20 Jan 2020 18:04:22 -0500
Subject: [PATCH 275/854] [ch7 typo]

---
 chapters/publication.tex | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/publication.tex b/chapters/publication.tex
index a34fc35fa..38a8f1527 100644
--- a/chapters/publication.tex
+++ b/chapters/publication.tex
@@ -334,7 +334,7 @@ \subsection{Publishing code for replication}
By contrast, replication code usually has few legal and privacy constraints.
In most cases code will not contain identifying information;
check carefully that it does not.
-Pubishing code also requires assigning a license to it;
+Publishing code also requires assigning a license to it;
in a majority of cases, code publishers like GitHub
offer extremely permissive licensing options by default.
(If you do not provide a license, nobody can use your code!)

From 02a43d885219f94c9aa55efe8b245ee7a16b0a74 Mon Sep 17 00:00:00 2001
From: kbjarkefur
Date: Mon, 20 Jan 2020 18:46:11 -0500
Subject: [PATCH 276/854] [ch6] be specific when we talk about software specific tools

---
 chapters/data-analysis.tex | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex
index e31869b28..ed3b0b0de 100644
--- a/chapters/data-analysis.tex
+++ b/chapters/data-analysis.tex
@@ -176,7 +176,8 @@ \section{Data cleaning}
that can be cross-referenced with other records,
such as the Master Data Set\sidenote{\url{https://dimewiki.worldbank.org/wiki/Master\_Data\_Set}}
and other rounds of data collection.
\texttt{ieduplicates} and \texttt{iecompdup},
-two commands included in the \texttt{iefieldkit} package,\index{iefieldkit}
+two Stata commands included in the \texttt{iefieldkit}
+package,\index{iefieldkit}
create an automated workflow to identify, correct and document
occurrences of duplicate entries.

From 690b38483b8eb6cd485c3416e7b915403c5a7c72 Mon Sep 17 00:00:00 2001
From: kbjarkefur
Date: Mon, 20 Jan 2020 18:52:59 -0500
Subject: [PATCH 277/854] [ch6] add side note first time iefieldkit is mentioned in chapter

---
 chapters/data-analysis.tex | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex
index ed3b0b0de..fe13a484a 100644
--- a/chapters/data-analysis.tex
+++ b/chapters/data-analysis.tex
@@ -177,7 +177,7 @@ \section{Data cleaning}
and other rounds of data collection.
\texttt{ieduplicates} and \texttt{iecompdup},
two Stata commands included in the \texttt{iefieldkit}
-package,\index{iefieldkit}
+package\index{iefieldkit},\sidenote{\url{https://dimewiki.worldbank.org/wiki/iefieldkit}}
create an automated workflow to identify, correct and document
occurrences of duplicate entries.
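The \texttt{ieduplicates} and \texttt{iecompdup} commands referenced above wrap this workflow in a spreadsheet-based correction report; for readers who want to see the underlying idea, a minimal sketch using only built-in Stata commands looks roughly like the block below. The ID variable \texttt{hhid} and the \texttt{submissiondate} field are hypothetical.

    * Flag survey submissions that share the same (hypothetical) ID
    duplicates tag hhid, generate(dup_flag)
    list hhid submissiondate if dup_flag > 0, sepby(hhid)

    * After duplicates are resolved, confirm the ID uniquely identifies rows
    isid hhid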
From 5d3fed185f7ee1869fa8fdd1f5cc8650602d5dda Mon Sep 17 00:00:00 2001
From: kbjarkefur
Date: Mon, 20 Jan 2020 19:02:10 -0500
Subject: [PATCH 278/854] [ch6] add "also" to put in context of earlier parts of chapter

---
 chapters/data-analysis.tex | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex
index fe13a484a..e01b3f22f 100644
--- a/chapters/data-analysis.tex
+++ b/chapters/data-analysis.tex
@@ -197,7 +197,7 @@ \section{Data cleaning}
However, the last step of data cleaning, describing the data,
will probably still be necessary.
This is a key step to making the data easy to use,
but it can be quite repetitive.
-The \texttt{iecodebook} command suite, part of \texttt{iefieldkit},
+The \texttt{iecodebook} command suite, also part of \texttt{iefieldkit},
is designed to make some of the most tedious components of this process,
such as renaming, relabeling, and value labeling,
much easier (including in data appending).\sidenote{\url{https://dimewiki.worldbank.org/wiki/iecodebook}}

From 7572dcfbc1ea3e805da53f45197582eb5d9fa287 Mon Sep 17 00:00:00 2001
From: Maria
Date: Mon, 20 Jan 2020 22:15:41 -0500
Subject: [PATCH 279/854] Ch5 re-write

- changed 'CAPI' to 'electronic surveys'
- fixed sidenote in intro paragraph
- fixed spacing of sidenotes
- corrected typos and wording
- defined covariates
- added discussion of 'other, specify' as response category
- changed 'open internet' to 'internet'
---
 chapters/data-collection.tex | 147 +++++++++++++++++++----------------
 1 file changed, 80 insertions(+), 67 deletions(-)

diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex
index a4f5aeea1..873bbe436 100644
--- a/chapters/data-collection.tex
+++ b/chapters/data-collection.tex
@@ -5,10 +5,8 @@
Much of the recent push toward credibility in the social sciences has focused on analytical practices.
We contend that credible research depends, first and foremost, on the quality of the raw data.
This chapter covers the data generation workflow, from questionnaire design to field monitoring, for electronic data collection.
There are many excellent resources on questionnaire design and field supervision,
-\sidenote{Grosh, Margaret; Glewwe, Paul. 2000. Designing Household Survey Questionnaires for Developing Countries : Lessons from 15 Years of the Living Standards Measurement Study, Volume 3. Washington, DC: World Bank. © World Bank. \url{https://openknowledge.worldbank.org/handle/10986/15195 License: CC BY 3.0 IGO.}}
-but few covering the particularly challenges and opportunities presented by electronic surveys (often referred to as Computer Assisted Personal Interviews (CAPI)).
-\sidenote{\url{https://dimewiki.worldbank.org/wiki/Computer-Assisted\_Personal\_Interviews\_(CAPI)}}
-As there are many electronic survey tools, we focus on workflows and primary concepts, rather than software-specific tools.
+but few covering the particular challenges and opportunities presented by electronic surveys.
+As there are many survey software platforms, and the market is rapidly evolving, we focus on workflows and primary concepts, rather than software-specific tools.
The chapter covers questionnaire design, piloting and programming; monitoring data quality during the survey; and how to ensure confidential data is handled securely from collection to storage and sharing.
@@ -16,37 +14,45 @@ %------------------------------------------------ -\section{Designing CAPI questionnaires} -A well-designed questionnaire results from careful planning, consideration of analysis and indicators, close review of existing questionnaires, survey pilots, and research team and stakeholder review. -Although most surveys are now collected electronically -- -\textbf{Questionnaire design}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Questionnaire_Design}} +\section{Designing electronic questionnaires} +A well-designed questionnaire results from careful planning, consideration of analysis and indicators, close review of existing questionnaires, survey pilots, and research team and stakeholder review. There are many excellent resources on questionnaire design, such as from the World Bank's Living Standards Measurement Survey. +\sidenote{Grosh, Margaret; Glewwe, Paul. 2000. Designing Household Survey Questionnaires for Developing Countries : Lessons from 15 Years of the Living Standards Measurement Study, Volume 3. Washington, DC: World Bank. © World Bank.\url{https://openknowledge.worldbank.org/handle/10986/15195 License: CC BY 3.0 IGO.}} +The focus of this chapter is the particular design challenges for electronic surveys (often referred to as Computer Assisted Personal Interviews (CAPI)). +\sidenote{\url{https://dimewiki.worldbank.org/wiki/Computer-Assisted\_Personal\_Interviews\_(CAPI)}} + +Although most surveys are now collected electronically, by tablet, mobile phone or web browser, +\textbf{questionnaire design}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Questionnaire_Design}} \index{questionnaire design} (content development) and questionnaire programming (functionality development) should be seen as two strictly separate tasks. -The research team should agree on all questionnaire content and design a paper version before programming a CAPI version. +The research team should agree on all questionnaire content and design a paper version before programming an electronic version. This facilitates a focus on content during the design process and ensures teams have a readable, printable paper version of their questionnaire. Most importantly, it means the research, not the technology, drives the questionnaire design. An easy-to-read paper version of the questionnaire is particularly critical for training enumerators, so they can get an overview of the survey content and structure before diving into the programming. -It is much easier for enumerators to understand all possible response pathways from a paper version, than from swiping question by question. +It is much easier for enumerators to understand all possible response pathways from a paper version than from swiping question by question. Finalizing the questionnaire before programming also avoids version control concerns that arise from concurrent work on paper and electronic survey instruments. Finally, a paper questionnaire is an important documentation for data publication. -The workflow for designing a questionnaire will feel much like writing an essay, or writing pseudocode: begin from broad concepts and slowly flesh out the specifics. It is essential to start with a clear understanding of the -\textbf{theory of change} \sidenote{\url{https://dimewiki.worldbank.org/wiki/Theory_of_Change}} and \textbf{experimental design} for your project. -The first step of questionnaire design is to list key outcomes of interest, as well as the main covariates and variables needed for experimental design. 
-The ideal starting point for this is a \textbf{pre-analysis plan}. \sidenote{\url{https://dimewiki.worldbank.org/wiki/Pre-Analysis_Plan}} +The workflow for designing a questionnaire will feel much like writing an essay, or writing pseudocode: +begin from broad concepts and slowly flesh out the specifics. +It is essential to start with a clear understanding of the +\textbf{theory of change}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Theory_of_Change}} and \textbf{experimental design} for your project. +The first step of questionnaire design is to list key outcomes of interest, as well as the main factors to control for (covariates) and variables needed for experimental design. +The ideal starting point for this is a \textbf{pre-analysis plan}.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Pre-Analysis_Plan}} -Use the list of key outcomes to create an outline of questionnaire \textit{modules} (do not number the modules yet; instead use a short prefix so they can be easily reordered). For each module, determine if the module is applicable to the full sample, the appropriate respondent, and whether (or how often), the module should be repeated. A few examples: a module on maternal health only applies to household with a woman who has children, a household income module should be answered by the person responsible for household finances, and a module on agricultural production might be repeated for each crop the household cultivated. +Use the list of key outcomes to create an outline of questionnaire \textit{modules} (do not number the modules yet; instead use a short prefix so they can be easily reordered). For each module, determine if the module is applicable to the full sample, the appropriate respondent, and whether or how often, the module should be repeated. A few examples: a module on maternal health only applies to household with a woman who has children, a household income module should be answered by the person responsible for household finances, and a module on agricultural production might be repeated for each crop the household cultivated. -Each module should then be expanded into specific indicators to observe in the field. \sidenote{\url{https://dimewiki.worldbank.org/wiki/Literature_Review_for_Questionnaire}} -At this point, it is useful to do a \textbf{content-focused pilot} \sidenote{\url{https://dimewiki.worldbank.org/wiki/Piloting_Survey_Content}} of the questionnaire. +Each module should then be expanded into specific indicators to observe in the field. +\sidenote{\url{https://dimewiki.worldbank.org/wiki/Literature_Review_for_Questionnaire}} +At this point, it is useful to do a \textbf{content-focused pilot} +\sidenote{\url{https://dimewiki.worldbank.org/wiki/Piloting_Survey_Content}} of the questionnaire. Doing this pilot with a pen-and-paper questionnaire encourages more significant revisions, as there is no need to factor in costs of re-programming, and as a result improves the overall quality of the survey instrument. \subsection{Questionnaire design for quantitative analysis} This book covers surveys designed to yield datasets useful for quantitative analysis. This is a subset of surveys, and there are specific design considerations that will help to ensure the raw data outputs are ready for analysis. From a data perspective, questions with pre-coded response options are always preferable to open-ended questions (the content-based pilot is an excellent time to ask open-ended questions, and refine responses for the final version of the questionnaire). 
Coding responses helps to ensure that the data will be useful for quantitative analysis. Two examples help illustrate the point. First, instead of asking ``How do you feel about the proposed policy change?'', use techniques like
-\textbf{Likert scales}\sidenote{\textbf{Likert scale:} an ordered selection of choices indicating the respondent's level of agreement or disagreement with a proposed statement.}. Second, if collecting data on medication use or supplies, you could collect: the brand name of the product; the generic name of the product; the coded compound of the product; or the broad category to which each product belongs (antibiotic, etc.). All four may be useful for different reasons, but the latter two are likely to be the most useful for data analysis. The coded compound requires providing a translation dictionary to field staff, but enables automated rapid recoding for analysis with no loss of information. The generic class requires agreement on the broad categories of interest, but allows for much more comprehensible top-line statistics and data quality checks.
+\textbf{Likert scales}\sidenote{\textbf{Likert scale:} an ordered selection of choices indicating the respondent's level of agreement or disagreement with a proposed statement.}. Second, if collecting data on medication use or supplies, you could collect: the brand name of the product; the generic name of the product; the coded compound of the product; or the broad category to which each product belongs (antibiotic, etc.). All four may be useful for different reasons, but the latter two are likely to be the most useful for data analysis. The coded compound requires providing a translation dictionary to field staff, but enables automated rapid recoding for analysis with no loss of information. The generic class requires agreement on the broad categories of interest, but allows for much more comprehensible top-line statistics and data quality checks. Rigorous field testing is required to ensure that answer categories are comprehensive; however, it is best practice to include an \textit{other, specify} option. Keep track of those responses in the first few weeks of fieldwork; adding an answer category for a response frequently showing up as \textit{other} can save time, as it avoids post-coding.

There is debate over how individual questions should be identified:
formats like \texttt{hq\_1} are hard to remember and unpleasant to reorder,
but formats like \texttt{hq\_asked\_about\_loans} quickly become cumbersome.
We recommend using descriptive names with clear prefixes so that variables
@@ -68,8 +74,8 @@ \subsection{Questionnaire design for quantitative analysis}

\subsection{Content-focused Pilot}
-A \textbf{Survey Pilot} \sidenote{\url{https://dimewiki.worldbank.org/wiki/Survey_Pilot}} is the final step of questionnaire design.
-A Content-focused Pilot \sidenote{\url{https://dimewiki.worldbank.org/wiki/Piloting_Survey_Content}} is best done on pen-and-paper, before the questionnaire is programmed.
+A \textbf{survey pilot}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Survey_Pilot}} is the final step of questionnaire design.
+A content-focused pilot\sidenote{\url{https://dimewiki.worldbank.org/wiki/Piloting_Survey_Content}} is best done on pen-and-paper, before the questionnaire is programmed.
The objective is to improve the structure and length of the questionnaire, refine the phrasing and translation of specific questions, and confirm coded response options are exhaustive.\sidenote{\url{https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Content)&printable=yes}} In addition, it is an opportunity to test and refine all survey protocols, such as how units will be sampled or pre-selected units identified. The pilot must be done out-of-sample, but in a context as similar as possible to the study sample. @@ -77,43 +83,44 @@ \subsection{Content-focused Pilot} %------------------------------------------------ -\section{Programming CAPI questionnaires} +\section{Programming electronic questionnaires} Electronic data collection has great potential to simplify survey implementation and improve data quality. -Electronic questionnaires are typically created in a spreadsheet (e.g. Excel or Google Sheets) or software-specific form builder, accessible even to novice users. \sidenote{\url{https://dimewiki.worldbank.org/wiki/Questionnaire_Programming}} -We will not address software-specific form design in this book; rather, we focus on coding conventions that are important to follow for CAPI regardless of software choice. +Electronic questionnaires are typically created in a spreadsheet (e.g. Excel or Google Sheets) or software-specific form builder, accessible even to novice users. +\sidenote{\url{https://dimewiki.worldbank.org/wiki/Questionnaire_Programming}} +We will not address software-specific form design in this book; rather, we focus on coding conventions that are important to follow for electronic surveys regardless of software choice. \sidenote{\url{https://dimewiki.worldbank.org/wiki/SurveyCTO_Coding_Practices}} -CAPI software tools provide a wide range of features designed to make implementing even highly complex surveys easy, scalable, and secure. +Survey software tools provide a wide range of features designed to make implementing even highly complex surveys easy, scalable, and secure. However, these are not fully automatic: you still need to actively design and manage the survey. -Here, we discuss specific practices that you need to follow to take advantage of CAPI features and ensure that the exported data is compatible with the software that will be used for analysis, and the importance of a data-focused pilot. +Here, we discuss specific practices that you need to follow to take advantage of electronic survey features and ensure that the exported data is compatible with the software that will be used for analysis, and the importance of a data-focused pilot. -\subsection{CAPI workflow} +\subsection{Electronic survey workflow} The starting point for questionnaire programming is a complete paper version of the questionnaire, piloted for content, and translated where needed. Starting the programming at this point reduces version control issues that arise from making significant changes to concurrent paper and electronic survey instruments. -When you start programming, do not start with the first question and program your way to the last question. +When programming, do not start with the first question and proceed straight through to the last question. Instead, code from high level to small detail, following the same questionnaire outline established at design phase. The outline provides the basis for pseudocode, allowing you to start with high level structure and work down to the level of individual questions. This will save time and reduce errors. 
-\subsection{CAPI features} -CAPI surveys are more than simply an electronic version of a paper questionnaire. -All common CAPI software allow you to automate survey logic and add in hard and soft constraints on survey responses. +\subsection{Electronic survey features} +Electronic surveys are more than simply a paper questionnaire displayed on a mobile device or web browser. +All common survey software allow you to automate survey logic and add in hard and soft constraints on survey responses. These features make enumerators' work easier, and they create the opportunity to identify and resolve data issues in real-time, simplifying data cleaning and improving response quality. Well-programmed questionnaires should include most or all of the following features: \begin{itemize} - \item{\textbf{Survey logic}}: build all skip patterns into the survey instrument, to ensure that only relevant questions are asked. This covers simple skip codes, as well as more complex interdependencies (e.g., a child health module is only asked to households that report the presence of a child under 5) - \item{\textbf{Range checks}}: add range checks for all numeric variables to catch data entry mistakes (e.g. age must be less than 120) - \item{\textbf{Confirmation of key variables}}: require double entry of essential information (such as a contact phone number in a survey with planned phone follow-ups), with automatic validation that the two entries match + \item{\textbf{Survey logic}}: build in all logic, so that only relevant questions appear, rather than relying on enumerators to follow complex survey logic. This covers simple skip codes, as well as more complex interdependencies (e.g., a child health module is only asked to households that report the presence of a child under 5). + \item{\textbf{Range checks}}: add range checks for all numeric variables to catch data entry mistakes (e.g. age must be less than 120). + \item{\textbf{Confirmation of key variables}}: require double entry of essential information (such as a contact phone number in a survey with planned phone follow-ups), with automatic validation that the two entries match. \item{\textbf{Multimedia}}: electronic questionnaires facilitate collection of images, video, and geolocation data directly during the survey, using the camera and GPS built into the tablet or phone. \item{\textbf{Preloaded data}}: data from previous rounds or related surveys can be used to prepopulate certain sections of the questionnaire, and validated during the interview. \item{\textbf{Filtered response options}}: filters reduce the number of response options dynamically (e.g. filtering the cities list based on the state provided). \item{\textbf{Location checks}}: enumerators submit their actual location using in-built GPS, to confirm they are in the right place for the interview. - \item{\textbf{Consistency checks}}: check that answers to related questions align, and trigger a warning if not so that enumerators can probe further. For example, if a household reports producing 800 kg of maize, but selling 900 kg of maize from their own production. + \item{\textbf{Consistency checks}}: check that answers to related questions align, and trigger a warning if not so that enumerators can probe further (.e.g., if a household reports producing 800 kg of maize, but selling 900 kg of maize from their own production). \item{\textbf{Calculations}}: make the electronic survey instrument do all math, rather than relying on the enumerator or asking them to carry a calculator. 
\end{itemize} \subsection{Compatibility with analysis software} -All CAPI software include debugging and test options to correct syntax errors and make sure that the survey instruments will successfully compile. +All survey software include debugging and test options to correct syntax errors and make sure that the survey instruments will successfully compile. This is not sufficient, however, to ensure that the resulting dataset will load without errors in your data analysis software of choice. We developed the \texttt{ietestform} command,\sidenote{\url{https://dimewiki.worldbank.org/wiki/ietestform}} part of @@ -129,32 +136,32 @@ \subsection{Compatibility with analysis software} ranges are included for numeric variables. \texttt{ietestform} also removes all leading and trailing blanks from response lists, which could be handled inconsistently across software. -\subsection{Data-focused Pilot} -The final stage of questionnaire programming is another Survey Pilot. -The objective of the Data-focused Pilot \sidenote{\url{https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Data)&printable=yes}} is to validate programming and export a sample dataset. +\subsection{Data-focused pilot} +The final stage of questionnaire programming is another survey pilot. +The objective of the data-focused pilot\sidenote{\url{https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Data)&printable=yes}} is to validate programming and export a sample dataset. Significant desk-testing of the instrument is required to debug the programming as fully as possible before going to the field. It is important to plan for multiple days of piloting, so that any further debugging or other revisions to the electronic survey instrument can be made at the end of each day and tested the following, until no further field errors arise. -The Data-focused pilot should be done in advance of Enumerator training +The data-focused pilot should be done in advance of enumerator training %------------------------------------------------ \section{Data quality assurance} -A huge advantage of CAPI surveys, compared to traditional paper surveys, is the ability to access and analyze the data while the survey is ongoing. +A huge advantage of electronic surveys, compared to traditional paper surveys, is the ability to access and analyze the data while the survey is ongoing. Data issues can be identified and resolved in real-time. Designing systematic data checks, and running them routinely throughout data collection, simplifies monitoring and improves data quality. As part of survey preparation, the research team should develop a \textbf{data quality assurance plan} \sidenote{\url{https://dimewiki.worldbank.org/wiki/Data_Quality_Assurance_Plan}}. While data collection is ongoing, a research assistant or data analyst should work closely with the field team to ensure that the survey is progressing correctly, and perform \textbf{high-frequency checks (HFCs)} of the incoming data. \sidenote{\url{https://github.com/PovertyAction/high-frequency-checks/wiki}} -Data quality assurance requires a combination of both real-time data checks, survey audits, and field monitoring. Although field monitoring is critical for a successful survey, we focus on the first two in this chapter, as they are the most directly data related. +Data quality assurance requires a combination of real-time data checks and survey audits. 
Careful field supervision is also essential for a successful survey; however, we focus on the first two in this chapter, as they are the most directly data related. -\subsection{High Frequency Checks} -High-frequency checks should carefully inspect key treatment and outcome variables so that the data quality of core experimental variables is uniformly high, and that additional field effort is centered where it is most important. +\subsection{High frequency checks} +High-frequency checks (HFCs) should carefully inspect key treatment and outcome variables so that the data quality of core experimental variables is uniformly high, and that additional field effort is centered where it is most important. Data quality checks should be run on the data every time it is downloaded (ideally on a daily basis), to flag irregularities in survey progress, sample completeness or response quality. \texttt{ipacheck}\sidenote{\url{https://github.com/PovertyAction/high-frequency-checks}} is a very useful command that automates some of these tasks. -It is important to check every day that the households interviewed match the survey sample. -Many CAPI software programs include case management features, through which sampled units are directly assigned to individual enumerators. +It is important to check every day that the units interviewed match the survey sample. +Many survey software include case management features, through which sampled units are directly assigned to individual enumerators. Even with careful management, it is often the case that raw data includes duplicate entries, which may occur due to field errors or duplicated submissions to the server. \sidenote{\url{https://dimewiki.worldbank.org/wiki/Duplicates_and_Survey_Logs}} \texttt{ieduplicates}\sidenote{\url{https://dimewiki.worldbank.org/wiki/ieduplicates}} @@ -165,24 +172,25 @@ \subsection{High Frequency Checks} Tracking survey progress is important for monitoring attrition, so that it is clear early on if a change in protocols or additional tracking will be needed. It is also important to check interview completion rate and sample compliance by surveyor and survey team, to identify any under-performing individuals or teams. -When all data collection is complete, the survey team should prepare a final field report, which should report reasons for any deviations between the original sample and the dataset collected. +When all data collection is complete, the survey team should prepare a final field report, +which should report reasons for any deviations between the original sample and the dataset collected. Identification and reporting of \textbf{missing data} and \textbf{attrition} is critical to the interpretation of survey data. -It is important to structure this reporting in a way that not only group broads rationales into specific categories +It is important to structure this reporting in a way that not only groups broad rationales into specific categories but also collects all the detailed, open-ended responses to questions the field team can provide for any observations that they were unable to complete. This reporting should be validated and saved alongside the final raw data, and treated the same way. This information should be stored as a dataset in its own right -- a \textbf{tracking dataset} -- that records all events in which survey substitutions and loss to follow-up occurred in the field and how they were implemented and resolved. -High frequency checks should also include survey-specific data checks. 
As CAPI
+High frequency checks should also include survey-specific data checks. As electronic
survey software incorporates many data control features, discussed above, these checks
should focus on issues survey software cannot check automatically. As most of
these checks are survey specific, it is difficult to provide general guidance.
An in-depth knowledge of the questionnaire, and a careful examination of the
pre-analysis plan, is the best preparation. Examples include consistency
across multiple responses, complex calculations, suspicious patterns in survey
timing, or atypical response patterns from specific enumerators.
\sidenote{\url{https://dimewiki.worldbank.org/wiki/Monitoring_Data_Quality}}

-CAPI software typically provides rich metadata, which can be useful in
+Survey software typically provides rich metadata, which can be useful in
assessing interview quality. For example, automatically collected time stamps show how long enumerators spent per question, and trace histories show how many times answers were changed before the survey was submitted.

@@ -193,7 +201,7 @@ \subsection{Data considerations for field monitoring}
\textbf{Back-checks}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Back_Checks}} and other survey audits help ensure that enumerators are following established protocols, and are not falsifying data.
For back-checks, a random subset of the field sample is chosen and a subset of information from the full survey is verified through a brief interview with the original respondent.
Design of the back-check questionnaire follows the same survey design
-principles discussed above, in particular you should use the pre-analysis plan
+principles discussed above: you should use the pre-analysis plan
or list of key outcomes to establish which subset of variables to prioritize.

Real-time access to the survey data increases the potential utility of
@@ -203,9 +211,10 @@ \subsection{Data considerations for field monitoring}
appropriately apportioned across interviews and survey teams.
As soon as back-checks are complete, the back-check data can be tested against
the original data to identify areas of concern in real-time.
\texttt{bcstats} is a useful Stata module for analyzing back-check data.
\sidenote{\url{https://ideas.repec.org/c/boc/bocode/s458173.html}}

Electronic surveys also provide a unique opportunity to do audits through audio recordings of the interview, typically short recordings triggered at random throughout the questionnaire.
\textbf{Audio audits} are a useful means to assess whether the enumerator is conducting the interview as expected (and not sitting under a tree making up data).
Do note, however, that audio audits must be included in the Informed Consent.

\subsection{Secure data in the field}
All mainstream data collection software automatically \textbf{encrypt}
\sidenote{\textbf{Encryption:} the process of making information unreadable to
anyone without access to a specific deciphering key. 
\url{https://dimewiki.worldbank.org/wiki/Encryption}}
+all data submitted from the field while in transit (i.e., upload or download). Your data will be encrypted from the time it leaves the device (in tablet-assisted data collection) or your browser (in web data collection), until it reaches the server. Therefore, as long as you are using established survey software, this step is largely taken care of. However, the research team must ensure that all computers, tablets, and accounts used have a logon password and are never left unlocked.

\subsection{Secure data storage}
\textbf{Encryption at rest} is the only way to ensure that PII data remains private when it is stored on someone else's server on the internet. You must keep your data encrypted on the server whenever PII data is collected.
Encryption makes data files completely unusable without access to a security key specific to that data -- a higher level of security than password-protection.
Encryption at rest requires active participation from the user, and you should be fully aware that if your private key is lost, there is absolutely no way to recover your data.

You should not assume that your data is encrypted by default: indeed, for most survey software platforms, encryption needs to be enabled by the user.
To enable it, you must confirm you know how to operate the encryption system and understand the consequences if basic protocols are not followed.
When you enable encryption, the service will allow you to download -- once -- the keyfile pair needed to decrypt the data.
You must download and store this in a secure location, such as a password manager.
Make sure you store keyfiles with descriptive names to match the survey to which they correspond.
Any time anyone accesses the data -- either when viewing it in the browser or downloading it to your computer -- they will be asked to provide the keyfile.
Only project team members named in the IRB are allowed access to the private keyfile.
To proceed with data analysis, you typically need a working copy of the data accessible from a personal computer.
The following workflow allows you to receive data from the server and store it securely, without compromising data security.
\begin{enumerate} \item Download data \item Store a ``master'' copy of the data into an encrypted location that will remain accessible on disk and be regularly backed up - \item Secure a ``gold master'' copy of the raw data in a secure location, such as a long-term cloud storage service or an encrypted physical hard drive stored in a separate location. If you remain lucky, you will never have to access this copy -- you just want to know it is out there, safe, if you need it. + \item Create a ``gold master'' copy of the raw data in a secure location, such as a long-term cloud storage service or an encrypted physical hard drive stored in a separate location. If you remain lucky, you will never have to access this copy -- you just want to know it is out there, safe, if you need it. \end{enumerate} @@ -263,7 +272,7 @@ \subsection{Secure data sharing} The \textbf{initial de-identification} should happen directly after the encrypted data is downloaded to disk. At this time, for each variable that contains PII, ask: will this variable be needed for analysis? If not, the variable should be dropped. Examples include respondent names, enumerator names, interview date, respondent phone number. If the variable is needed for analysis, ask: can I encode or otherwise construct a variable to use for the analysis that masks the PII, and drop the original variable? -Examples include: geocoordinates - after construction measures of distance or area, the specific location is often not necessary; and names for social network analysis, which can be encoded to unique numeric IDs. +Examples include geocoordinates (after constructing measures of distance or area, drop the specific location), and names for social network analysis (can be encoded to unique numeric IDs). If PII variables are directly required for the analysis itself, it will be necessary to keep at least a subset of the data encrypted through the data analysis process. Flagging all potentially identifying variables in the questionnaire design stage, as recommended above, simplifies the initial de-identification. @@ -271,11 +280,15 @@ \subsection{Secure data sharing} If so, all you need to do is write a script to drop the variables that are not required for analysis, encode or otherwise mask those that are required, and save a working version of the data. The \textbf{final de-identification} is a more involved process, with the objective of creating a dataset for publication that cannot be manipulated or linked to identify any individual research participant. -You must remove all direct and indirect identifiers, and assess the risk of statistical disclosure. \sidenote{Disclosure risk: the likelihood that a released data record can be associated with an individual or organization}. +You must remove all direct and indirect identifiers, and assess the risk of statistical disclosure. +\sidenote{Disclosure risk: the likelihood that a released data record can be associated with an individual or organization}. \index{statistical disclosure} -There will almost always be a trade-off between accuracy and privacy. For publicly disclosed data, you should always favor privacy. -There are a number of useful tools for de-identification: PII scanners for Stata \sidenote{\url{https://github.com/J-PAL/stata_PII_scan}} or R \sidenote{\url{https://github.com/J-PAL/PII-Scan}}, -and tools for statistical disclosure control. \sidenote{\url{https://sdcpractice.readthedocs.io/en/latest/}} +There will almost always be a trade-off between accuracy and privacy. 
For publicly disclosed data, you should favor privacy. +There are a number of useful tools for de-identification: PII scanners for Stata +\sidenote{\url{https://github.com/J-PAL/stata_PII_scan}} or R +\sidenote{\url{https://github.com/J-PAL/PII-Scan}}, +and tools for statistical disclosure control. +\sidenote{\url{https://sdcpractice.readthedocs.io/en/latest/}} In cases where PII data is required for analysis, we recommend embargoing the sensitive variables when publishing the data. Access to the embargoed data could be granted for the purposes of study replication, if approved by an IRB. From 392ac7c17be3c792fd9d3b999979b379b2e74759 Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 21 Jan 2020 08:52:30 -0500 Subject: [PATCH 280/854] [ch6] itallic sentence --- chapters/data-analysis.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index e01b3f22f..ed8ce5075 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -13,7 +13,7 @@ while making sure that code and outputs do not become tangled and lost over time. When it comes to code, though, analysis is the easy part, -as long as you have organized your data well. +\textit{as long as you have organized your data well}. Of course, there is plenty of complexity behind it: the econometrics, the theory of change, the measurement methods, and so much more. But none of those are the subject of this book. From cc713b37325e9a53608f1d574762f190ce21dc16 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Tue, 21 Jan 2020 09:22:09 -0500 Subject: [PATCH 281/854] [ch7] add sidenote - Rmarkdown, stata pandoc and jupyter --- chapters/publication.tex | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 38a8f1527..6416c36bb 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -81,10 +81,11 @@ \subsection{Dynamic documents} They fall into two broad groups -- the first compiles a document as part of code execution, and the second operates a separate document compiler. -In the first group are tools such as R's RMarkdown and Stata's \texttt{dyndoc}. +In the first group are tools such as R's RMarkdown\sidenote{\url{https://rmarkdown.rstudio.com/}} +and Stata's \texttt{dyndoc}\sidenote{\url{https://www.stata.com/manuals/rptdyndoc.pdf}}. These tools ``knit'' or ``weave'' text and code together, and are programmed to insert code outputs in pre-specified locations. -Documents called ``notebooks'' (such as Jupyter) work similarly, +Documents called ``notebooks'' (such as Jupyter\sidenote{\url{https://jupyter.org/}}) work similarly, as they also use the underlying analytical software to create the document. These types of dynamic documents are usually appropriate for short or informal materials because they tend to offer limited editability outside the base software From c80a09586c29962ce4d9a9a2c769f6c50790486e Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 21 Jan 2020 09:28:27 -0500 Subject: [PATCH 282/854] [ch6] Adding Kris' review comments --- chapters/data-analysis.tex | 52 ++++++++++++++++++-------------------- 1 file changed, 24 insertions(+), 28 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index ed8ce5075..9c7cc6042 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -17,7 +17,7 @@ Of course, there is plenty of complexity behind it: the econometrics, the theory of change, the measurement methods, and so much more. 
But none of those are the subject of this book. -Instead, this chapter will focus on how to organize your data work. +\textit{Instead, this chapter will focus on how to organize your data work so that coding the analysis becomes easy}. Most of a Research Assistant's time is spent cleaning data and getting it into the right format. When the practices recommended here are adopted, analyzing the data is as simple as using a command that is already implemented in a statistical software. @@ -145,12 +145,11 @@ \section{Data cleaning} These files should be retained in the raw data folder \textit{exactly as they were received}. The folder must be encrypted if it is shared in an insecure fashion,\sidenote{\url{https://dimewiki.worldbank.org/wiki/Encryption}} and it must be backed up in a secure offsite location. -Everything else can be replaced, but raw data cannot. -Therefore, raw data should never be interacted with directly. +Every other file is created from the raw data, and therefore can be recreated. +The exception, of course, is the raw data itself, so it should never be edited directly. -Secure storage of the raw\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Security}} -data means access to it will be restricted even inside the research team. -Loading encrypted data multiple times it can be annoying. +Secure storage of the raw data means access to it will be restricted even inside the research team.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Security}} +Loading encrypted data frequently can be disruptive to the work flow. To facilitate the handling of the data, remove any personally identifiable information from the data set. This will create a de-identified data set, that can be saved in a non-encrypted folder. De-identification,\sidenote{\url{https://dimewiki.worldbank.org/wiki/De-identification}} @@ -192,24 +191,21 @@ \section{Data cleaning} and how the correct value was obtained. % Data description ------------------------------------------------------------------ -Note that if you are using secondary data, -the tasks described above can likely be skipped. -However, the last step of data cleaning, describing the data, -will probably still be necessary. +Note that if you are using secondary data, the tasks described above can likely be skipped. +The last step of data cleaning, however, will most likely still be necessary. +It consists of describing the data, so that its users have all the information needed to interact with it. This is a key step to making the data easy to use, but it can be quite repetitive. The \texttt{iecodebook} command suite, also part of \texttt{iefieldkit}, is designed to make some of the most tedious components of this process, -such as renaming, relabeling, and value labeling, -much easier (including in data appending).\sidenote{\url{https://dimewiki.worldbank.org/wiki/iecodebook}} +such as renaming, relabeling, and value labeling, much easier.\sidenote{\url{https://dimewiki.worldbank.org/wiki/iecodebook}} \index{iecodebook} We have a few recommendations on how to use this command for data cleaning. -First, we suggest keeping the same variable names as in the survey instrument, -so it's easy to connect the two files. -Don't skip the labelling. +First, we suggest keeping the same variable names as in the survey instrument, so it's straightforward to link data points for a variable to the question that originated them. +Second, don't skip the labeling. 
Applying labels makes it easier to understand what the data is showing while exploring the data. This minimizes the risk of small errors making their way through into the analysis stage. Variable and value labels should be accurate and concise.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning\#Applying\_Labels}} -Recodes should be used to turn codes for ``Don't know'', ``Refused to answer'', and +Third, recodes should be used to turn codes for ``Don't know'', ``Refused to answer'', and other non-responses into extended missing values.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning\#Survey\_Codes\_and\_Missing\_Values}} String variables need to be encoded, and open-ended responses, categorized or dropped\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning\#Strings}} (unless you are using qualitative or classification analyses, which are less common). @@ -282,7 +278,7 @@ \section{Indicator construction} The second is to ensure that variable definition is consistent across data sources. Unlike cleaning, construction can create many outputs from many inputs. Let's take the example of a project that has a baseline and an endline survey. -Unless the two instruments are exactly the same, which is unlikely, the data cleaning for them will require different steps, and therefore will be done separately. +Unless the two instruments are exactly the same, which is preferable but often not the case, the data cleaning for them will require different steps, and therefore will be done separately. However, you still want the constructed variables to be calculated in the same way, so they are comparable. So you want to construct indicators for both rounds in the same code, after merging them. @@ -334,10 +330,11 @@ \section{Indicator construction} there may be a \texttt{data-wide.dta}, \texttt{data-wide-children-only.dta}, \texttt{data-long.dta}, \texttt{data-long-counterfactual.dta}, and many more as needed. -One thing all constructed data sets should have in common, though, -are functionally-named variables. -As you no longer need to worry about keeping variable names -consistent with the survey, they should be as intuitive as possible. +One thing all constructed data sets should have in common, though, are functionally-named variables. +Constructed variables are called ``constructed'' because they were not present in the survey to start with, +so making their names consistent with the survey form is not as crucial. +Of course, whenever possible, having variables names that are both intuitive and can be linked to the survey is ideal. +However, functionality should be prioritized here. Remember to consider keeping related variables together and adding notes to each as necessary. % Documentation @@ -376,18 +373,18 @@ \section{Writing data analysis code} The way you deal with code and outputs for exploratory and final analysis is different, and this section will discuss how. % Organizing scripts --------------------------------------------------------- During exploratory data analysis, you will be tempted to write lots of analysis into one big, impressive, start-to-finish script. -Though it's fine to write such a script during a long analysis meeting, this practice is error-prone. -It subtly encourages poor practices such as not clearing the workspace and not loading fresh data. +Although it's fine to write such a script if you are coding in real-time during a long analysis meeting with your PIs, this practice is error-prone. 
+It subtly encourages poor practices such as not clearing the workspace and not loading fresh data for each analysis task.
 It's important to take the time to organize scripts in a clean manner and to avoid mistakes.
 A well-organized analysis script starts with a completely fresh workspace and explicitly loads data before analyzing it.
 This encourages data manipulation to be done earlier in the workflow (that is, during construction).
-It also and prevents you from accidentally writing pieces of analysis code that depend on one another, leading to the too-familiar ``run this part, then that part, then this part'' process.
+It also prevents you from accidentally writing pieces of analysis code that depend on one another and require manual instructions for all required code snippets to be run in the right order.
 Each script should run completely independently of all other code.
 You can go as far as coding every output in a separate script.
 There is nothing wrong with code files being short and simple --
 as long as they directly correspond to specific pieces of analysis.
-Analysis files should be as simple as possible, so you can focus on the econometrics.
+Analysis files should be as simple as possible, so whoever is reading them can focus on the econometrics.
 All research decisions should be made very explicit in the code.
 This includes clustering, sampling, and control variables, to name a few.
 If you have multiple analysis data sets, each of them should have a descriptive name about its sample and unit of observation.
 As your team comes to a decision about model specification, you can create globals or objects in the master script to use across scripts.
 This is a good way to make sure specifications are consistent throughout the analysis. It's also very dynamic, making it easy to update all scripts if needed.
 It is completely acceptable to have folders for each task,
 and compartmentalize each analysis as much as needed.
 It is always better to have more code files open than to keep scrolling inside a given file.
@@ -407,7 +404,7 @@ \section{Writing data analysis code}
 Our team has created a few products to automate common outputs and save you precious research time.
 The \texttt{ietoolkit} package includes two commands to export nicely formatted tables.
-\texttt{iebaltab}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Iebaltab}} creates and exports balance tables to excel and {\TeX}.
+\texttt{iebaltab}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Iebaltab}} creates and exports balance tables to excel or {\LaTeX}.
 \texttt{ieddtab}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Ieddtab}} does the same for difference-in-differences regressions.
 The \textbf{Stata Visual Library}\sidenote{ \url{https://worldbank.github.io/Stata-IE-Visual-Library/}} has examples of graphs created in Stata and curated by us.\sidenote{A similar resource for R is \textit{The R Graph Gallery}. \\\url{https://www.r-graph-gallery.com/}}
@@ -469,8 +466,7 @@ \section{Exporting analysis outputs}
 instead of \texttt{graph save}, which creates a \texttt{.gph} file that can only be opened through a Stata installation.
 Some publications require ``lossless'' TIFF of EPS files, which are created by specifying the desired extension.
 For tables, \texttt{.tex} is preferred.
-Excel \texttt{.xlsx} and \texttt{.csv} files are also acceptable,
-although if you are working on a large report they will become cumbersome to update after revisions.
+Excel \texttt{.xlsx} and \texttt{.csv} files are also acceptable, but require the extra step of copying the tables into the final output, so it can be cumbersome to ensure that your paper or report is always up-to-date.
 Whichever format you decide to use, remember to always specify the file extension explicitly.
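To make the export workflow just described concrete, a minimal sketch is below. The folder globals, variable names, and formatting options are hypothetical, and the \texttt{iebaltab} options in particular should be verified against its help file.

    * iebaltab is provided by the user-written ietoolkit package
    ssc install ietoolkit, replace

    * Export a balance table directly to a .tex file
    use "${data}/analysis_household.dta", clear
    iebaltab age hh_size baseline_income, grpvar(treatment) savetex("${outputs}/tab1_balance.tex") replace

    * Draw a graph, then export it with the file extension written out explicitly
    histogram baseline_income
    graph export "${outputs}/fig2_income.png", replace width(2000)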
% Formatting From cdaa82c77e787d8fa7deeb4eabb388c89edefc78 Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 21 Jan 2020 09:47:15 -0500 Subject: [PATCH 283/854] [ch6] typo --- chapters/data-analysis.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 9c7cc6042..9bf055430 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -149,7 +149,7 @@ \section{Data cleaning} The exception, of course, is the raw data itself, so it should never be edited directly. Secure storage of the raw data means access to it will be restricted even inside the research team.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Security}} -Loading encrypted data frequently can be disruptive to the work flow. +Loading encrypted data frequently can be disruptive to the workflow. To facilitate the handling of the data, remove any personally identifiable information from the data set. This will create a de-identified data set, that can be saved in a non-encrypted folder. De-identification,\sidenote{\url{https://dimewiki.worldbank.org/wiki/De-identification}} From 986930ce31bea45f857c944877430b2232aae69e Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 10:06:38 -0500 Subject: [PATCH 284/854] Comments and code header --- chapters/planning-data-work.tex | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index e34738edc..a218538a0 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -368,7 +368,7 @@ \subsection{Organizing files and folder structures} The command also has some flexibility for the addition of folders for non-primary data sources, although this is less well developed. The \texttt{ietoolkit} package also includes the \texttt{iegitaddmd} command, -which can place \texttt{README.md} placeholder files in your folders so that +which can place \texttt{README.md} placeholder files in your folders so that your folder structure can be shared using Git. Since these placeholder files are in \textbf{Markdown} they also provide an easy way to document the contents of every folder in the structure. @@ -474,7 +474,13 @@ \subsection{Documenting and organizing code} Code documentation is one of the main factors that contribute to readability. Start by adding a code header to every file. -This should include simple things such as the purpose of the script and the name of the person who wrote it. +A code header is a long \textbf{comment}\sidenote{ + \textbf{Comments:} Code components that have no function, + but describe in plain language what the code is supposed to do. +} +that details the functionality of the entire script. +This should include simple things such as +the purpose of the script and the name of the person who wrote it. If you are using a version control software, the last time a modification was made and the person who made it will be recorded by that software. Otherwise, you should include it in the header. 
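For instance, a do-file header of the kind described above might look roughly like the following; the project details and section titles are of course placeholders.

    /***************************************************************************
      Purpose : Clean the baseline household survey (illustrative example)
      Author  : Name of the original author
      Index   : PART 1: Load de-identified raw data
                PART 2: Correct and recode values
                PART 3: Label, document, and save the cleaned data set
    ***************************************************************************/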
@@ -622,7 +628,7 @@ \subsection{Output management} Though formatted text software such as Word and PowerPoint are still prevalent, researchers are increasingly choosing to prepare final outputs like documents and presentations using {\LaTeX}\index{{\LaTeX}}.\sidenote{ - \url{https://www.latex-project.org} and \url{https://github.com/worldbank/DIME-LaTeX-Templates}.} + \url{https://www.latex-project.org} and \url{https://github.com/worldbank/DIME-LaTeX-Templates}.} {\LaTeX} is a document preparation system that can create both text documents and presentations. The main advantage is that {\LaTeX} uses plaintext for all formatting, and it is necessary to learn its specific markup convention to use it. From 80a42c84df22acb662032c394e36305777a479b0 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 10:09:10 -0500 Subject: [PATCH 285/854] Chunk independence --- chapters/planning-data-work.tex | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index a218538a0..c2ada16a7 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -508,9 +508,10 @@ \subsection{Documenting and organizing code} Code organization means keeping each piece of code in an easily findable location. \index{code organization} -Breaking your code into independently readable ``chunks'' is one good practice on code organization, -because it ensures each component does not depend on a complex program state -created by other chunks that are not obvious from the immediate context. +Breaking your code into independently readable ``chunks'' is one good practice on code organization. +You should write each functional element as a chunk that can run completely on its own, +to ensure that each component does not depend on a complex program state +created by other code chunks that are not obvious from the immediate context. One way to do this is to create sections where a specific task is completed. So, for example, if you want to find the line in your code where a variable was created, you can go straight to \texttt{PART 2: Create new variables}, From 9f7135f5359c5685c9fad35ad4c7acce86fcfea9 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 10:11:19 -0500 Subject: [PATCH 286/854] Code index --- chapters/planning-data-work.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index c2ada16a7..90d08dd23 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -520,7 +520,7 @@ \subsection{Documenting and organizing code} and it compiles them into an interactive script index for you. In Stata, you can use comments to create section headers, though they're just there to make the reading easier and don't have functionality. -Adding an index to the header by copying and pasting section titles is the easiest way to create a code map. +You should also add an index in the code header by copying and pasting section titles. You can then add and navigate through them using the \texttt{find} command. Since Stata code is harder to navigate, as you will need to scroll through the document, it's particularly important to avoid writing very long scripts. From 92f4a942d26172f943ee9fe9f1359f6ae3555547 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 10:14:36 -0500 Subject: [PATCH 287/854] Release the code! 
--- chapters/planning-data-work.tex | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 90d08dd23..57b958fb9 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -576,7 +576,8 @@ \subsection{Documenting and organizing code} One other important advantage of code review if that making sure that the code is running properly on other machines, and that other people can read and understand the code easily, -is the easiest way to be prepared in advance for a smooth project handover. +is the easiest way to be prepared in advance for a smooth project handover +or for release of the code to the general public. % ---------------------------------------------------------------------------------------------- \subsection{Output management} From c71b4ba071c4add8739dbcd88b55ab2861aa79d3 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 10:19:10 -0500 Subject: [PATCH 288/854] Simplify Git reference --- chapters/planning-data-work.tex | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 57b958fb9..64ffec249 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -598,12 +598,10 @@ \subsection{Output management} Raw outputs in plaintext formats like \texttt{.tex} and \texttt{.eps} can be created from most analytical software and managed with Git. Tracking plaintext outputs with Git makes it easier to identify changes that affect results. -If you are re-running all of your code from the master when significant changes to the code are made, -the outputs will be overwritten, and changes in coefficients and number of observations, for example, -will be highlighted for you to review. -In fact, one of the most effective ways to check code quickly -is simply to commit all your code and outputs using Git, -then re-run the entire thing and examine any flagged changes in the directory. +If you are re-running all of your code from the master script, +the outputs will be overwritten, +and any changes in coefficients and number of observations, for example, +will be automatically flagged for you or a reviewer to check. No matter what choices you make, you will need to make updates to your outputs quite frequently. From 825da0e33d7e13898d08ab76af09ab7bfef64487 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 10:20:02 -0500 Subject: [PATCH 289/854] Draft code --- chapters/planning-data-work.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 64ffec249..88f321f45 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -609,9 +609,9 @@ \subsection{Output management} that it can be hard to remember where you saved the code that created it. Here, naming conventions and code organization play a key role in not re-writing scripts again and again. -It is common for teams to maintain one analyisis file or folder with ``exploratory analysis'', +It is common for teams to maintain one analyisis file or folder with draft code or ``exploratory analysis'', which are pieces of code that are stored only to be found again in the future, -but not cleaned up to be included in any outputs yet. +but not cleaned up to be included in any final outputs yet. 
Once you are happy with a result or output, it should be named and moved to a dedicated location. It's typically desirable to have the names of outputs and scripts linked, From 271801d46060e8632ee3d7a30d4e9ecb3403c7a5 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 10:21:08 -0500 Subject: [PATCH 290/854] Special naming requirements --- chapters/planning-data-work.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 88f321f45..6646bb368 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -434,7 +434,7 @@ \subsection{Organizing files and folder structures} These rules will ensure you can find files within folders and reduce the amount of time others will spend opening files to find out what is inside them. -The main point to be considered is that files accessed by code face more restrictions\sidenote{ +The main point to be considered is that files accessed by code have special naming requirements\sidenote{ \url{http://www2.stat.duke.edu/~rcs46/lectures_2015/01-markdown-Git/slides/naming-slides/naming-slides.pdf}}, since different software and operating systems read file names in different ways. Some of the differences between the two naming approaches are major and may be new to you, From 498d58f8ec19797de9be6d2aed40b1fdae0b969e Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 10:26:31 -0500 Subject: [PATCH 291/854] Fix #319 --- chapters/research-design.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index eb41d8703..0c897a03b 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -438,7 +438,7 @@ \subsection{Instrumental variables} \textbf{Instrumental variables (IV)} designs, unlike the previous approaches, begin by assuming that the treatment delivered in the study in question is -inextricably linked to the outcomes and therefore not directly identifiable. +linked to outcomes such that the effect is not directly identifiable. Instead, similar to regression discontinuity designs, IV attempts to focus on a subset of the variation in treatment uptake and assesses that limited window of variation that can be argued @@ -455,7 +455,7 @@ \subsection{Instrumental variables} As in regression discontinuity designs, the fundamental form of the regression is similar to either cross-sectional or differences-in-differences designs. 
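In practice, such a design is usually estimated with a packaged instrumental-variables routine rather than by hand; a minimal sketch, with entirely hypothetical variable names, is:

    * Randomized offer used as an instrument for actual program take-up
    use "${data}/analysis_household.dta", clear
    ivregress 2sls outcome age hh_size (program_takeup = randomized_offer), vce(cluster village_id)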
-However, instead of controlling for the running variable directly, +However, instead of controlling for the instrument directly, the IV approach typically uses the \textbf{two-stage-least-squares (2SLS)} estimator.\sidenote{ \url{http://www.nuff.ox.ac.uk/teaching/economics/bond/IV\%20Estimation\%20Using\%20Stata.pdf}} This estimator forms a prediction of the probability that the unit receives treatment From 1b9d0dd9e1ac7b7396ee5dde08ff1263565509f1 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 10:37:21 -0500 Subject: [PATCH 292/854] Specific seed --- chapters/sampling-randomization-power.tex | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 7eb9f7027..40f9fb022 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -116,19 +116,19 @@ \subsection{Reproducibility in random Stata processes} since Stata's \texttt{version} setting expires after each time you run your do-files. \textbf{Sorting} means that the actual data that the random process is run on is fixed. -Because numbers are assigned to each observation in row-by-row starting from +Because numbers are assigned to each observation in row-by-row starting from the top row, changing their order will change the result of the process. A corollary is that the underlying data must be unchanged between runs: you must make a fixed final copy of the data when you run a randomization for fieldwork. In Stata, the only way to guarantee a unique sorting order is to use \texttt{isid [id\_variable], sort}. (The \texttt{sort, stable} command is insufficient.) -You can additionally use the \texttt{datasignature} command to make sure the +You can additionally use the \texttt{datasignature} command to make sure the data is unchanged. \textbf{Seeding} means manually setting the start-point of the randomization algorithm. -You can draw a six-digit seed randomly by visiting \url{http://bit.ly/stata-random}. -(This link is a shortcut to a specific request on \url{https://www.random.org}.) +You can draw a uniformly distributed six-digit seed randomly by visiting \url{http://bit.ly/stata-random}. +(This link is a just shortcut to request such a random seed on \url{https://www.random.org}.) There are many more seeds possible but this is a large enough set for most purposes. In Stata, \texttt{set seed [seed]} will set the generator to that state. You should use exactly one seed per randomization process. @@ -143,9 +143,9 @@ \subsection{Reproducibility in random Stata processes} To confirm that a randomization has worked well before finalizing its results, save the outputs of the process in a temporary location, -re-run the code, and use \texttt{cf} or \texttt{datasignature} to ensure -nothing has changed. It is also advisable to let someone else reproduce your -randomization results on their machine to remove any doubt that your results +re-run the code, and use \texttt{cf} or \texttt{datasignature} to ensure +nothing has changed. It is also advisable to let someone else reproduce your +randomization results on their machine to remove any doubt that your results are reproducable. %----------------------------------------------------------------------------------------------- @@ -185,7 +185,7 @@ \subsection{Sampling} That master list may be called a \textbf{sampling universe}, a \textbf{listing frame}, or something similar. 
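Before turning to how that master list is organized, here is a minimal sketch of the version-sort-seed pattern described above. The file, ID variable, and seed value are all hypothetical, and the seed should come from a true random draw.

    version 13.1                    // versioning: fix the random-number algorithm
    use "${master}/master_households.dta", clear
    isid hhid, sort                 // sorting: enforce a unique, reproducible order
    set seed 758762                 // seeding: one random seed for this one process
    gen random_draw = runiform()
    sort random_draw
    gen treatment = (_n <= _N/2)    // assign the first half of the shuffled list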
We recommend that this list be organized in a \textbf{master data set}\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Master_Data_Set}}, -creating an authoritative source for the existence and fixed +creating an authoritative source for the existence and fixed characteristics of each of the units that may be surveyed.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Unit_of_Observation}} The master data set indicates how many individuals are eligible for data collection, From 653727c585d7927ce6320397ffd38e25a0fe55b7 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 10:43:51 -0500 Subject: [PATCH 293/854] Don't set seed in master --- chapters/sampling-randomization-power.tex | 2 ++ 1 file changed, 2 insertions(+) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 40f9fb022..c43a414ce 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -132,6 +132,8 @@ \subsection{Reproducibility in random Stata processes} There are many more seeds possible but this is a large enough set for most purposes. In Stata, \texttt{set seed [seed]} will set the generator to that state. You should use exactly one seed per randomization process. +To be clear: you should not set a single seed once in the master do-file, +but instead you should set one in code right before each random process. The most important thing is that each of these seeds is truly random, so do not use shortcuts such as the current date or a seed you have used before. You will see in the code below that we include the source and timestamp for verification. From 9a86c144e3ed5b92bea6a3aa629ecbf2dce90d8e Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 10:53:15 -0500 Subject: [PATCH 294/854] StackExchange is better than citing Wikipedia --- bibliography.bib | 9 +++++++++ chapters/sampling-randomization-power.tex | 2 +- 2 files changed, 10 insertions(+), 1 deletion(-) diff --git a/bibliography.bib b/bibliography.bib index 3a2272318..0f0f16935 100644 --- a/bibliography.bib +++ b/bibliography.bib @@ -1,3 +1,12 @@ +@MISC {88491, + TITLE = {What is meant by the standard error of a maximum likelihood estimate?}, + AUTHOR = {{Alecos Papadopoulos (\url{https://stats.stackexchange.com/users/28746/alecos-papadopoulos})}}, + HOWPUBLISHED = {Cross Validated}, + NOTE = {\url{https://stats.stackexchange.com/q/88491} (version: 2014-03-04)}, + EPRINT = {https://stats.stackexchange.com/q/88491}, + URL = {https://stats.stackexchange.com/q/88491} +} + @article{blischak2016quick, title={A quick introduction to version control with {Git} and {GitHub}}, author={Blischak, John D and Davenport, Emily R and Wilson, Greg}, diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index c43a414ce..af34984ca 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -167,7 +167,7 @@ \section{Sampling and randomization} In reality, you have to work with exactly one of them, so we put a lot of effort into making sure that one is a good one by reducing the probability that we observe nonexistent, or ``spurious'', results. -In large studies, we can use what are called \textbf{asymptotic standard errors} +In large studies, we can use what are called \textbf{asymptotic standard errors}\cite{88491} to express how far away from the true population parameters our estimates are likely to be. 
These standard errors can be calculated with only two datapoints: the sample size and the standard deviation of the value in the chosen sample. From 1a40286e345647411dd07448c7e60fdbe2d660ea Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 10:55:35 -0500 Subject: [PATCH 295/854] Don't re-randomize --- chapters/sampling-randomization-power.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index af34984ca..30b1574eb 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -273,7 +273,7 @@ \section{Clustering and stratification} They allow us to control the randomization process with high precision, which is often necessary for appropriate inference, particularly when samples or subgroups are small.\cite{athey2017econometrics} -(By contrast, re-randomizing or resampling are never appropriate for this.) +(By contrast, re-randomizing or resampling are never appropriate for this.\cite{bruhn2009pursuit}) These techniques can be used in any random process; their implementation is nearly identical in both sampling and randomization. From 7496599be200a565ffae8a3b8b88b36acbd6febd Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 10:57:12 -0500 Subject: [PATCH 296/854] Cluster --- chapters/sampling-randomization-power.tex | 10 +++------- 1 file changed, 3 insertions(+), 7 deletions(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 30b1574eb..907572472 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -292,13 +292,9 @@ \subsection{Clustering} Clustering is procedurally straightforward in Stata, although it typically needs to be performed manually. -To cluster sampling or randomization, -\texttt{preserve} the data, keep one observation from each cluster -using a command like \texttt{bys [cluster] : keep if \_n == 1}. -Then sort the data and set the seed, and generate the random assignment you need. -Save the assignment in a separate dataset or a \texttt{tempfile}, -then \texttt{restore} and \texttt{merge} the assignment back on to the original dataset. - +To cluster a sampling or randomization, +create or use a data set where each cluster unit is an observation, +randomize on that data set, and then merge back the results. When sampling or randomization is conducted using clusters, the clustering variable should be clearly identified since it will need to be used in subsequent statistical analysis. From b4bb5bd1a0bea30ff852d867429eea33c9e46736 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 10:59:39 -0500 Subject: [PATCH 297/854] Power power power --- chapters/sampling-randomization-power.tex | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 907572472..fdc5058af 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -412,8 +412,12 @@ \subsection{Power calculations} If, in your field, a ``large'' effect is just a few percentage points or a fraction of a standard deviation, then it is nonsensical to run a study whose MDE is much larger than that. -Conversely, the \textbf{minimum sample size} pre-specifies expected effects -and tells you how large a study's sample would need to be to detect that effect. 
+This is because, given the sample size and variation in the population, +the effect needs to be much larger to possibly be statistically detected, +so such a study would not be able to say anything about the effect size that is practically relevant. +Conversely, the \textbf{minimum sample size} pre-specifies expected effect sizes +and tells you how large a study's sample would need to be to detect that effect, +which can tell you what resources you would need to avoid that exact problem. Stata has some commands that can calculate power analytically for very simple designs -- \texttt{power} and \texttt{clustersampsi} -- From 882532318a6b7082310db847c304b14db94c6f9e Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 11:00:37 -0500 Subject: [PATCH 298/854] Research design --- chapters/sampling-randomization-power.tex | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index fdc5058af..242e99e20 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -1,7 +1,7 @@ %----------------------------------------------------------------------------------------------- \begin{fullwidth} -Sampling and randomization are two core elements of study design. +Sampling and randomization are two core elements of research design. In experimental methods, sampling and randomization directly determine the set of individuals who are going to be observed and what their status will be for the purpose of effect estimation. @@ -211,7 +211,7 @@ \subsection{Sampling} deciding what population, if any, a sample is meant to represent (including subgroups); and deciding that different individuals should have different probabilities of being included in the sample. -These should be determined in advance by the study design, +These should be determined in advance by the research design, since otherwise the sampling process will not be clear, and the interpretation of measurements is directly linked to who is included in them. Often, data collection can be designed to keep complications to a minimum, @@ -307,7 +307,7 @@ \subsection{Clustering} \subsection{Stratification} -\textbf{Stratification} is a study design component +\textbf{Stratification} is a research design component that breaks the full set of observations into a number of subgroups before performing randomization within each subgroup.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Stratified_Random_Sample}} @@ -407,7 +407,7 @@ \subsection{Power calculations} of that definition that give actionable, quantitative results. The \textbf{minimum detectable effect (MDE)}\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Minimum_Detectable_Effect}} -is the smallest true effect that a given study design can detect. +is the smallest true effect that a given research design can detect. This is useful as a check on whether a study is worthwhile. 
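As one concrete version of that check, Stata's built-in \texttt{power} command reports the required sample size for a hypothesized effect; the numbers below are purely illustrative.

    * Sample size per arm needed to detect a 0.25 standard-deviation effect
    * with 80 percent power at the 5 percent significance level
    power twomeans 0 0.25, sd(1) power(0.8) alpha(0.05)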
If, in your field, a ``large'' effect is just a few percentage points or a fraction of a standard deviation, From 2642e6f3215d97161d9025de336fff4e5b2fe2c2 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 11:03:52 -0500 Subject: [PATCH 299/854] Sampling = randomization --- chapters/sampling-randomization-power.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 242e99e20..c86b77977 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -222,7 +222,7 @@ \subsection{Sampling} \subsection{Randomization} \textbf{Randomization}, in this context, is the process of assigning units into treatment arms. -Most of the Stata commands used for sampling can be directly transferred to randomization, +Most of the code processses used for sampling are the same as those used for randomization, since randomization is also a process of splitting a sample into groups. Where sampling determines whether a particular individual will be observed at all in the course of data collection, From d274508a9adc0537b0a6f2fe0dda83385439ab95 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 11:05:33 -0500 Subject: [PATCH 300/854] T/C State --- chapters/sampling-randomization-power.tex | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index c86b77977..8b85e591e 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -226,7 +226,8 @@ \subsection{Randomization} since randomization is also a process of splitting a sample into groups. Where sampling determines whether a particular individual will be observed at all in the course of data collection, -randomization determines what state each individual will be observed in. +randomization determines if each individual will be observed +as a treatment unit or used as a counterfactual. Randomizing a treatment guarantees that, \textit{on average}, the treatment will not be correlated with anything it did not cause.\cite{duflo2007using} Causal inference from randomization therefore depends on a specific counterfactual: From 631fa1ea504c137bb2739425d21e100e14f07365 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 11:06:24 -0500 Subject: [PATCH 301/854] Correlation --- chapters/sampling-randomization-power.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 8b85e591e..ed45c5c2e 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -229,7 +229,7 @@ \subsection{Randomization} randomization determines if each individual will be observed as a treatment unit or used as a counterfactual. Randomizing a treatment guarantees that, \textit{on average}, -the treatment will not be correlated with anything it did not cause.\cite{duflo2007using} +the treatment will not be correlated with anything but the results of that treatment.\cite{duflo2007using} Causal inference from randomization therefore depends on a specific counterfactual: that the units who received the treatment program might not have done so. 
Therefore, controlling the exact probability that each individual receives treatment From f7a79046a8d4623210b8804617979c5ae66de7fc Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 11:07:35 -0500 Subject: [PATCH 302/854] T/C state 2 --- chapters/sampling-randomization-power.tex | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index ed45c5c2e..6c934276d 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -231,7 +231,8 @@ \subsection{Randomization} Randomizing a treatment guarantees that, \textit{on average}, the treatment will not be correlated with anything but the results of that treatment.\cite{duflo2007using} Causal inference from randomization therefore depends on a specific counterfactual: -that the units who received the treatment program might not have done so. +that the units who received the treatment program +could just as well have been randomized into the control group. Therefore, controlling the exact probability that each individual receives treatment is the most important part of a randomization process, and must be carefully worked out in more complex designs. From b48a5c0dbe019bbc1d70b3cabe33fe938fecabc9 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 11:08:08 -0500 Subject: [PATCH 303/854] Simulations --- chapters/sampling-randomization-power.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 6c934276d..9e79223be 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -138,7 +138,7 @@ \subsection{Reproducibility in random Stata processes} so do not use shortcuts such as the current date or a seed you have used before. You will see in the code below that we include the source and timestamp for verification. Any process that includes a random component is a random process, -including sampling, randomization, power calculation, and algorithms like bootstrapping. +including sampling, randomization, power calculation simulations, and algorithms like bootstrapping. Other commands may induce randomness in the data or alter the seed without you realizing it, so carefully confirm exactly how your code runs before finalizing it.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Randomization_in_Stata}} From a06f49ada8358d88e52a450f27839ac705ae78c1 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 11:10:07 -0500 Subject: [PATCH 304/854] Code and data --- chapters/publication.tex | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 6416c36bb..51636c7aa 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -22,7 +22,7 @@ and better understand the results you have obtained. Holding code and data to the same standards as written work is a new discipline for many researchers, -and here we provide some basic guidelines and responsibilities for both +and here we provide some basic guidelines and responsibilities for that process that will help you prepare a functioning and informative replication package. 
In all cases, we note that technology is rapidly evolving and that the specific tools noted here may not remain cutting-edge, @@ -227,7 +227,7 @@ \subsection{Getting started with \LaTeX\ via Overleaf} On the downside, there is a small amount of up-front learning required, continous access to the Internet is necessary, and updating figures and tables requires a bulk file upload that is tough to automate. -One of the most common issues you will face using Overleaf's ``rich text'' +One of the most common issues you will face using Overleaf's ``rich text'' editor will be special characters which, because of code functions, need to be handled differently than in Word. Most critically, the ampersand (\texttt{\&}), percent (\texttt{\%}), and underscore (\texttt{\_}) @@ -366,7 +366,7 @@ \subsection{Publishing code for replication} such as ensuring that the raw components of figures or tables are clearly identified. For example, outputs should clearly correspond by name to an exhibit in the paper, and vice versa. (Supplying a compiling \LaTeX\ document can support this.) -Code and outputs which are not used should be removed -- +Code and outputs which are not used should be removed -- if you are using GitHub, consider making them available in a different branch for transparency. \subsection{Releasing a replication package} From 3c0f1d829fd03488d3a371e6ac6c0975faf82aae Mon Sep 17 00:00:00 2001 From: Maria Date: Tue, 21 Jan 2020 11:12:24 -0500 Subject: [PATCH 305/854] Update chapters/data-collection.tex Co-Authored-By: Luiza Andrade --- chapters/data-collection.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 873bbe436..36e7d232b 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -40,7 +40,7 @@ \section{Designing electronic questionnaires} The first step of questionnaire design is to list key outcomes of interest, as well as the main factors to control for (covariates) and variables needed for experimental design. The ideal starting point for this is a \textbf{pre-analysis plan}.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Pre-Analysis_Plan}} -Use the list of key outcomes to create an outline of questionnaire \textit{modules} (do not number the modules yet; instead use a short prefix so they can be easily reordered). For each module, determine if the module is applicable to the full sample, the appropriate respondent, and whether or how often, the module should be repeated. A few examples: a module on maternal health only applies to household with a woman who has children, a household income module should be answered by the person responsible for household finances, and a module on agricultural production might be repeated for each crop the household cultivated. +Use the list of key outcomes to create an outline of questionnaire \textit{modules} (do not number the modules yet; instead use a short prefix so they can be easily reordered). For each module, determine if it is applicable to the full sample, who is the appropriate respondent, and whether or how often the module should be repeated. A few examples: a module on maternal health only applies to households with a woman who has children, a household income module should be answered by the person responsible for household finances, and a module on agricultural production might be repeated for each crop the household cultivated. Each module should then be expanded into specific indicators to observe in the field. 
\sidenote{\url{https://dimewiki.worldbank.org/wiki/Literature_Review_for_Questionnaire}} From f162fdc4e5fcc92b5e094cac33fb670c67cfef8a Mon Sep 17 00:00:00 2001 From: Maria Date: Tue, 21 Jan 2020 11:12:35 -0500 Subject: [PATCH 306/854] Update chapters/data-collection.tex Co-Authored-By: Luiza Andrade --- chapters/data-collection.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 36e7d232b..d549dd668 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -43,7 +43,7 @@ \section{Designing electronic questionnaires} Use the list of key outcomes to create an outline of questionnaire \textit{modules} (do not number the modules yet; instead use a short prefix so they can be easily reordered). For each module, determine if it is applicable to the full sample, who is the appropriate respondent, and whether or how often the module should be repeated. A few examples: a module on maternal health only applies to households with a woman who has children, a household income module should be answered by the person responsible for household finances, and a module on agricultural production might be repeated for each crop the household cultivated. Each module should then be expanded into specific indicators to observe in the field. -\sidenote{\url{https://dimewiki.worldbank.org/wiki/Literature_Review_for_Questionnaire}} +Each module should then be expanded into specific indicators to observe in the field.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Literature_Review_for_Questionnaire}} At this point, it is useful to do a \textbf{content-focused pilot} \sidenote{\url{https://dimewiki.worldbank.org/wiki/Piloting_Survey_Content}} of the questionnaire. Doing this pilot with a pen-and-paper questionnaire encourages more significant revisions, as there is no need to factor in costs of re-programming, and as a result improves the overall quality of the survey instrument. From 72c31646a309215286f67e5e51ffcc0a56c661da Mon Sep 17 00:00:00 2001 From: Maria Date: Tue, 21 Jan 2020 11:13:29 -0500 Subject: [PATCH 307/854] Update chapters/data-collection.tex Co-Authored-By: Luiza Andrade --- chapters/data-collection.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index d549dd668..b7b26648d 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -226,7 +226,7 @@ \subsection{Dashboard} \section{Collecting Data Securely} Primary data collection almost always includes \textbf{personally-identifiable information (PII)} -\sidenote{\url{https://dimewiki.worldbank.org/wiki/Personally_Identifiable_Information_(PII)}}. +Primary data collection almost always includes \textbf{personally-identifiable information (PII)}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Personally_Identifiable_Information_(PII)}}. PII must be handled with great care at all points in the data collection and management process, to comply with ethical requirements and avoid breaches of confidentiality. Access to PII must be restricted to team members granted that permission by the applicable Institutional Review Board or a data licensing agreement with a partner agency. Research teams must maintain strict protocols for data security at each stage of the process: data collection, storage, and sharing. 
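One small piece of such a protocol is keeping direct identifiers out of the working data entirely; a sketch of that step is below, where the folder globals and the list of identifying variables are hypothetical and will differ by project.

    * Load the raw file from its encrypted location, then strip direct identifiers
    use "${encrypted}/raw_survey_data.dta", clear
    drop respondent_name phone_number gps_latitude gps_longitude

    * The de-identified copy can be stored and shared within the team more freely
    save "${intermediate}/survey_deidentified.dta", replace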
\subsection{Secure data in the field} From 79aa4bfe7efeb0fbdfba1da61c9a1a78a22429c3 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 11:16:47 -0500 Subject: [PATCH 308/854] NOverleaf --- chapters/publication.tex | 38 +++++++++++++++++++------------------- 1 file changed, 19 insertions(+), 19 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 51636c7aa..a7b88d2bb 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -196,46 +196,46 @@ \subsection{Technical writing with \LaTeX} and use external tools like Word's compare feature to generate integrated tracked versions when needed. -\subsection{Getting started with \LaTeX\ via Overleaf} +\subsection{Getting started with \LaTeX\ in the cloud} \LaTeX\ is a challenging tool to get started using, but the control it offers over the writing process is invaluable. In order to make it as easy as possible for your team to use \LaTeX\ without all members having to invest in new skills, -we suggest using the web-based Overleaf implementation as your first foray into \LaTeX\ writing.\sidenote{ - \url{https://www.overleaf.com}} -While the Overleaf site has a subscription feature that offers some useful extensions, -its free-to-use version offers basic tools that are sufficient -for a broad variety of basic applications, +we suggest using a web-based implementation as your first foray into \LaTeX\ writing. +Most such sites offer a subscription feature with useful extensions and various sharing permissions, +and some offer free-to-use versions with basic tools that are sufficient +for a broad variety of applications, up to and including writing a complete academic paper with coauthors. -Overleaf's implementation of \LaTeX\ is suggested here for several reasons. -Since it is completely hosted online, -it avoids the inevitable troubleshooting of setting up a \LaTeX\ installation +Cloud-based implementations of \LaTeX\ are suggested here for several reasons. +Since they are completely hosted online, +they avoids the inevitable troubleshooting of setting up a \LaTeX\ installation on various personal computers run by the different members of your team. -It also automatically maintains a single master copy of the document +They also typically maintain a single continuously synced master copy of the document so that different writers do not create conflicted or out-of-sync copies, -and allows inviting collaborators to edit in a fashion similar to Google Docs. -Overleaf also offers a basic version history tool that avoids having to use separate software. -Most importantly, it provides a ``rich text'' editor +or need to deal with Git themselves to maintain that sync. +They typically allow inviting collaborators to edit in a fashion similar to Google Docs, +though different services vary the number of collaborators and documents allowed at each tier. +Most importantly, some tools provide a ``rich text'' editor that behaves pretty similarly to familiar tools like Word, -so that people can write into the document without worrying too much +so that collaborators can write text directly into the document without worrying too much about the underlying \LaTeX\ coding. -Overleaf also offers a convenient selection of templates -so it is easy to start up a project and see results right away. +Cloud services also usually offer a convenient selection of templates +so it is easy to start up a project and see results right away +without needing to know a lot of the code that controls document formatting. 
On the downside, there is a small amount of up-front learning required, continous access to the Internet is necessary, and updating figures and tables requires a bulk file upload that is tough to automate. -One of the most common issues you will face using Overleaf's ``rich text'' -editor will be special characters +One of the most common issues you will face using online editors will be special characters which, because of code functions, need to be handled differently than in Word. Most critically, the ampersand (\texttt{\&}), percent (\texttt{\%}), and underscore (\texttt{\_}) need to be ``escaped'' (interpreted as text and not code) in order to render. This is done by by writing a backslash (\texttt{\textbackslash}) before them, such as writing \texttt{40\textbackslash\%} for the percent sign to appear in text. Despite this, we believe that with minimal learning and workflow adjustments, -Overleaf is often the easiest way to allow coauthors to write and edit in \LaTeX\, +cloud-based implementations are often the easiest way to allow coauthors to write and edit in \LaTeX\, so long as you make sure you are available to troubleshoot minor issues like these. %------------------------------------------------ From 97b2ccdf913732122147831237a6f66495a54920 Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 21 Jan 2020 11:18:06 -0500 Subject: [PATCH 309/854] [ch6] some more of Kris' reviews addressed --- chapters/data-analysis.tex | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 9bf055430..b46bd20f3 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -6,11 +6,11 @@ and statistical and econometric knowledge. The process of data analysis is, therefore, a back-and-forth discussion between people -with differing experiences, perspectives, and research interests. +with differing skill sets. The research assistant usually ends up being the pivot of this discussion. It is their job to translate the data received from the field into economically meaningful indicators and to analyze them -while making sure that code and outputs do not become tangled and lost over time. +while making sure that code and outputs do not become too difficult to follow or get lost over time. When it comes to code, though, analysis is the easy part, \textit{as long as you have organized your data well}. @@ -200,7 +200,7 @@ \section{Data cleaning} such as renaming, relabeling, and value labeling, much easier.\sidenote{\url{https://dimewiki.worldbank.org/wiki/iecodebook}} \index{iecodebook} We have a few recommendations on how to use this command for data cleaning. -First, we suggest keeping the same variable names as in the survey instrument, so it's straightforward to link data points for a variable to the question that originated them. +First, we suggest keeping the same variable names in the cleaned data set as in the survey instrument, so it's straightforward to link data points for a variable to the question that originated them. Second, don't skip the labeling. Applying labels makes it easier to understand what the data is showing while exploring the data. This minimizes the risk of small errors making their way through into the analysis stage. @@ -342,7 +342,7 @@ \section{Indicator construction} Carefully record how specific variables have been combined, recoded, and scaled. This can be part of a wider discussion with your team about creating protocols for variable definition. 
That will guarantee that indicators are defined consistently across projects. -Documentation is an output of construction as relevant as the codes. +Documentation is an output of construction as relevant as the code and the data. Someone unfamiliar with the project should be able to understand the contents of the analysis data sets, the steps taken to create them, and the decision-making process through your documentation. The construction documentation will complement the reports and notes created during data cleaning. Together, they will form a detailed account of the data processing. @@ -374,7 +374,7 @@ \section{Writing data analysis code} % Organizing scripts --------------------------------------------------------- During exploratory data analysis, you will be tempted to write lots of analysis into one big, impressive, start-to-finish script. Although it's fine to write such a script if you are coding in real-time during a long analysis meeting with your PIs, this practice is error-prone. -It subtly encourages poor practices such as not clearing the workspace and not loading fresh data for each analysis task. +It subtly encourages poor practices such as not clearing the workspace and not reloading the constructed data set before each analysis task. It's important to take the time to organize scripts in a clean manner and to avoid mistakes. A well-organized analysis script starts with a completely fresh workspace and explicitly loads data before analyzing it. @@ -389,7 +389,8 @@ \section{Writing data analysis code} This includes clustering, sampling, and control variables, to name a few. If you have multiple analysis data sets, each of them should have a descriptive name about its sample and unit of observation. As your team comes to a decision about model specification, you can create globals or objects in the master script to use across scripts. -This is a good way to make sure specifications are consistent throughout the analysis. It's also very dynamic, making it easy to update all scripts if needed. +This is a good way to make sure specifications are consistent throughout the analysis. +Using pre-specified globals or objects also makes your code more dynamic, so it is easy to update specifications and results without changing every script. It is completely acceptable to have folders for each task, and compartmentalize each analysis as much as needed. It is always better to have more code files open than to keep scrolling inside a given file. From ffed54381a980a7a435af263d597e35ac9637c46 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 11:18:19 -0500 Subject: [PATCH 310/854] GitHub is free when public --- chapters/publication.tex | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index a7b88d2bb..19aeaf378 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -381,8 +381,7 @@ \subsection{Releasing a replication package} the specific solutions we mention here highlight some current approaches as well as their strengths and weaknesses. GitHub provides one solution. -Making your GitHub repository public -is completely free for finalized projects. +Making a GitHub repository public is completely free. It can hold any file types, provide a structured download of your whole project, and allow others to look at alternate versions or histories easily. 
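A minimal sketch of the specification-globals pattern mentioned above, as it might appear in a master do-file and one analysis script (all names are hypothetical):

    * master.do: research decisions written down once and reused everywhere
    global controls   "age hh_size baseline_income"
    global se_cluster "village_id"

    * an analysis script then refers only to the globals
    use "${data}/analysis_household.dta", clear
    regress outcome treatment ${controls}, vce(cluster ${se_cluster})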
From cd278a7d7aaf2666da6dd77901f8d556fff53c55 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 11:19:11 -0500 Subject: [PATCH 311/854] gotta submit --- chapters/publication.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 19aeaf378..07d2766a6 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -1,7 +1,7 @@ %------------------------------------------------ \begin{fullwidth} -Publishing academic research today extends well beyond writing up a Word document alone. +Publishing academic research today extends well beyond writing up and submitting a Word document alone. There are often various contributors making specialized inputs to a single output, a large number of iterations, versions, and revisions, and a wide variety of raw materials and results to be published together. From 84de121ad8aa82f5a5344a30914f7143af4a5178 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 11:19:54 -0500 Subject: [PATCH 312/854] Don't be vague --- chapters/publication.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 07d2766a6..ef309f375 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -51,7 +51,7 @@ \section{Collaborating on technical writing} \subsection{Dynamic documents} -Dynamic documents are a broad class of tools that enable such a workflow. +Dynamic documents are a broad class of tools that enable a streamlined, reproducible workflow. The term ``dynamic'' can refer to any document-creation technology that allows the creation of explicit references to raw output files. This means that, whenever outputs are updated, From 7c73dca767980b73b4e6e66fb905e3cb05c720ec Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 11:20:40 -0500 Subject: [PATCH 313/854] Explicitly encoded linkages --- chapters/publication.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index ef309f375..cca6b8118 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -53,7 +53,7 @@ \subsection{Dynamic documents} Dynamic documents are a broad class of tools that enable a streamlined, reproducible workflow. The term ``dynamic'' can refer to any document-creation technology -that allows the creation of explicit references to raw output files. +that allows the inclusion of explicitly encoded linkages to raw output files. This means that, whenever outputs are updated, the next iteration of the document will automatically include all changes made to all outputs without any additional intervention from the writer. From 5132ab9aaa981edf2f0a1bd33055a2a7ea961adb Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 11:21:50 -0500 Subject: [PATCH 314/854] Simplify --- chapters/publication.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index cca6b8118..a21c9c27e 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -55,7 +55,7 @@ \subsection{Dynamic documents} The term ``dynamic'' can refer to any document-creation technology that allows the inclusion of explicitly encoded linkages to raw output files. 
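To make the idea of an explicitly encoded linkage concrete, here is a minimal sketch of what such a \LaTeX\ document might contain; the file names \texttt{results/table1.tex} and \texttt{results/figure1.pdf} are hypothetical placeholders for whatever files your analysis code exports, and the sketch assumes those files exist when the document is compiled:

\begin{verbatim}
\documentclass{article}
\usepackage{graphicx}   % required for \includegraphics
\usepackage{booktabs}   % rules used by most exported regression tables
\begin{document}

% The table is read from disk each time the document is compiled,
% so re-running the analysis updates the manuscript with no copying.
\begin{table}
  \centering
  \caption{Main results}
  \input{results/table1.tex}
\end{table}

% The same linkage applies to figures exported by the statistical software.
\begin{figure}
  \centering
  \includegraphics[width=0.8\textwidth]{results/figure1.pdf}
  \caption{Outcomes by group}
\end{figure}

\end{document}
\end{verbatim}

Nothing here is pasted by hand: the \texttt{\textbackslash input} and \texttt{\textbackslash includegraphics} commands point directly at the raw output files.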
This means that, whenever outputs are updated, -the next iteration of the document will automatically include +the next time the document is loaded or compiled, it will automatically include all changes made to all outputs without any additional intervention from the writer. This means that updates will never be accidentally excluded, and it further means that updating results will not become more difficult From de4dd2b9815c4b9d78acab05b72be2cdbceb2eeb Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 11:22:20 -0500 Subject: [PATCH 315/854] user --- chapters/publication.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index a21c9c27e..457e18790 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -56,7 +56,7 @@ \subsection{Dynamic documents} that allows the inclusion of explicitly encoded linkages to raw output files. This means that, whenever outputs are updated, the next time the document is loaded or compiled, it will automatically include -all changes made to all outputs without any additional intervention from the writer. +all changes made to all outputs without any additional intervention from the user. This means that updates will never be accidentally excluded, and it further means that updating results will not become more difficult as the number of inputs grows, From b1b37feacde9a65adb48c2bba7fb9f130f009fd5 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 11:23:51 -0500 Subject: [PATCH 316/854] Office and dyndoc etc --- chapters/publication.tex | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 457e18790..0888a0244 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -62,9 +62,11 @@ \subsection{Dynamic documents} as the number of inputs grows, because they are all managed by a single integrated process. -You will note that this is not possible in tools like Microsoft Office. -In Word, for example, you have to copy and paste each object individually -whenever tables, graphs or other inputs have to be updated. +You will note that this is not possible in tools like Microsoft Office, +although there are various tools and add-ons that produce similar functionality, +and we will introduce some later in this book. +In Word, by default, you have to copy and paste each object individually +whenever tables, graphs, or other inputs have to be updated. This means that both the features above are not available: fully updating the document becomes more and more time-consuming as the number of inputs increases, From 859e01a1df175ff58612255636d77c96422a5836 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 11:25:54 -0500 Subject: [PATCH 317/854] Inefficiency issues --- chapters/publication.tex | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 0888a0244..9f8706b09 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -67,10 +67,9 @@ \subsection{Dynamic documents} and we will introduce some later in this book. In Word, by default, you have to copy and paste each object individually whenever tables, graphs, or other inputs have to be updated. 
-This means that both the features above are not available: -fully updating the document becomes more and more time-consuming -as the number of inputs increases, -and it therefore becomes more and more likely +This creates complex inefficiency: updates may be accidentally excluded +and ensuring they are not will become more difficult as the document grows. +As time goes on, it therefore becomes more and more likely that a mistake will be made or something will be missed. Furthermore, it is very hard to simultaneously edit or track changes in a Microsoft Word document. From 2820ec58fc7479c200b72c4e87f6fbaa4ebb11b1 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 11:26:36 -0500 Subject: [PATCH 318/854] X-men: first class --- chapters/publication.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 9f8706b09..4ca5de770 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -98,7 +98,7 @@ \subsection{Dynamic documents} One very simple one is Dropbox Paper, a free online writing tool that allows linkages to files in Dropbox, which are then automatically updated anytime the file is replaced. -Like the first class of tools, Dropbox Paper has very limited formatting options, +Dropbox Paper has very limited formatting options, but it is appropriate for working with collaborators who are not using statistical software. However, the most widely utilized software for dynamically managing both text and results is \LaTeX\ (pronounced ``lah-tek'').\sidenote{ From 5b7bcb2f15cedf9550b76682e7f428e30820ed8a Mon Sep 17 00:00:00 2001 From: Maria Date: Tue, 21 Jan 2020 11:27:14 -0500 Subject: [PATCH 319/854] Ch 5 re-write Restructed Ch 5, added 'survey development workflow' section as per #310 --- chapters/data-collection.tex | 51 ++++++++++++++++++------------------ 1 file changed, 26 insertions(+), 25 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index b7b26648d..af05dd567 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -13,8 +13,7 @@ \end{fullwidth} %------------------------------------------------ - -\section{Designing electronic questionnaires} +\section{Survey development workflow} A well-designed questionnaire results from careful planning, consideration of analysis and indicators, close review of existing questionnaires, survey pilots, and research team and stakeholder review. There are many excellent resources on questionnaire design, such as from the World Bank's Living Standards Measurement Survey. \sidenote{Grosh, Margaret; Glewwe, Paul. 2000. Designing Household Survey Questionnaires for Developing Countries : Lessons from 15 Years of the Living Standards Measurement Study, Volume 3. Washington, DC: World Bank. © World Bank.\url{https://openknowledge.worldbank.org/handle/10986/15195 License: CC BY 3.0 IGO.}} The focus of this chapter is the particular design challenges for electronic surveys (often referred to as Computer Assisted Personal Interviews (CAPI)). @@ -33,6 +32,22 @@ \section{Designing electronic questionnaires} Finalizing the questionnaire before programming also avoids version control concerns that arise from concurrent work on paper and electronic survey instruments. Finally, a paper questionnaire is an important documentation for data publication. 
+ +\subsection{Content-focused Pilot} +A \textbf{survey pilot}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Survey_Pilot}} is essential to finalize questionnaire design. +A content-focused pilot\sidenote{\url{https://dimewiki.worldbank.org/wiki/Piloting_Survey_Content}} is best done on pen-and-paper, before the questionnaire is programmed. +The objective is to improve the structure and length of the questionnaire, refine the phrasing and translation of specific questions, and confirm coded response options are exhaustive.\sidenote{\url{https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Content)&printable=yes}} +In addition, it is an opportunity to test and refine all survey protocols, such as how units will be sampled or pre-selected units identified. The pilot must be done out-of-sample, but in a context as similar as possible to the study sample. + +\subsection{Data-focused pilot} +A second survey pilot should be done after the questionnaire is programmed. +The objective of the data-focused pilot\sidenote{\url{https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Data)&printable=yes}} is to validate programming and export a sample dataset. +Significant desk-testing of the instrument is required to debug the programming as fully as possible before going to the field. +It is important to plan for multiple days of piloting, so that any further debugging or other revisions to the electronic survey instrument can be made at the end of each day and tested the following, until no further field errors arise. +The data-focused pilot should be done in advance of enumerator training + + +\section{Designing electronic questionnaires} The workflow for designing a questionnaire will feel much like writing an essay, or writing pseudocode: begin from broad concepts and slowly flesh out the specifics. It is essential to start with a clear understanding of the @@ -40,14 +55,15 @@ \section{Designing electronic questionnaires} The first step of questionnaire design is to list key outcomes of interest, as well as the main factors to control for (covariates) and variables needed for experimental design. The ideal starting point for this is a \textbf{pre-analysis plan}.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Pre-Analysis_Plan}} -Use the list of key outcomes to create an outline of questionnaire \textit{modules} (do not number the modules yet; instead use a short prefix so they can be easily reordered). For each module, determine if it is applicable to the full sample, who is the appropriate respondent, and whether or how often the module should be repeated. A few examples: a module on maternal health only applies to households with a woman who has children, a household income module should be answered by the person responsible for household finances, and a module on agricultural production might be repeated for each crop the household cultivated. +Use the list of key outcomes to create an outline of questionnaire \textit{modules} (do not number the modules yet; instead use a short prefix so they can be easily reordered). For each module, determine if the module is applicable to the full sample, the appropriate respondent, and whether or how often, the module should be repeated. 
A few examples: a module on maternal health only applies to household with a woman who has children, a household income module should be answered by the person responsible for household finances, and a module on agricultural production might be repeated for each crop the household cultivated. Each module should then be expanded into specific indicators to observe in the field. -Each module should then be expanded into specific indicators to observe in the field.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Literature_Review_for_Questionnaire}} +\sidenote{\url{https://dimewiki.worldbank.org/wiki/Literature_Review_for_Questionnaire}} At this point, it is useful to do a \textbf{content-focused pilot} \sidenote{\url{https://dimewiki.worldbank.org/wiki/Piloting_Survey_Content}} of the questionnaire. Doing this pilot with a pen-and-paper questionnaire encourages more significant revisions, as there is no need to factor in costs of re-programming, and as a result improves the overall quality of the survey instrument. + \subsection{Questionnaire design for quantitative analysis} This book covers surveys designed to yield datasets useful for quantitative analysis. This is a subset of surveys, and there are specific design considerations that will help to ensure the raw data outputs are ready for analysis. @@ -71,14 +87,6 @@ \subsection{Questionnaire design for quantitative analysis} These are essential data components for completing CONSORT records, a standardized system for reporting enrollment, intervention allocation, follow-up, and data analysis through the phases of a randomized trial of two groups. \sidenote[][-3.5cm]{Begg, C., Cho, M., Eastwood, S., Horton, R., Moher, D., Olkin, I., Pitkin, R., Rennie, D., Schulz, K. F., Simel, D., et al. (1996). Improving the quality of reporting of randomized controlled trials: The CONSORT statement. \textit{JAMA}, 276(8):637--639} - - -\subsection{Content-focused Pilot} -A \textbf{survey pilot}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Survey_Pilot}} is the final step of questionnaire design. -A content-focused pilot\sidenote{\url{https://dimewiki.worldbank.org/wiki/Piloting_Survey_Content}} is best done on pen-and-paper, before the questionnaire is programmed. -The objective is to improve the structure and length of the questionnaire, refine the phrasing and translation of specific questions, and confirm coded response options are exhaustive.\sidenote{\url{https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Content)&printable=yes}} -In addition, it is an opportunity to test and refine all survey protocols, such as how units will be sampled or pre-selected units identified. The pilot must be done out-of-sample, but in a context as similar as possible to the study sample. - Once the content of the questionnaire is finalized and translated, it is time to proceed with programming the electronic survey instrument. @@ -92,15 +100,16 @@ \section{Programming electronic questionnaires} Survey software tools provide a wide range of features designed to make implementing even highly complex surveys easy, scalable, and secure. However, these are not fully automatic: you still need to actively design and manage the survey. -Here, we discuss specific practices that you need to follow to take advantage of electronic survey features and ensure that the exported data is compatible with the software that will be used for analysis, and the importance of a data-focused pilot. 
+Here, we discuss specific practices that you need to follow to take advantage of electronic survey features and ensure that the exported data is compatible with the software that will be used for analysis. + -\subsection{Electronic survey workflow} -The starting point for questionnaire programming is a complete paper version of the questionnaire, piloted for content, and translated where needed. -Starting the programming at this point reduces version control issues that arise from making significant changes to concurrent paper and electronic survey instruments. +As discussed above, the starting point for questionnare programming is a complete paper version of the questionnaire, piloted for content, and translated where needed. +Doing so reduces version control issues that arise from making significant changes to concurrent paper and electronic survey instruments. When programming, do not start with the first question and proceed straight through to the last question. Instead, code from high level to small detail, following the same questionnaire outline established at design phase. The outline provides the basis for pseudocode, allowing you to start with high level structure and work down to the level of individual questions. This will save time and reduce errors. + \subsection{Electronic survey features} Electronic surveys are more than simply a paper questionnaire displayed on a mobile device or web browser. All common survey software allow you to automate survey logic and add in hard and soft constraints on survey responses. @@ -136,14 +145,6 @@ \subsection{Compatibility with analysis software} ranges are included for numeric variables. \texttt{ietestform} also removes all leading and trailing blanks from response lists, which could be handled inconsistently across software. -\subsection{Data-focused pilot} -The final stage of questionnaire programming is another survey pilot. -The objective of the data-focused pilot\sidenote{\url{https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Data)&printable=yes}} is to validate programming and export a sample dataset. -Significant desk-testing of the instrument is required to debug the programming as fully as possible before going to the field. -It is important to plan for multiple days of piloting, so that any further debugging or other revisions to the electronic survey instrument can be made at the end of each day and tested the following, until no further field errors arise. -The data-focused pilot should be done in advance of enumerator training - - %------------------------------------------------ \section{Data quality assurance} @@ -226,7 +227,7 @@ \subsection{Dashboard} \section{Collecting Data Securely} Primary data collection almost always includes \textbf{personally-identifiable information (PII)} -Primary data collection almost always includes \textbf{personally-identifiable information (PII)}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Personally_Identifiable_Information_(PII)}}. +\sidenote{\url{https://dimewiki.worldbank.org/wiki/Personally_Identifiable_Information_(PII)}}. PII must be handled with great care at all points in the data collection and management process, to comply with ethical requirements and avoid breaches of confidentiality. Access to PII must be restricted to team members granted that permission by the applicable Institutional Review Board or a data licensing agreement with a partner agency. 
Research teams must maintain strict protocols for data security at each stage of the process: data collection, storage, and sharing. \subsection{Secure data in the field} From a68c08b545843d1b958c92760a84fd6b011ac087 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 11:32:43 -0500 Subject: [PATCH 320/854] LaTeX --- chapters/publication.tex | 20 ++++++++++++-------- 1 file changed, 12 insertions(+), 8 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 4ca5de770..5c1b9802a 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -79,10 +79,7 @@ \subsection{Dynamic documents} Therefore this is a broadly unsuitable way to prepare technical documents. There are a number of tools that can be used for dynamic documents. -They fall into two broad groups -- -the first compiles a document as part of code execution, -and the second operates a separate document compiler. -In the first group are tools such as R's RMarkdown\sidenote{\url{https://rmarkdown.rstudio.com/}} +In the first group are code-based tools such as R's RMarkdown\sidenote{\url{https://rmarkdown.rstudio.com/}} and Stata's \texttt{dyndoc}\sidenote{\url{https://www.stata.com/manuals/rptdyndoc.pdf}}. These tools ``knit'' or ``weave'' text and code together, and are programmed to insert code outputs in pre-specified locations. @@ -92,23 +89,30 @@ \subsection{Dynamic documents} because they tend to offer limited editability outside the base software and often have limited abilities to incorporate precision formatting. -On the other hand, some dynamic document tools do not require -operation of any underlying software, but simply require +The second group of dynamic document tools do not require +direct operation of underlying code or software, but simply require that the writer have access to the updated outputs. One very simple one is Dropbox Paper, a free online writing tool that allows linkages to files in Dropbox, which are then automatically updated anytime the file is replaced. Dropbox Paper has very limited formatting options, but it is appropriate for working with collaborators who are not using statistical software. + However, the most widely utilized software for dynamically managing both text and results is \LaTeX\ (pronounced ``lah-tek'').\sidenote{ \url{https://www.maths.tcd.ie/~dwilkins/LaTeXPrimer/GSWLaTeX.pdf}} \index{\LaTeX} +Rather than using a coding language that is built for another purpose +or trying to hide the code entirely, +\LaTeX\ is a special code language designed for document preparation and typesetting. While this tool has a significant learning curve, its enormous flexibility in terms of operation, collaboration, and output formatting and styling -makes it the primary choice for most large technical outputs today. -In fact, \LaTeX\ operates behind-the-scenes in many of the tools listed in the first group. +makes it the primary choice for most large technical outputs today, +and it has proven to have enduring popularity. +In fact, \LaTeX\ operates behind-the-scenes in many of the tools listed before. +Therefore, we recommend that you learn to use \LaTeX\ directly +as soon as you are able to and provide several resources for doing so. 
\subsection{Technical writing with \LaTeX} From 39da66f7401d2214b0d83bb66ca8260ce14a28fd Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 11:34:04 -0500 Subject: [PATCH 321/854] Plain text --- chapters/publication.tex | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 5c1b9802a..e3348b0a7 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -131,7 +131,8 @@ \subsection{Technical writing with \LaTeX} with only a few keystrokes. In sum, \LaTeX\ enables automatically-organized documents, manages tables and figures dynamically, -and (because it is written in plain text) can be version-controlled using Git. +and because it is written in a plain text file format, +\texttt{.tex} can be version-controlled using Git. This is why it has become the dominant ``document preparation system'' in technical writing. Unfortunately, \LaTeX\ can be a challenge to set up and use at first, From 768c65dbd5782523bcf5ab8a1864d08424e25d36 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 11:36:14 -0500 Subject: [PATCH 322/854] Cite --- chapters/publication.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index e3348b0a7..1a70f6898 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -159,7 +159,7 @@ \subsection{Technical writing with \LaTeX} One of the most important tools available in \LaTeX\ is the BibTeX bibliography manager.\sidenote{ \url{http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.365.3194&rep=rep1&type=pdf}} BibTeX keeps all the references you might use in an auxiliary file, -then references them as plain text in the document using a \LaTeX\ command. +then references them using a simple element typed directly in the document: a \texttt{cite} command. The same principles that apply to figures and tables are therefore applied here: You can make changes to the references in one place (the \texttt{.bib} file), and then everywhere they are used they are updated correctly with one process. From c24cc8d2b083ec6d45b21ced288532305b253650 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 11:42:32 -0500 Subject: [PATCH 323/854] Extraneous --- chapters/publication.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 1a70f6898..9c7a4a45f 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -264,7 +264,7 @@ \section{Preparing a complete replication package} all necessary de-identified data for the analysis, and all code necessary for the analysis. The code should exactly reproduce the raw outputs you have used for the paper, -and should include no extraneous documentation or PII data you would not share publicly. +and should include no documentation or PII data you would not share publicly. 
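To illustrate the BibTeX citation workflow described a few paragraphs above, a minimal sketch might look like the following; the entry key \texttt{sample2020}, the file name \texttt{references.bib}, and the entry contents are hypothetical placeholders rather than a real citation:

\begin{verbatim}
% references.bib -- one entry per reference you might cite
@article{sample2020,
  author  = {Author, Anna and Coauthor, Bo},
  title   = {An Illustrative Article Title},
  journal = {Journal of Placeholder Studies},
  year    = {2020},
  volume  = {1},
  pages   = {1--10}
}

% main.tex -- cite the entry by its key
\documentclass{article}
\begin{document}
Prior work\cite{sample2020} documents similar effects.
\bibliographystyle{plain}
\bibliography{references}  % reads references.bib and builds the list
\end{document}
\end{verbatim}

Editing the entry once in \texttt{references.bib} updates the in-text citation and the reference list the next time the document is compiled with the usual \texttt{bibtex} step included.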
\subsection{Publishing data for replication}

From cb9f37a85e182dacddec0917ce32130ce9c426b1 Mon Sep 17 00:00:00 2001
From: Benjamin Daniels
Date: Tue, 21 Jan 2020 11:43:11 -0500
Subject: [PATCH 324/854] Accesss

---
 chapters/publication.tex | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/publication.tex b/chapters/publication.tex
index 9c7a4a45f..3818d5063 100644
--- a/chapters/publication.tex
+++ b/chapters/publication.tex
@@ -274,7 +274,7 @@ \subsection{Publishing data for replication}
to investigate what other results might be obtained from the same population,
and test alternative approaches to other questions.
Therefore you should make clear in your study
-where and how data are stored and how it might be accessed.
+where and how data are stored, and how and under what circumstances it might be accessed.
You do not have to publish data yourself,
although in many cases you will have the right to release
at least some subset of your analytical dataset.

From b73d2d0a46eda52016c1e3b9973cc2b6d484f69b Mon Sep 17 00:00:00 2001
From: Benjamin Daniels
Date: Tue, 21 Jan 2020 11:46:12 -0500
Subject: [PATCH 325/854] When you don't own data

---
 chapters/publication.tex | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/chapters/publication.tex b/chapters/publication.tex
index 3818d5063..7f0343a6b 100644
--- a/chapters/publication.tex
+++ b/chapters/publication.tex
@@ -275,9 +275,13 @@ \subsection{Publishing data for replication}
and test alternative approaches to other questions.
Therefore you should make clear in your study
where and how data are stored, and how and under what circumstances it might be accessed.
-You do not have to publish data yourself,
-although in many cases you will have the right to release
-at least some subset of your analytical dataset.
+You do not always have to complete the data publication yourself,
+as long as you cite or otherwise directly reference data that you cannot release.
+Even if you think your raw data is owned by someone else,
+in many cases you will have the right to release
+at least some subset of your analytical dataset or the indicators you constructed.
+Check with the data supplier or other professionals about licensing questions,
+particularly your right to publish derivative materials.
You should only directly publish data which is fully de-identified
and, to the extent required to ensure reasonable privacy,
potentially identifying characteristics are further masked or removed.

From b4cab852f19346cf36cef226a0c5268e0eca1210 Mon Sep 17 00:00:00 2001
From: Benjamin Daniels
Date: Tue, 21 Jan 2020 11:47:11 -0500
Subject: [PATCH 326/854] Questionnaire

---
 chapters/publication.tex | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/publication.tex b/chapters/publication.tex
index 7f0343a6b..0b30a71c1 100644
--- a/chapters/publication.tex
+++ b/chapters/publication.tex
@@ -317,7 +317,7 @@ \subsection{Publishing data for replication}
such as CSV files with accompanying codebooks,
since these will be re-adaptable by any researcher.
Additionally, when possible, you should also release
-the data collection instrument or survey used to gather the information
+the data collection instrument or survey questionnaire used to gather the information
so that readers can understand which data components
are collected directly in the field and which are derived.
You should provide a clean version of the data From fedd6169a2496aa665e950c4a9c03e202401c46a Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 11:47:39 -0500 Subject: [PATCH 327/854] Q2 --- chapters/publication.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 0b30a71c1..8f4f686c5 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -321,7 +321,7 @@ \subsection{Publishing data for replication} so that readers can understand which data components are collected directly in the field and which are derived. You should provide a clean version of the data -which corresponds exactly to the original database or instrument +which corresponds exactly to the original database or questionnaire as well as the constructed or derived dataset used for analysis. Wherever possible, you should also release the code that constructs any derived measures, From d75ce1a165126b7d22f59e6c84469961b47738f4 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 11:48:07 -0500 Subject: [PATCH 328/854] be sure --- chapters/publication.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 8f4f686c5..99bd10ca8 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -344,7 +344,7 @@ \subsection{Publishing code for replication} time investments prior to releasing your replication package. By contrast, replication code usually has few legal and privacy constraints. In most cases code will not contain identifying information; -check carefully that it does not. +but make sure to check carefully that it does not. Publishing code also requires assigning a license to it; in a majority of cases, code publishers like GitHub offer extremely permissive licensing options by default. From fb7aca54df3015bb18dc1257c989c07df9e15792 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 11:49:12 -0500 Subject: [PATCH 329/854] No branches --- chapters/publication.tex | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 99bd10ca8..5c94b0633 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -376,8 +376,7 @@ \subsection{Publishing code for replication} such as ensuring that the raw components of figures or tables are clearly identified. For example, outputs should clearly correspond by name to an exhibit in the paper, and vice versa. (Supplying a compiling \LaTeX\ document can support this.) -Code and outputs which are not used should be removed -- -if you are using GitHub, consider making them available in a different branch for transparency. +Code and outputs which are not used should be removed. \subsection{Releasing a replication package} From 36487cdedd381802485d620987839fdc70c312bb Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 11:49:46 -0500 Subject: [PATCH 330/854] no consensus --- chapters/publication.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 5c94b0633..04cb94410 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -384,7 +384,7 @@ \subsection{Releasing a replication package} all you need to do is find a place to publish your materials. This is slightly easier said than done, as there are a few variables to take into consideration -and no global consensus on the best solution. 
+and, at the time of writing, no global consensus on the best solution. The technologies available are likely to change dramatically over the next few years; the specific solutions we mention here highlight some current approaches From 99de660e8840afada7853fe2307df2673da36f74 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 11:50:41 -0500 Subject: [PATCH 331/854] size contrainst --- chapters/publication.tex | 3 +++ 1 file changed, 3 insertions(+) diff --git a/chapters/publication.tex b/chapters/publication.tex index 04cb94410..cfaa6b6c0 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -396,6 +396,9 @@ \subsection{Releasing a replication package} and allow others to look at alternate versions or histories easily. It is straightforward to simply upload a fixed directory to GitHub apply a sharing license, and obtain a URL for the whole package. +(However, there is a strict size restriction of 100MB per file and +a restriction on the size of the repository as a whole, +so larger projects will need alternative solutions.) However, GitHub is not ideal for other reasons. It is not built to hold data in an efficient way From 2cc65624d30fd9ebc560f397ea109afdc352ca16 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 11:53:14 -0500 Subject: [PATCH 332/854] Don't rag on dataverse --- chapters/publication.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index cfaa6b6c0..48f177b1e 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -407,10 +407,10 @@ \subsection{Releasing a replication package} you can change or remove the contents at any time. A repository such as the Harvard Dataverse\sidenote{ \url{https://dataverse.harvard.edu}} -addresses these issues, but is a poor place to store code. +addresses these issues, as it is designed to be a citable code repository. The Open Science Framework\sidenote{ \url{https://osf.io}} -provides a balanced implementation +also provides a balanced implementation that holds both code and data (as well as simple version histories), as does ResearchGate\sidenote{ \url{https://https://www.researchgate.net}} From dd7cc864a220f7c2a144db4b1610a4e046435cb0 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 11:54:00 -0500 Subject: [PATCH 333/854] changelog --- chapters/publication.tex | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 48f177b1e..cc1405910 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -418,7 +418,8 @@ \subsection{Releasing a replication package} Any of these locations is acceptable -- the main requirement is that the system can handle the structured directory that you are submitting, -and that it can provide a stable, structured URL for your project. +and that it can provide a stable, structured URL for your project +and report exactly what, if any, modifications you have made since initial publication. You can even combine more than one tool if you prefer, as long as they clearly point to each other. 
Emerging technologies such as CodeOcean\sidenote{ From cff628814485dece278effe3d00ac6e0f8f6db67 Mon Sep 17 00:00:00 2001 From: Maria Date: Tue, 21 Jan 2020 12:07:26 -0500 Subject: [PATCH 334/854] Ch5 re-write - added paragraph on displaying HFC output - added example for complex calculation - fixed linebreak in content-focused pilot section --- chapters/data-collection.tex | 32 +++++++++++++++++++------------- 1 file changed, 19 insertions(+), 13 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index af05dd567..3a4bb6d35 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -36,8 +36,7 @@ \section{Survey development workflow} \subsection{Content-focused Pilot} A \textbf{survey pilot}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Survey_Pilot}} is essential to finalize questionnaire design. A content-focused pilot\sidenote{\url{https://dimewiki.worldbank.org/wiki/Piloting_Survey_Content}} is best done on pen-and-paper, before the questionnaire is programmed. -The objective is to improve the structure and length of the questionnaire, refine the phrasing and translation of specific questions, and confirm coded response options are exhaustive.\sidenote{\url{https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Content)&printable=yes}} -In addition, it is an opportunity to test and refine all survey protocols, such as how units will be sampled or pre-selected units identified. The pilot must be done out-of-sample, but in a context as similar as possible to the study sample. +The objective is to improve the structure and length of the questionnaire, refine the phrasing and translation of specific questions, and confirm coded response options are exhaustive.\sidenote{\url{https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Content)&printable=yes}} In addition, it is an opportunity to test and refine all survey protocols, such as how units will be sampled or pre-selected units identified. The pilot must be done out-of-sample, but in a context as similar as possible to the study sample. \subsection{Data-focused pilot} A second survey pilot should be done after the questionnaire is programmed. @@ -188,19 +187,28 @@ \subsection{High frequency checks} these checks are survey specific, it is difficult to provide general guidance. An in-depth knowledge of the questionnaire, and a careful examination of the pre-analysis plan, is the best preparation. Examples include consistency -across multiple responses, complex calculations, suspicious patterns in survey -timing, or atypical response patters from specific enumerators. +across multiple responses, complex calculation (such as crop yield, which first requires unit conversions), +suspicious patterns in survey timing, or atypical response patters from specific enumerators. \sidenote{\url{https://dimewiki.worldbank.org/wiki/Monitoring_Data_Quality}} survey software typically provides rich metadata, which can be useful in assessing interview quality. For example, automatically collected time stamps show how long enumerators spent per question, and trace histories show how many times answers were changed before the survey was submitted. +High-frequency checks will only improve data quality if the issues they catch are communicated to the field. +There are lots of ways to do this; what's most important is to find a way to create actionable information for your team, given field constraints. 
+`ipacheck` generates an excel sheet with results for each run; these can be sent directly to the field teams. +Many teams choose other formats to display results, notably online dashboards created by custom scripts. +It is also possible to automate communication of errors to the field team by adding scripts to link the HFCs with a messaging program such as whatsapp. +Any of these solutions are possible: what works best for your team will depend on such variables as cellular networks in fieldwork areas, whether field supervisors have access to laptops, internet speed, and coding skills of the team preparing the HFC workflows. + \subsection{Data considerations for field monitoring} Careful monitoring of field work is essential for high quality data. -\textbf{Back-checks}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Back_Checks}} and other survey audits help ensure that enumerators are following established protocols, and are not falsifying data. -For back-checks, a random subset of the field sample is chosen and a subset of information from the full survey is verified through a brief interview with the original respondent. +\textbf{Back-checks}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Back_Checks}} and +other survey audits help ensure that enumerators are following established protocols, and are not falsifying data. +For back-checks, a random subset of the field sample is chosen and a subset of information from the full survey is +verified through a brief interview with the original respondent. Design of the back-check questionnaire follows the same survey design principles discussed above: you should use the pre-analysis plan or list of key outcomes to establish which subset of variables to prioritize. @@ -215,14 +223,12 @@ \subsection{Data considerations for field monitoring} \texttt{bcstats} is a useful tool for analyzing back-check data in Stata module. \sidenote{\url{https://ideas.repec.org/c/boc/bocode/s458173.html}} -Electronic surveys also provide a unique opportunity to do audits through audio recordings of the interview, typically short recordings triggered at random throughout the questionnaire. -\textbf{Audio audits} are a useful means to assess whether the enumerator is conducting the interview as expected (and not sitting under a tree making up data). +Electronic surveys also provide a unique opportunity to do audits through audio recordings of the interview, +typically short recordings triggered at random throughout the questionnaire. +\textbf{Audio audits} are a useful means to assess whether the enumerator is conducting the interview +as expected (and not sitting under a tree making up data). Do note, however, that audio audits must be included in the Informed Consent. -\textcolor{red}{ -\subsection{Dashboard} -Do we want to include something here about displaying HFCs? } - %------------------------------------------------ \section{Collecting Data Securely} @@ -234,7 +240,7 @@ \subsection{Secure data in the field} All mainstream data collection software automatically \textbf{encrypt} \sidenote{\textbf{Encryption:} the process of making information unreadable to anyone without access to a specific deciphering -key. \url{https://dimewiki.worldbank.org/wiki/Encryption}} +key. \sidenote{\url{https://dimewiki.worldbank.org/wiki/Encryption}} all data submitted from the field while in transit (i.e., upload or download). 
Your data will be encrypted from the time it leaves the device (in tablet-assisted data collection) or your browser (in web data collection), until it reaches the server. Therefore, as long as you are using established survey software, this step is largely taken care of. However, the research team must ensure that all computers, tablets, and accounts used have a logon password and are never left unlocked.

\subsection{Secure data storage}

From 312f2c084189f166183887cf928dc7dc164645d2 Mon Sep 17 00:00:00 2001
From: kbjarkefur
Date: Tue, 21 Jan 2020 12:44:28 -0500
Subject: [PATCH 335/854] [ch2] keeping interoperable but explained what we mean #318

---
 chapters/planning-data-work.tex | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex
index 6646bb368..35da19a10 100644
--- a/chapters/planning-data-work.tex
+++ b/chapters/planning-data-work.tex
@@ -148,7 +148,7 @@
that you will be doing, and plan which types of files will live in which types of sharing services.
It is important to note that they are, in general, not interoperable:
-you cannot have version-controlled files inside a syncing service,
+meaning you should not have version-controlled files inside a syncing service,
or vice versa, without setting up complex workarounds,
and you cannot shift files between them without losing historical information.
Therefore, choosing the correct sharing service at the outset is essential.

From 832cd97a113c130f0faa08bb758d814d72f8c83e Mon Sep 17 00:00:00 2001
From: kbjarkefur
Date: Tue, 21 Jan 2020 12:46:32 -0500
Subject: [PATCH 336/854] [ch2] replacing interoperable with same

---
 chapters/planning-data-work.tex | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex
index 35da19a10..26b757b81 100644
--- a/chapters/planning-data-work.tex
+++ b/chapters/planning-data-work.tex
@@ -326,7 +326,7 @@
It is worth thinking in advance about how to store, name, and organize the different types of files you will be working with,
so that there is no confusion down the line
-and everyone has interoperable expectations.
+and everyone has the same expectations.

% ----------------------------------------------------------------------------------------------
\subsection{Organizing files and folder structures}

From d2f09af98b60a0aaa26c3077f80a3c2aacfe96aa Mon Sep 17 00:00:00 2001
From: Benjamin Daniels
Date: Tue, 21 Jan 2020 13:37:00 -0500
Subject: [PATCH 337/854] collaborative collaborators collaborate

---
 chapters/publication.tex | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/chapters/publication.tex b/chapters/publication.tex
index cc1405910..10c78045c 100644
--- a/chapters/publication.tex
+++ b/chapters/publication.tex
@@ -2,9 +2,10 @@
\begin{fullwidth}
Publishing academic research today extends well beyond writing up and submitting a Word document alone.
-There are often various contributors making specialized inputs to a single output,
-a large number of iterations, versions, and revisions,
-and a wide variety of raw materials and results to be published together.
+Typically, various contributors collaborate on both code and writing,
+manuscripts go through many iterations and revisions,
+and the final package for publication includes not just a manuscript
+but also the code and data used to generate the results.
Ideally, your team will spend as little time as possible fussing with the technical requirements of publication. It is in nobody's interest for a skilled and busy researcher From 685b871fb694eed90a933401f611d8212232468d Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 13:37:57 -0500 Subject: [PATCH 338/854] Update chapters/publication.tex Co-Authored-By: Maria --- chapters/publication.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 10c78045c..a175805ff 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -15,7 +15,7 @@ collectively refered to as ``dynamic documents'' -- for managing the process of collaboration on any technical product. -For most research projects, completing a written piece is not the end of the task. +For most research projects, completing a manuscript is not the end of the task. In almost all cases, you will be required to release a replication package, which contains the code and materials needed to create the results. These represent an intellectual contribution in their own right, From c7d861f4503960d45ce1642caca9ad7ce3f20c87 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 13:38:18 -0500 Subject: [PATCH 339/854] Update chapters/publication.tex Co-Authored-By: Maria --- chapters/publication.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index a175805ff..c8a09f0d6 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -16,7 +16,7 @@ for managing the process of collaboration on any technical product. For most research projects, completing a manuscript is not the end of the task. -In almost all cases, you will be required to release a replication package, +Academic journals increasingly require submission of a replication package, which contains the code and materials needed to create the results. These represent an intellectual contribution in their own right, because they enable others to learn from your process From c1d32265de2700bc34c318e9d28e3a265009d1d2 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 13:40:26 -0500 Subject: [PATCH 340/854] de-limited --- chapters/publication.tex | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 10c78045c..da41eca32 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -15,7 +15,7 @@ collectively refered to as ``dynamic documents'' -- for managing the process of collaboration on any technical product. -For most research projects, completing a written piece is not the end of the task. +For most research projects, completing a manuscript is not the end of the task. In almost all cases, you will be required to release a replication package, which contains the code and materials needed to create the results. These represent an intellectual contribution in their own right, @@ -87,7 +87,7 @@ \subsection{Dynamic documents} Documents called ``notebooks'' (such as Jupyter\sidenote{\url{https://jupyter.org/}}) work similarly, as they also use the underlying analytical software to create the document. 
These types of dynamic documents are usually appropriate for short or informal materials -because they tend to offer limited editability outside the base software +because they tend to offer restricted editability outside the base software and often have limited abilities to incorporate precision formatting. The second group of dynamic document tools do not require @@ -96,7 +96,7 @@ \subsection{Dynamic documents} One very simple one is Dropbox Paper, a free online writing tool that allows linkages to files in Dropbox, which are then automatically updated anytime the file is replaced. -Dropbox Paper has very limited formatting options, +Dropbox Paper has very few formatting options, but it is appropriate for working with collaborators who are not using statistical software. However, the most widely utilized software From 408f6c0fe400129ac1140abf725bd61afbee7adb Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 13:43:48 -0500 Subject: [PATCH 341/854] LaTeX --- chapters/publication.tex | 19 +++++++++---------- 1 file changed, 9 insertions(+), 10 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 572b6249d..2f7830ce3 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -123,18 +123,17 @@ \subsection{Technical writing with \LaTeX} as you do in Word or the equivalent, you write plain text interlaced with coded instructions for formatting (similar in concept to HTML). -The \LaTeX\ system includes commands for simple markup -like font styles, paragraph formatting, section headers and the like. -But it also includes special controls for including tables and figures, -footnotes and endnotes, complex mathematical notation, and automated bibliography preparation. -It also allows publishers to apply global styles and templates -to already-written material, allowing them to reformat entire documents in house styles -with only a few keystrokes. -In sum, \LaTeX\ enables automatically-organized documents, -manages tables and figures dynamically, -and because it is written in a plain text file format, +Because it is written in a plain text file format, \texttt{.tex} can be version-controlled using Git. This is why it has become the dominant ``document preparation system'' in technical writing. +\LaTeX\ enables automatically-organized documents, +manages tables and figures dynamically, +and includes commands for simple markup +like font styles, paragraph formatting, section headers and the like. +It also includes special controls for including tables and figures, +footnotes and endnotes, complex mathematical notation, and automated bibliography preparation. +It also allows publishers to apply global styles and templates to already-written material, +allowing them to reformat entire documents in house styles with only a few keystrokes. Unfortunately, \LaTeX\ can be a challenge to set up and use at first, particularly if you are new to working with plain text code and file management. 
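For readers who have not worked with it before, a short fragment of body text gives a sense of what this plain-text markup looks like in practice; the section title, sentence, and numbers below are placeholder content rather than results from any study:

\begin{verbatim}
\section{Results}  % a numbered section header
The estimated effect is \textbf{large} and statistically
significant at the 5\% level.\footnote{Note how the percent
sign is escaped with a backslash so it renders as text.}
The estimating equation is
\begin{equation}
  y_{i} = \alpha + \beta T_{i} + \varepsilon_{i}
\end{equation}
\end{verbatim}

Each of these commands, from the header and bold text to the footnote and displayed equation, is plain text in the \texttt{.tex} file and only becomes formatted output when the document is compiled.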
From e3fdde023cc61fb4d00b8e2b7a85ef8a66430d27 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 13:45:14 -0500 Subject: [PATCH 342/854] Reorg LaTeX --- chapters/publication.tex | 42 ++++++++++++++++++++-------------------- 1 file changed, 21 insertions(+), 21 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 2f7830ce3..fefd8e8de 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -135,27 +135,6 @@ \subsection{Technical writing with \LaTeX} It also allows publishers to apply global styles and templates to already-written material, allowing them to reformat entire documents in house styles with only a few keystrokes. -Unfortunately, \LaTeX\ can be a challenge to set up and use at first, -particularly if you are new to working with plain text code and file management. -It is also unfortunately weak with spelling and grammar checking. -This is because \LaTeX\ requires that all formatting be done in its special code language, -and it is not particularly informative when you do something wrong. -This can be off-putting very quickly for people -who simply want to get to writing, like senior researchers. -While integrated editing and compiling tools like TeXStudio\sidenote{ - \url{https://www.texstudio.org}} -and \texttt{atom-latex}\sidenote{ - \url{https://atom.io/packages/atom-latex}} -offer the most flexibility to work with \LaTeX\ on your computer, -such as advanced integration with Git, -the entire group of writers needs to be comfortable -with \LaTeX\ before adopting one of these tools. -They can require a lot of troubleshooting at a basic level at first, -and non-technical staff may not be willing or able to acquire the required knowledge. -Therefore, to take advantage of the features of \LaTeX, -while making it easy and accessible to the entire writing team, -we need to abstract away from the technical details where possible. - One of the most important tools available in \LaTeX\ is the BibTeX bibliography manager.\sidenote{ \url{http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.365.3194&rep=rep1&type=pdf}} BibTeX keeps all the references you might use in an auxiliary file, @@ -202,6 +181,27 @@ \subsection{Technical writing with \LaTeX} and use external tools like Word's compare feature to generate integrated tracked versions when needed. +Unfortunately, despite these advantages, \LaTeX\ can be a challenge to set up and use at first, +particularly if you are new to working with plain text code and file management. +It is also unfortunately weak with spelling and grammar checking. +This is because \LaTeX\ requires that all formatting be done in its special code language, +and it is not particularly informative when you do something wrong. +This can be off-putting very quickly for people +who simply want to get to writing, like senior researchers. +While integrated editing and compiling tools like TeXStudio\sidenote{ + \url{https://www.texstudio.org}} +and \texttt{atom-latex}\sidenote{ + \url{https://atom.io/packages/atom-latex}} +offer the most flexibility to work with \LaTeX\ on your computer, +such as advanced integration with Git, +the entire group of writers needs to be comfortable +with \LaTeX\ before adopting one of these tools. +They can require a lot of troubleshooting at a basic level at first, +and non-technical staff may not be willing or able to acquire the required knowledge. 
+Therefore, to take advantage of the features of \LaTeX, +while making it easy and accessible to the entire writing team, +we need to abstract away from the technical details where possible. + \subsection{Getting started with \LaTeX\ in the cloud} \LaTeX\ is a challenging tool to get started using, From d3b5eb4a91139e42a183a8beeb3989ede43b3774 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 13:50:57 -0500 Subject: [PATCH 343/854] release instrument --- chapters/publication.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index fefd8e8de..865d022c7 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -316,8 +316,8 @@ \subsection{Publishing data for replication} you should also consider releasing generic datasets such as CSV files with accompanying codebooks, since these will be re-adaptable by any researcher. -Additionally, when possible, you should also release -the data collection instrument or survey questionnaire used to gather the information +Additionally, you should also release +the data collection instrument or survey questionnaire so that readers can understand which data components are collected directly in the field and which are derived. You should provide a clean version of the data From 77a6fde65fa985a525f7f498756ec1b0071582b2 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 13:53:42 -0500 Subject: [PATCH 344/854] props --- chapters/publication.tex | 2 ++ 1 file changed, 2 insertions(+) diff --git a/chapters/publication.tex b/chapters/publication.tex index 865d022c7..5f6e18a47 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -336,6 +336,8 @@ \subsection{Publishing code for replication} exactly what you have done in order to obtain your results, as well as to apply similar methods in future projects. Therefore it should both be functional and readable. +If you've followed the recommendations in this book, +this will be much easier to do. Code is often not written this way when it is first prepared, so it is important for you to review the content and organization so that a new reader can figure out what and how your code should do. From 519f9de65d2fd09e94b6cc1d1d71c5fe97bde34a Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 13:54:44 -0500 Subject: [PATCH 345/854] master script --- chapters/publication.tex | 1 + 1 file changed, 1 insertion(+) diff --git a/chapters/publication.tex b/chapters/publication.tex index 5f6e18a47..57c934977 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -376,6 +376,7 @@ \subsection{Publishing code for replication} They should also be able to quickly map all the outputs of the code to the locations where they are placed in the associated published material, such as ensuring that the raw components of figures or tables are clearly identified. +Documentation in the master script is often used to indicate this information. For example, outputs should clearly correspond by name to an exhibit in the paper, and vice versa. (Supplying a compiling \LaTeX\ document can support this.) Code and outputs which are not used should be removed. 
From 96839a384d6ca77d7ad7850f8a9c62bf28eeecae Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 14:13:55 -0500 Subject: [PATCH 346/854] data colada Fix #300 --- chapters/handling-data.tex | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index afa83cf10..1dfac2d9e 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -200,15 +200,15 @@ \subsection{Research credibility} Regardless of whether or not a formal pre-analysis plan is utilized, all experimental and observational studies should be \textbf{pre-registered}\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Pre-Registration}} -simply to create a record of the fact that the study was undertaken. +simply to create a record of the fact that the study was undertaken.\sidenote{\url{http://datacolada.org/12}} This is increasingly required by publishers and can be done very quickly using the \textbf{AEA} database\sidenote{\url{https://www.socialscienceregistry.org/}}, the \textbf{3ie} database\sidenote{\url{http://ridie.3ieimpact.org/}}, the \textbf{eGAP} database\sidenote{\url{http://egap.org/content/registration/}}, -or the \textbf{OSF} registry\sidenote{\url{https://osf.io/registries}} as appropriate.\sidenote{\url{http://datacolada.org/12}} +or the \textbf{OSF} registry\sidenote{\url{https://osf.io/registries}} as appropriate. \index{pre-registration} -Common research standards from journals, funders, and others feature both ex +Common research standards from journals, funders, and others feature both ex ante (or ``regulation'') and ex post (or ``verification'') policies.\cite{stodden2013toward} Ex ante policies require that authors bear the burden From f9da3e2b9a464bd2fac8bd1f9b594b7f73042c93 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 14:18:30 -0500 Subject: [PATCH 347/854] IRB wiki (#267) --- chapters/handling-data.tex | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 1dfac2d9e..37ffc7f6c 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -298,7 +298,8 @@ \subsection{Obtaining ethical approval and consent} \textbf{Institutional Review Board (IRB):} An institution formally responsible for ensuring that research meets ethical standards.} \index{Institutional Review Board} Most commonly this consists of a formal application for approval of a specific -protocol for consent, data collection, and data handling. +protocol for consent, data collection, and data handling.\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/IRB_Approval}} An IRB which has sole authority over your project is not always apparent, particularly if some institutions do not have their own. 
It is customary to obtain an approval from a university IRB From a8aa0064b595e38e7085430e46db1e72ccb004b0 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 14:34:17 -0500 Subject: [PATCH 348/854] Update chapters/planning-data-work.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- chapters/planning-data-work.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 26b757b81..61f73c38b 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -12,7 +12,7 @@ Identifying these details should help you map out the data needs for your project, giving you and your team a sense of how information resources should be organized. It's okay to update this map once the project is underway -- -the point is that everyone knows what the plan is. +the point is that everyone knows -- at any given time -- what the plan is. To implement this plan, you will need to prepare collaborative tools and workflows. Changing software or protocols halfway through a project can be costly and time-consuming, From b17c0a94348a223aa872ea2fc1ca2e4480b4b292 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 14:36:19 -0500 Subject: [PATCH 349/854] All code is plain text MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- chapters/planning-data-work.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 61f73c38b..f67df268d 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -388,7 +388,7 @@ \subsection{Organizing files and folder structures} Those two types of collaboration tools function very differently and will almost always create undesired functionality if combined.) Nearly all code files and raw outputs (not datasets) are best managed this way. -This is because code files are usually \textbf{plaintext} files, +This is because code files are always \textbf{plaintext} files, and non-technical files are usually \textbf{binary} files.\index{plaintext}\index{binary files} It's also becoming more and more common for written outputs such as reports, presentations and documentations to be written using plaintext From 9fe090640d5d0c855a465bdcd4a3c2a17e9958cc Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 14:37:07 -0500 Subject: [PATCH 350/854] Typo MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- chapters/planning-data-work.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index f67df268d..7fa335392 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -648,7 +648,7 @@ \subsection{Output management} but formatting these can be much trickier and less full-featured than other editors. So dynamic documents can be great for creating appendices or quick documents with results as you work on them, -but are not usuall considered for final papers and reports. +but are not usually considered for final papers and reports. RMarkdown\sidenote{\url{https://rmarkdown.rstudio.com/}} is the most widely adopted solution in R. 
There are also different options for Markdown in Stata, such as \texttt{markstat},\sidenote{\url{https://data.princeton.edu/stata/markdown}} From b336136a529f69217b1ede85bf10f4b974dac3f2 Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 21 Jan 2020 15:18:37 -0500 Subject: [PATCH 351/854] [ch3] minor changes --- chapters/research-design.tex | 70 ++++++++++++++++++------------------ 1 file changed, 35 insertions(+), 35 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 0c897a03b..0856a7e71 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -17,21 +17,21 @@ the data structures needed to estimate the corresponding effects, and any available code tools that will assist you in this process. -This is important to understand before going into the field for several reasons. +Understanding your design before starting data work is important for several reasons. If you do not understand how to calculate the correct estimator for your study, you will not be able to assess the power of your research design. -You will also be unable to make tradeoffs in the field +You will also be unable to make decisions in the field when you inevitable have to allocate scarce resources between tasks like maximizing sample size and ensuring follow-up with specific individuals. You will save a lot of time by understanding the way -your data needs to be organized and set up as it comes in +your data needs to be organized as it comes in before you will be able to calculate meaningful results. Just as importantly, understanding each of these approaches will allow you to keep your eyes open for research opportunities: many of the most interesting projects occur because people in the field -recognize the opportunity to implement one of these methods on the fly -in response to an unexpected event in the field. +recognize the opportunity to implement one of these methods +in response to an unexpected event. While somewhat more conceptual than practical, a basic understanding of your project's chosen approach will make you much more effective at the analytical part of your work. @@ -43,7 +43,7 @@ \section{Causality, inference, and identification} The primary goal of research design is to establish \textbf{identification} -for a parameter of interest -- that is, to demonstrate +for a parameter of interest. That means finding a source of variation in a particular input that has no other possible channel to alter a particular outcome, in order to assert that some change in that outcome was caused by that change in the input. @@ -53,7 +53,7 @@ \section{Causality, inference, and identification} of a program-specific \textbf{treatment effect}, or the change in outcomes directly attributable to exposure to what we call the \textbf{treatment}.\cite{abadie2018econometric} \index{treatment effect} -When identification is believed, then we can say with confidence +When a study is well-identified, then we can say with confidence that our estimate of the treatment effect would, with an infinite amount of data, give us a precise estimate of that treatment effect. @@ -81,11 +81,11 @@ \subsection{Estimating treatment effects using control groups} The key assumption behind estimating treatment effects is that every person, facility, or village (or whatever the unit of intervention is) -has two possible states: their outcomes if they do not recieve some treatment -and their outcomes if they do recieve that treatment. 
+has two possible states: their outcomes if they do not receive some treatment +and their outcomes if they do receive that treatment. Each unit's treatment effect is the individual difference between these two states, -and the \textbf{average treatment effect (ATE)} is the average of all of -these differences across the potentially treated population. +and the \textbf{average treatment effect (ATE)} is the average of all +individual differences across the potentially treated population. \index{average treatment effect} This is the most common parameter that research designs will want to estimate. In most designs, the goal is to establish a ``counterfactual scenario'' for the treatment group @@ -112,11 +112,11 @@ \subsection{Estimating treatment effects using control groups} average treatment effect without observing individual-level effects, but can obtain it from some comparison of averages with a \textbf{control} group. \index{causal inference}\index{control group} -Every research design is based around a way of comparing another set of observations -- +Every research design is based on a way of comparing another set of observations -- the ``control'' observations -- against the treatment group. They all work to establish that the control observations would have been identical \textit{on average} to the treated group in the absence of the treatment. -Then, the mathematical properties of averages implies that the calculated +Then, the mathematical properties of averages imply that the calculated difference in averages is equivalent to the average difference: exactly the parameter we are seeking to estimate. Therefore, almost all designs can be accurately described @@ -125,7 +125,7 @@ \subsection{Estimating treatment effects using control groups} Most of the methods that you will encounter rely on some variant of this strategy, which is designed to maximize the ability to estimate the effect -of an average unit being offered the treatment being evaluated. +of an average unit being offered the treatment you want to evaluate. The focus on identification of the treatment effect, however, means there are several essential features to this approach that are not common in other types of statistical and data science work. @@ -133,7 +133,7 @@ \subsection{Estimating treatment effects using control groups} do not attempt to create a predictive or comprehensive model of how the outcome of interest is generated. Typically, these designs are not interested in predictive accuracy, -and the estimates and predictions that these models produce +and the estimates and predictions that they produce will not be as good at predicting outcomes or fitting the data as other models. Additionally, when control variables or other variables are used in estimation, there is no guarantee that those parameters are marginal effects. @@ -231,8 +231,6 @@ \section{Obtaining treatment effects from specific research designs} %----------------------------------------------------------------------------------------------- \subsection{Cross-sectional designs} -\textbf{Cross-sectional} surveys are the simplest possible study design: -a program is implemented, surveys are conducted, and data is analyzed. When it is an RCT, a randomization process constructs the control group at random from the population that is eligible to receive each treatment. When it is observational, we present other evidence that a similar equivalence holds. 
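A minimal sketch of how such a randomly constructed control group might be implemented reproducibly in Stata follows; the unit identifier, the seed, and the fifty-percent assignment share are hypothetical choices for illustration, not part of the patch above.

    * Sketch: reproducible assignment of half of the eligible units to treatment.
      isid unit_id, sort              // confirm unique IDs and fix the sort order
      set seed 650284                 // arbitrary example seed, recorded for reproducibility
      gen double rand = runiform()
      sort rand
      gen byte treatment = (_n <= _N/2)
      label variable treatment "Assigned to treatment (1) or control (0)"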
@@ -241,8 +239,9 @@ \subsection{Cross-sectional designs}
 and the ordinary least squares (OLS) regression
 of outcome on treatment, without any control variables,
 is an unbiased estimate of the average treatment effect.
-Cross-sectional data is simple to handle because
-for research teams do not need track anything over time.
+A \textbf{cross-section} is the simplest data structure that can be used.
+This type of data is easy to collect and handle because
+you do not need to track individuals across time or across data sets.
 A cross-section is simply a representative set of observations
 taken at a single point in time.
 If this point in time is after a treatment has been fully delivered,
@@ -260,7 +259,7 @@ \subsection{Cross-sectional designs}
 were used to stratify the treatment (in the form of strata fixed effects).\sidenote{
 	\url{https://blogs.worldbank.org/impactevaluations/impactevaluations/how-randomize-using-many-baseline-variables-guest-post-thomas-barrios}}
 \textbf{Randomization inference} can be used
-to esetimate the underlying variability in the randomization process
+to estimate the underlying variability in the randomization process
 (more on this in the next chapter).
 \textbf{Balance checks} are often reported as evidence of an effective randomization,
 and are particularly important when the design is quasi-experimental
@@ -270,14 +269,14 @@ \subsection{Cross-sectional designs}
 has no correlation between the treatment and the balance factors.\sidenote{
 	\url{https://blogs.worldbank.org/impactevaluations/should-we-require-balance-t-tests-baseline-observables-randomized-experiments}}
 
-Analysis is typically straightforward with a strong understanding of the randomization.
-A typical analysis will include a decription of the sampling and randomization process,
+Analysis is typically straightforward \textit{once you have a strong understanding of the randomization}.
+A typical analysis will include a description of the sampling and randomization process,
 summary statistics for the eligible population,
 balance checks for randomization and sample selection,
 a primary regression specification (with multiple hypotheses appropriately adjusted),
 additional specifications with adjustments for non-response, balance, and other potential contamination,
 and randomization-inference analysis or other placebo regression approaches.
-There are a number of tools that are available
+There are a number of tools that are also available
 to help with the complete process of data collection,\sidenote{
 	\url{https://toolkit.povertyactionlab.org/resource/coding-resources-randomized-evaluations}}
 to analyze balance,\sidenote{
@@ -288,7 +287,7 @@ \subsection{Cross-sectional designs}
 	\url{https://blogs.worldbank.org/impactevaluations/dealing-attrition-field-experiments}}
 
 %-----------------------------------------------------------------------------------------------
-\subsection{Differences-in-differences}
+\subsection{Difference-in-differences}
 
 Where cross-sectional designs draw their estimates of treatment effects
 from differences in outcome levels in a single measurement,
@@ -297,14 +296,14 @@ \subsection{Differences-in-differences}
 designs (abbreviated as DD, DiD, diff-in-diff, and other variants)
 estimate treatment effects from \textit{changes} in outcomes
 between two or more rounds of measurement.
- \index{differences-in-differences} + \index{difference-in-differences} In these designs, three control groups are used – the baseline level of treatment units, the baseline level of non-treatment units, and the endline level of non-treatment units.\sidenote{ \url{https://www.princeton.edu/~otorres/DID101.pdf}} The estimated treatment effect is the excess growth -of units that recieve the treatment, in the period they recieve it: +of units that receive the treatment, in the period they receive it: calculating that value is equivalent to taking the difference in means at endline and subtracting the difference in means at baseline @@ -313,7 +312,7 @@ \subsection{Differences-in-differences} and a control variable for the measurement round, but the treatment effect estimate corresponds to an interaction variable for treatment and round: -the group of observations for which the treatment is active. +it indicates the group of observations for which the treatment is active. This model critically depends on the assumption that, in the absense of the treatment, the two groups would have changed performance at the same rate over time, @@ -373,28 +372,29 @@ \subsection{Regression discontinuity} \textbf{Regression discontinuity (RD)} designs exploit sharp breaks or limits in policy designs to separate a group of potentially eligible recipients -into comparable gorups of individuals who do and do not recieve a treatment.\sidenote{ +into comparable gorups of individuals who do and do not receive a treatment.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Regression_Discontinuity}} These types of designs differ from cross-sectional and diff-in-diff designs in that the group eligible to receive treatment is not defined directly, -but instead created during the process of the treatment implementation. +but instead created during the treatment implementation. \index{regression discontinuity} In an RD design, there is typically some program or event -which has limited availability due to practical considerations or policy choices +that has limited availability due to practical considerations or policy choices and is therefore made available only to individuals who meet a certain threshold requirement. The intuition of this design is that there is an underlying \textbf{running variable} which serves as the sole determinant of access to the program, and a strict cutoff determines the value of this variable at which eligibility stops.\cite{imbens2008regression} Common examples are test score thresholds, income thresholds, and some types of lotteries.\sidenote{ \url{http://blogs.worldbank.org/impactevaluations/regression-discontinuity-porn}} -The intuition is that individuals who are just on the recieving side of the threshold +The intuition is that individuals who are just on the receiving side of the threshold will be very nearly indistinguishable from those on the non-receiving side, and their post-treatment outcomes are therefore directly comparable.\cite{lee2010regression} The key assumption here is that the running variable cannot be directly manipulated -by the potential recipients; if the running variable is time there are special considerations.\cite{hausman2018regression} +by the potential recipients. +If the running variable is time there are special considerations.\cite{hausman2018regression} Regression discontinuity designs are, once implemented, -very similar in analysis to cross-sectional or differences-in-differences designs. 
+very similar in analysis to cross-sectional or difference-in-differences designs. Depending on the data that is available, the analytical approach will center on the comparison of individuals who are narrowly on the inclusion side of the discontinuity, @@ -438,7 +438,7 @@ \subsection{Instrumental variables} \textbf{Instrumental variables (IV)} designs, unlike the previous approaches, begin by assuming that the treatment delivered in the study in question is -linked to outcomes such that the effect is not directly identifiable. +linked to the outcome, so its effect is not directly identifiable. Instead, similar to regression discontinuity designs, IV attempts to focus on a subset of the variation in treatment uptake and assesses that limited window of variation that can be argued @@ -454,7 +454,7 @@ \subsection{Instrumental variables} As in regression discontinuity designs, the fundamental form of the regression -is similar to either cross-sectional or differences-in-differences designs. +is similar to either cross-sectional or difference-in-differences designs. However, instead of controlling for the instrument directly, the IV approach typically uses the \textbf{two-stage-least-squares (2SLS)} estimator.\sidenote{ \url{http://www.nuff.ox.ac.uk/teaching/economics/bond/IV\%20Estimation\%20Using\%20Stata.pdf}} From 571cc18e54cab8b3ac5206af106c2911eb730c59 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 16:29:32 -0500 Subject: [PATCH 352/854] Update chapters/research-design.tex Co-Authored-By: Luiza Andrade --- chapters/research-design.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 0856a7e71..eebaccdc3 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -15,7 +15,7 @@ The intent is for you to obtain an understanding of the way in which each method constructs treatment and control groups, the data structures needed to estimate the corresponding effects, -and any available code tools that will assist you in this process. +and available code tools that will assist you in this process (the list, of course, is not exhaustive). Understanding your design before starting data work is important for several reasons. If you do not understand how to calculate the correct estimator for your study, From 81262b21221aa02d4ca1f0f101723b445cf3db0a Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 16:33:31 -0500 Subject: [PATCH 353/854] save time --- chapters/research-design.tex | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 0856a7e71..44b8e639a 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -25,8 +25,8 @@ between tasks like maximizing sample size and ensuring follow-up with specific individuals. You will save a lot of time by understanding the way -your data needs to be organized as it comes in -before you will be able to calculate meaningful results. +your data needs to be organized +in order to be able to calculate meaningful results. Just as importantly, understanding each of these approaches will allow you to keep your eyes open for research opportunities: many of the most interesting projects occur because people in the field @@ -240,7 +240,7 @@ \subsection{Cross-sectional designs} of outcome on treatment, without any control variables, is an unbiased estimate of the average treatment effect. 
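As a rough illustration of the estimators referenced in this patch, the sketch below shows the unadjusted treatment-control comparison restated just above, and the two-stage-least-squares estimator mentioned in the instrumental-variables passage. All variable names and the clustering level are hypothetical assumptions.

    * Sketch: OLS regression of outcome on treatment, with no control variables.
      regress outcome i.treatment, vce(cluster village_id)

    * Sketch: 2SLS, using randomized assignment as an instrument for actual take-up.
      ivregress 2sls outcome (takeup = treatment), vce(cluster village_id) first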
A \textbf{cross-section} is the simplest data structure that can be used.
-This type of data is easy to collect and handle because
+This type of data is easy to collect and handle because 
 you do not need to track individuals across time or across data sets.
 A cross-section is simply a representative set of observations
 taken at a single point in time.

From 038ca8d555058ec57551c30edaf13e79874a6b12 Mon Sep 17 00:00:00 2001
From: Benjamin Daniels
Date: Tue, 21 Jan 2020 16:41:37 -0500
Subject: [PATCH 354/854] Reduce understanding

---
 chapters/research-design.tex | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/chapters/research-design.tex b/chapters/research-design.tex
index 5216cf5cd..d3b4751c3 100644
--- a/chapters/research-design.tex
+++ b/chapters/research-design.tex
@@ -11,14 +11,14 @@
 This section will present a brief overview
 of the most common methods that are used in development research.
 Specifically, we will introduce you to several ``causal inference'' methods
-that are frequently used to understand the impact of real development programs.
+that are frequently used to estimate the impact of real development programs.
 The intent is for you to obtain an understanding of the way in which each method
 constructs treatment and control groups,
 the data structures needed to estimate the corresponding effects,
 and available code tools that will assist you in this process (the list, of course, is not exhaustive).
 
-Understanding your design before starting data work is important for several reasons.
-If you do not understand how to calculate the correct estimator for your study,
+Thinking through your design before starting data work is important for several reasons.
+If you do not know how to calculate the correct estimator for your study,
 you will not be able to assess the power of your research design.
 You will also be unable to make decisions in the field
 when you inevitable have to allocate scarce resources
@@ -27,13 +27,13 @@
 You will save a lot of time by understanding the way
 your data needs to be organized
 in order to be able to calculate meaningful results.
-Just as importantly, understanding each of these approaches
+Just as importantly, familiarity with each of these approaches
 will allow you to keep your eyes open for research opportunities:
 many of the most interesting projects occur because people in the field
 recognize the opportunity to implement one of these methods
 in response to an unexpected event.
 While somewhat more conceptual than practical,
-a basic understanding of your project's chosen approach will make you
+intuitive knowledge of your project's chosen approach will make you
 much more effective at the analytical part of your work.
 
 \end{fullwidth}

From 7a9918651323bc519d833ca14096704e53949083 Mon Sep 17 00:00:00 2001
From: Benjamin Daniels
Date: Tue, 21 Jan 2020 16:47:58 -0500
Subject: [PATCH 355/854] Identification

---
 chapters/research-design.tex | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/chapters/research-design.tex b/chapters/research-design.tex
index d3b4751c3..4e40ce2ef 100644
--- a/chapters/research-design.tex
+++ b/chapters/research-design.tex
@@ -42,16 +42,13 @@
 
 \section{Causality, inference, and identification}
 
-The primary goal of research design is to establish \textbf{identification}
-for a parameter of interest. 
That means finding
-a source of variation in a particular input that has no other possible channel
-to alter a particular outcome, in order to assert that some change in that outcome
-was caused by that change in the input.
+The primary goal of research design is to establish \textbf{causal identification} for an effect.
+Causal identification means establishing that a change in an input directly altered an outcome.
 \index{identification}
-When we are discussing the types of inputs commonly referred to as
+When we are discussing the types of inputs -- ``treatments'' -- commonly referred to as
 ``programs'' or ``interventions'', we are typically attempting to obtain estimates
-of a program-specific \textbf{treatment effect}, or the change in outcomes
-directly attributable to exposure to what we call the \textbf{treatment}.\cite{abadie2018econometric}
+of program-specific \textbf{treatment effects}.
+These are the changes in outcomes attributable to the treatment.\cite{abadie2018econometric}
 \index{treatment effect}
 When a study is well-identified, then we can say with confidence
 that our estimate of the treatment effect would,

From 1b32e5a56fa16307e0c5536a4ef0c25bc3a2f6bc Mon Sep 17 00:00:00 2001
From: Benjamin Daniels
Date: Tue, 21 Jan 2020 16:50:27 -0500
Subject: [PATCH 356/854] Cite PSM

---
 chapters/research-design.tex | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/research-design.tex b/chapters/research-design.tex
index 4e40ce2ef..178a87366 100644
--- a/chapters/research-design.tex
+++ b/chapters/research-design.tex
@@ -529,7 +529,7 @@ \subsection{Matching}
 meaning that researchers can try different models of matching
 until one, by chance, leads to the final result that was desired;
 analytical approaches have shown that the better the fit of the matching model,
-the more likely it is that it has arisen by chance and is therefore biased.
+the more likely it is that it has arisen by chance and is therefore biased.\cite{king2019propensity}
 Newer methods, such as \textbf{coarsened exact matching},\cite{iacus2012causal}
 are designed to remove some of the dependence on linearity.
 In all ex-post cases, pre-specification of the exact matching model

From 4721f3e4ab2ed2f98fb25b21291014b6a591221c Mon Sep 17 00:00:00 2001
From: Benjamin Daniels
Date: Tue, 21 Jan 2020 16:54:27 -0500
Subject: [PATCH 357/854] At least one

---
 chapters/research-design.tex | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/research-design.tex b/chapters/research-design.tex
index 178a87366..e4fb6507b 100644
--- a/chapters/research-design.tex
+++ b/chapters/research-design.tex
@@ -575,7 +575,7 @@ \subsection{Synthetic controls}
 can be thought of as balancing by matching the composition of the treated unit.
 To construct this estimator,
 the synthetic controls method requires
-a significant amount of retrospective data on the treatment unit and possible comparators,
+at least one period of retrospective data on the treatment unit and possible comparators,
 including historical data on the outcome of interest for all units.
The counterfactual blend is chosen by optimizing the prediction of past outcomes based on the potential input characteristics, From 4721f3e4ab2ed2f98fb25b21291014b6a591221c Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 16:55:48 -0500 Subject: [PATCH 358/854] Large N --- chapters/research-design.tex | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index e4fb6507b..0b37837bd 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -582,7 +582,8 @@ \subsection{Synthetic controls} and typically selects a small set of comparators to weight into the final analysis. These datasets therefore may not have a large number of variables or observations, but the extent of the time series both before and after the implementation -of the treatment are the key sources of power for the estimate. +of the treatment are key sources of power for the estimate, +as are the number of counterfactual units available. Visualizations are often excellent demonstrations of these results. The \texttt{synth} package provides functionality for use in Stata, although since there are a large number of possible parameters From 4c06288ec5465798110b434b18b0f9c97a2e47f4 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 16:56:51 -0500 Subject: [PATCH 359/854] Stata and R --- chapters/research-design.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 0b37837bd..87f9ba06e 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -585,7 +585,7 @@ \subsection{Synthetic controls} of the treatment are key sources of power for the estimate, as are the number of counterfactual units available. Visualizations are often excellent demonstrations of these results. -The \texttt{synth} package provides functionality for use in Stata, +The \texttt{synth} package provides functionality for use in Stata and R, although since there are a large number of possible parameters and implementations of the design it can be complex to operate.\sidenote{ \url{https://web.stanford.edu/~jhain/synthpage.html}} From ba43eb7a8921cea477fee75d054c15c7e1854038 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 17:39:56 -0500 Subject: [PATCH 360/854] Be direct --- chapters/research-design.tex | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 87f9ba06e..86f6ea32f 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -32,8 +32,7 @@ many of the most interesting projects occur because people in the field recognize the opportunity to implement one of these methods in response to an unexpected event. -While somewhat more conceptual than practical, -intuitive knowledge of your project's chosen approach will make you +Intuitive knowledge of your project's chosen approach will make you much more effective at the analytical part of your work. 
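For the \texttt{synth} command referenced in the patches above, a minimal call might look like the sketch below. The panel structure, variable names, treated-unit code, and treatment period are all hypothetical, and real applications typically involve many more predictor and tuning choices.

    * Sketch: synthetic control for a single treated unit (user-written synth package).
    * ssc install synth
      tsset unit year
      synth outcome predictor1 predictor2 outcome(2000) outcome(2005), ///
          trunit(17) trperiod(2008) fig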
\end{fullwidth} From c856ff955c5d5cd3153531b06f5f81512bd2b970 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 17:40:34 -0500 Subject: [PATCH 361/854] Reorder --- chapters/research-design.tex | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 86f6ea32f..7a1acefa1 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -41,14 +41,14 @@ \section{Causality, inference, and identification} -The primary goal of research design is to establish \textbf{causal identification} for an effect. -Causal identification means establishing that a change in an input directly altered an outcome. - \index{identification} When we are discussing the types of inputs -- ``treatments'' -- commonly referred to as ``programs'' or ``interventions'', we are typically attempting to obtain estimates of program-specific \textbf{treatment effects} These are the changes in outcomes attributable to the treatment.\cite{abadie2018econometric} \index{treatment effect} +The primary goal of research design is to establish \textbf{causal identification} for an effect. +Causal identification means establishing that a change in an input directly altered an outcome. + \index{identification} When a study is well-identified, then we can say with confidence that our estimate of the treatment effect would, with an infinite amount of data, From 99ecc2dbbab0eb0d4524f8fd3e6e28cff67242d9 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 17:41:34 -0500 Subject: [PATCH 362/854] Update chapters/research-design.tex Co-Authored-By: Luiza Andrade --- chapters/research-design.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 7a1acefa1..ec045c166 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -56,7 +56,7 @@ \section{Causality, inference, and identification} Under this condition, we can proceed to draw evidence from the limited samples we have access to, using statistical techniques to express the uncertainty of not having infinite data. Without identification, we cannot say that the estimate would be accurate, -even with unlimited data, and therefore cannot associate it to the treatment +even with unlimited data, and therefore cannot attribute it to the treatment in the small samples that we typically have access to. Conversely, more data is not a substitute for a well-identified experimental design. Therefore it is important to understand how exactly your study From 7a9918651323bc519d833ca14096704e53949083 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 17:41:57 -0500 Subject: [PATCH 363/854] Update chapters/research-design.tex Co-Authored-By: Luiza Andrade --- chapters/research-design.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index ec045c166..7abc3c9a2 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -64,7 +64,7 @@ \section{Causality, inference, and identification} so you can calculate and interpret those estimates appropriately. All the study designs we discuss here use the \textbf{potential outcomes} framework to compare a group that received some treatment to another, counterfactual group. 
-Each of these types of approaches can be used in two contexts: +Each of these approaches can be used in two contexts: \textbf{experimental} designs, in which the research team is directly responsible for creating the variation in treatment, and \textbf{quasi-experimental} designs, in which the team From 36fbd1d1f67e8a72695dcadfacdb330a75c5dff6 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 17:46:52 -0500 Subject: [PATCH 364/854] Apply suggestions from code review Co-Authored-By: Luiza Andrade --- chapters/research-design.tex | 58 ++++++++++++++++++------------------ 1 file changed, 29 insertions(+), 29 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 7abc3c9a2..b43f6356f 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -69,7 +69,7 @@ \section{Causality, inference, and identification} is directly responsible for creating the variation in treatment, and \textbf{quasi-experimental} designs, in which the team identifies a ``natural'' source of variation and uses it for identification. -Neither type of approach is implicitly better or worse, +Neither approach is implicitly better or worse, and both are capable of achieving effect identification under different contexts. %----------------------------------------------------------------------------------------------- @@ -83,8 +83,8 @@ \subsection{Estimating treatment effects using control groups} and the \textbf{average treatment effect (ATE)} is the average of all individual differences across the potentially treated population. \index{average treatment effect} -This is the most common parameter that research designs will want to estimate. -In most designs, the goal is to establish a ``counterfactual scenario'' for the treatment group +This is the parameter that most research designs attempt to estimate. +Their goal is to establish a ``counterfactual scenario'' for the treatment group with which outcomes can be directly compared. There are several resources that provide more or less mathematically intensive approaches to understanding how various methods to his. @@ -120,7 +120,7 @@ \subsection{Estimating treatment effects using control groups} \url{http://nickchk.com/econ305.html}} Most of the methods that you will encounter rely on some variant of this strategy, -which is designed to maximize the ability to estimate the effect +which is designed to maximize their ability to estimate the effect of an average unit being offered the treatment you want to evaluate. The focus on identification of the treatment effect, however, means there are several essential features to this approach @@ -128,13 +128,13 @@ \subsection{Estimating treatment effects using control groups} First, the econometric models and estimating equations used do not attempt to create a predictive or comprehensive model of how the outcome of interest is generated. -Typically, these designs are not interested in predictive accuracy, +Typically, causal inference designs are not interested in predictive accuracy, and the estimates and predictions that they produce will not be as good at predicting outcomes or fitting the data as other models. Additionally, when control variables or other variables are used in estimation, -there is no guarantee that those parameters are marginal effects. +there is no guarantee that the resulting parameters are marginal effects. They can only be interpreted as correlative averages, -unless the experimenter has additional sources of identification for them. 
+unless there are additional sources of identification. The models you will construct and estimate are intended to do exactly one thing: to express the intention of your project's research design, and to accurately estimate the effect of the treatment it is evaluating. @@ -170,8 +170,8 @@ \subsection{Experimental and quasi-experimental research designs} Randomized designs all share several major statistical concerns. The first is the fact that it is always possible to select a control group, -by chance, which was not in fact going to be very similar to the treatment group. -This feature is called randomization noise, and all RCTs share the need to understand +by chance, which is not in fact very similar to the treatment group. +This feature is called randomization noise, and all RCTs share the need to assess how randomization noise may impact the estimates that are obtained. Second, takeup and implementation fidelity are extremely important, since programs will by definition have no effect @@ -188,12 +188,12 @@ \subsection{Experimental and quasi-experimental research designs} \textbf{Quasi-experimental} research designs,\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Quasi-Experimental_Methods}} -by contrast, are inference methods based on events not controlled by the research team. +by contrast, are causal inference methods based on events not controlled by the research team. Instead, they rely on ``experiments of nature'', in which natural variation can be argued to approximate the type of exogenous variation in treatment availability that a researcher would attempt to create with an experiment.\cite{dinardo2016natural} -Unlike with carefully planned experimental designs, +Unlike carefully planned experimental designs, quasi-experimental designs typically require the extra luck of having access to data collected at the right times and places to exploit events that occurred in the past, @@ -205,9 +205,9 @@ \subsection{Experimental and quasi-experimental research designs} Quasi-experimental designs therefore can access a much broader range of questions, and with much less effort in terms of executing an intervention. However, they require in-depth understanding of the precise events -the researcher wishes to address in order to know what data to collect +the researcher wishes to address in order to know what data to use and how to model the underlying natural experiment. -Additionally, because the population who will have been exposed +Additionally, because the population exposed to such events is limited by the scale of the event, quasi-experimental designs are often power-constrained. There is nothing the research team can do to increase power @@ -227,9 +227,9 @@ \section{Obtaining treatment effects from specific research designs} %----------------------------------------------------------------------------------------------- \subsection{Cross-sectional designs} -When it is an RCT, a randomization process constructs the control group at random +In an RCT, the control group is randomly constructed from the population that is eligible to receive each treatment. -When it is observational, we present other evidence that a similar equivalence holds. +In an observational study, we present other evidence that a similar equivalence holds. 
Therefore, by construction, each unit's receipt of the treatment is unrelated to any of its other characteristics and the ordinary least squares (OLS) regression @@ -244,7 +244,7 @@ \subsection{Cross-sectional designs} then the outcome values at that point in time already reflect the effect of the treatment. -What needs to be carefully maintained in data for cross-sectional RCTs +For cross-sectional RCTs, what needs to be carefully maintained in data is the treatment randomization process itself, as well as detailed field data about differences in data quality and loss-to-follow-up across groups.\cite{athey2017econometrics} @@ -305,13 +305,13 @@ \subsection{Difference-in-differences} the difference in means at baseline (hence the singular ``difference-in-differences'').\cite{mckenzie2012beyond} The regression model includes a control variable for treatment assignment, -and a control variable for the measurement round, +and a control variable for time period, but the treatment effect estimate corresponds to -an interaction variable for treatment and round: +an interaction variable for treatment and time: it indicates the group of observations for which the treatment is active. This model critically depends on the assumption that, in the absense of the treatment, -the two groups would have changed performance at the same rate over time, +the outcome of the two groups would have changed at the same rate over time, typically referred to as the \textbf{parallel trends} assumption.\sidenote{ \url{https://blogs.worldbank.org/impactevaluations/often-unspoken-assumptions-behind-difference-difference-estimator-practice}} Experimental approaches satisfy this requirement in expectation, @@ -326,22 +326,22 @@ \subsection{Difference-in-differences} as well as their execution in the field, are critically important to maintain alongside the survey results. In panel data structures, we attempt to observe the exact same units -in the repeated rounds, so that we see the same individuals +in different points in time, so that we see the same individuals both before and after they have received treatment (or not).\sidenote{ \url{https://blogs.worldbank.org/impactevaluations/what-are-we-estimating-when-we-estimate-difference-differences}} -This allows each unit's baseline outcome to be used -as an additional control for its endline outcome, +This allows each unit's baseline outcome (the outcome before the intervention) to be used +as an additional control for its endline outcome (the last outcome observation in the data), a \textbf{fixed effects} design often referred to as an ANCOVA model, which can provide large increases in power and robustness.\sidenote{ \url{https://blogs.worldbank.org/impactevaluations/another-reason-prefer-ancova-dealing-changes-measurement-between-baseline-and-follow}} -When tracking individuals over rounds for this purpose, +When tracking individuals over time for this purpose, maintaining sampling and tracking records is especially important, because attrition and loss-to-follow-up will remove that unit's information -from all rounds of observation, not just the one they are unobserved in. +from all points in time, not just the one they are unobserved in. 
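A compact sketch of the two specifications described in the hunks above, the treatment-by-time interaction and the ANCOVA model with a baseline control, is shown below with hypothetical variable names (\texttt{post} marks the endline round) and an assumed unit-level clustering.

    * Sketch: difference-in-differences via a treatment-by-period interaction.
      regress outcome i.treatment##i.post, vce(cluster unit_id)

    * Sketch: ANCOVA, regressing the endline outcome on treatment and its baseline value.
      regress outcome_end i.treatment outcome_base, vce(cluster unit_id)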
Panel-style experiments therefore require substantially more effort in the field work portion.\sidenote{ \url{https://www.princeton.edu/~otorres/Panel101.pdf}} -Since baseline and endline data collection may be far apart, +Since baseline and endline may be far apart in time, it is important to create careful records during the first round so that follow-ups can be conducted with the same subjects, and attrition across rounds can be properly taken into account.\sidenote{ @@ -370,7 +370,7 @@ \subsection{Regression discontinuity} in policy designs to separate a group of potentially eligible recipients into comparable gorups of individuals who do and do not receive a treatment.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Regression_Discontinuity}} -These types of designs differ from cross-sectional and diff-in-diff designs +These designs differ from cross-sectional and diff-in-diff designs in that the group eligible to receive treatment is not defined directly, but instead created during the treatment implementation. \index{regression discontinuity} @@ -378,12 +378,12 @@ \subsection{Regression discontinuity} that has limited availability due to practical considerations or policy choices and is therefore made available only to individuals who meet a certain threshold requirement. The intuition of this design is that there is an underlying \textbf{running variable} -which serves as the sole determinant of access to the program, +that serves as the sole determinant of access to the program, and a strict cutoff determines the value of this variable at which eligibility stops.\cite{imbens2008regression} Common examples are test score thresholds, income thresholds, and some types of lotteries.\sidenote{ \url{http://blogs.worldbank.org/impactevaluations/regression-discontinuity-porn}} The intuition is that individuals who are just on the receiving side of the threshold -will be very nearly indistinguishable from those on the non-receiving side, +will be very nearly indistinguishable from those who are just under it, and their post-treatment outcomes are therefore directly comparable.\cite{lee2010regression} The key assumption here is that the running variable cannot be directly manipulated by the potential recipients. @@ -397,7 +397,7 @@ \subsection{Regression discontinuity} compared against those who are narrowly on the exclusion side.\sidenote{ \url{https://cattaneo.princeton.edu/books/Cattaneo-Idrobo-Titiunik_2019\_CUP-Vol1.pdf}} The regression model will be identical to the matching research designs -(ie, contingent whether data has one or more rounds +(i.e., contingent on whether data has one or more time periods and whether the same units are known to be observed repeatedly). 
The treatment effect will be identified, however, by the addition of a control for the running variable -- meaning that the treatment effect variable From 42a62f59159455bf4e57fdaa5e6673cdc6745a6e Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 17:48:22 -0500 Subject: [PATCH 365/854] Excellent resources --- bibliography.bib | 11 +++++++++++ chapters/research-design.tex | 2 +- 2 files changed, 12 insertions(+), 1 deletion(-) diff --git a/bibliography.bib b/bibliography.bib index bf6935128..a23f3748d 100644 --- a/bibliography.bib +++ b/bibliography.bib @@ -1,3 +1,14 @@ +@article{king2019propensity, + title={Why propensity scores should not be used for matching}, + author={King, Gary and Nielsen, Richard}, + journal={Political Analysis}, + volume={27}, + number={4}, + pages={435--454}, + year={2019}, + publisher={Cambridge University Press} +} + @article{abadie2010synthetic, title={Synthetic control methods for comparative case studies: Estimating the effect of California’s tobacco control program}, author={Abadie, Alberto and Diamond, Alexis and Hainmueller, Jens}, diff --git a/chapters/research-design.tex b/chapters/research-design.tex index b43f6356f..f8bb19648 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -95,7 +95,7 @@ \subsection{Estimating treatment effects using control groups} \url{https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/} \\ \noindent \url{http://scunning.com/cunningham_mixtape.pdf}} \textit{Mostly Harmless Econometrics} and \textit{Mastering Metrics} -are canonical treatments of the statistical principles behind all econometric approaches.\sidenote{ +are excellent resources on the statistical principles behind all econometric approaches.\sidenote{ \url{https://www.researchgate.net/publication/51992844\_Mostly\_Harmless\_Econometrics\_An\_Empiricist's\_Companion} \\ \noindent \url{http://assets.press.princeton.edu/chapters/s10363.pdf}} From 6062a89ad9b396f6e9abd850415d943b0552e1bc Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 17:49:08 -0500 Subject: [PATCH 366/854] Update chapters/research-design.tex Co-Authored-By: Luiza Andrade --- chapters/research-design.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index f8bb19648..e12ae6288 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -106,7 +106,7 @@ \subsection{Estimating treatment effects using control groups} Instead, we typically make inferences from samples. \textbf{Causal inference} methods are those in which we are able to estimate the average treatment effect without observing individual-level effects, -but can obtain it from some comparison of averages with a \textbf{control} group. +but through some comparison of averages with a \textbf{control} group. \index{causal inference}\index{control group} Every research design is based on a way of comparing another set of observations -- the ``control'' observations -- against the treatment group. 
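For the regression discontinuity estimation discussed in the research-design hunks above, the user-written \texttt{rdrobust} package implements the local comparison around the cutoff; the sketch below assumes hypothetical variable names and an assumed cutoff of 50.

    * Sketch: local-polynomial RD estimate and plot around an assumed cutoff.
    * ssc install rdrobust
      rdrobust outcome running_score, c(50)
      rdplot   outcome running_score, c(50)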
From a250a1cb110ace99d7e145c98d7042f1b48eda3f Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 17:49:58 -0500 Subject: [PATCH 367/854] causal identification --- chapters/research-design.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index e12ae6288..85afe8cbd 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -123,7 +123,7 @@ \subsection{Estimating treatment effects using control groups} which is designed to maximize their ability to estimate the effect of an average unit being offered the treatment you want to evaluate. The focus on identification of the treatment effect, however, -means there are several essential features to this approach +means there are several essential features to causal identification methods that are not common in other types of statistical and data science work. First, the econometric models and estimating equations used do not attempt to create a predictive or comprehensive model From a520f5e8e436cf89b06f1a49269c783123406872 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 17:52:55 -0500 Subject: [PATCH 368/854] Fidelity --- chapters/research-design.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 85afe8cbd..8969a11d1 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -175,8 +175,8 @@ \subsection{Experimental and quasi-experimental research designs} how randomization noise may impact the estimates that are obtained. Second, takeup and implementation fidelity are extremely important, since programs will by definition have no effect -if they are not in fact accepted by or delivered to -the people who are supposed to receive them. +if the population intended to be treated +does not accept or does not receive the treatment. Unfortunately, these effects kick in very quickly and are highly nonlinear: 70\% takeup or efficacy doubles the required sample, and 50\% quadruples it.\sidenote{ \url{https://blogs.worldbank.org/impactevaluations/power-calculations-101-dealing-with-incomplete-take-up}} From a93ed0ebd03423fb96fed00e2bb72a25756201df Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 17:53:47 -0500 Subject: [PATCH 369/854] implementation gaps --- chapters/research-design.tex | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 8969a11d1..e12ae6288 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -123,7 +123,7 @@ \subsection{Estimating treatment effects using control groups} which is designed to maximize their ability to estimate the effect of an average unit being offered the treatment you want to evaluate. The focus on identification of the treatment effect, however, -means there are several essential features to causal identification methods +means there are several essential features to this approach that are not common in other types of statistical and data science work. First, the econometric models and estimating equations used do not attempt to create a predictive or comprehensive model @@ -175,8 +175,8 @@ \subsection{Experimental and quasi-experimental research designs} how randomization noise may impact the estimates that are obtained. 
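One way to assess the randomization noise described just above is randomization inference: re-drawing the treatment assignment many times and comparing the actual estimate to the resulting distribution. The sketch below uses the user-written \texttt{ritest} command, with hypothetical variable names and an arbitrary seed.

    * Sketch: randomization inference for the simple treatment comparison.
    * ssc install ritest
      ritest treatment _b[treatment], reps(1000) seed(650284): ///
          regress outcome treatment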
Second, takeup and implementation fidelity are extremely important, since programs will by definition have no effect -if the population intended to be treated -does not accept or does not receive the treatment. +if they are not in fact accepted by or delivered to +the people who are supposed to receive them. Unfortunately, these effects kick in very quickly and are highly nonlinear: 70\% takeup or efficacy doubles the required sample, and 50\% quadruples it.\sidenote{ \url{https://blogs.worldbank.org/impactevaluations/power-calculations-101-dealing-with-incomplete-take-up}} From 747140b0b696728625e2942bf9d6aef3fd42f22e Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 17:54:28 -0500 Subject: [PATCH 370/854] implementation gaps --- chapters/research-design.tex | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index e12ae6288..55c94d2f8 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -123,7 +123,7 @@ \subsection{Estimating treatment effects using control groups} which is designed to maximize their ability to estimate the effect of an average unit being offered the treatment you want to evaluate. The focus on identification of the treatment effect, however, -means there are several essential features to this approach +means there are several essential features to causal identification methods that are not common in other types of statistical and data science work. First, the econometric models and estimating equations used do not attempt to create a predictive or comprehensive model @@ -175,9 +175,9 @@ \subsection{Experimental and quasi-experimental research designs} how randomization noise may impact the estimates that are obtained. Second, takeup and implementation fidelity are extremely important, since programs will by definition have no effect -if they are not in fact accepted by or delivered to -the people who are supposed to receive them. -Unfortunately, these effects kick in very quickly and are highly nonlinear: +if the population intended to be treated +does not accept or does not receive the treatment. +Unfortunately, the loss of power happens with relatively small implementation gaps and is highly nonlinear: 70\% takeup or efficacy doubles the required sample, and 50\% quadruples it.\sidenote{ \url{https://blogs.worldbank.org/impactevaluations/power-calculations-101-dealing-with-incomplete-take-up}} Such effects are also very hard to correct ex post, From fa9da15997a74244ff6bac941a5e10bd4cdb140c Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 17:55:19 -0500 Subject: [PATCH 371/854] time and place --- chapters/research-design.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 55c94d2f8..4780b1b94 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -198,7 +198,7 @@ \subsection{Experimental and quasi-experimental research designs} of having access to data collected at the right times and places to exploit events that occurred in the past, or having the ability to collect data in a time and place -dictated by the availability of identification. +where an event that produces causal identification occurred. Therefore, these methods often use either secondary data, or use primary data in a cross-sectional retrospective method. 
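The nonlinearity of that power loss, noted in the take-up discussion above, can be checked directly with Stata's built-in power command; the sketch below assumes a standardized effect of 0.2 that incomplete take-up dilutes proportionally.

    * Sketch: required sample sizes when take-up dilutes a 0.2 standardized effect.
      power twomeans 0 0.20, sd(1) power(0.8)   // full take-up
      power twomeans 0 0.14, sd(1) power(0.8)   // 70 percent take-up (0.2 x 0.7): roughly double
      power twomeans 0 0.10, sd(1) power(0.8)   // 50 percent take-up (0.2 x 0.5): roughly quadruple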
From fec36b11278beab3624d879e848c86d1df21fefd Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 17:55:48 -0500 Subject: [PATCH 372/854] grammar --- chapters/research-design.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 4780b1b94..64f227e64 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -200,7 +200,7 @@ \subsection{Experimental and quasi-experimental research designs} or having the ability to collect data in a time and place where an event that produces causal identification occurred. Therefore, these methods often use either secondary data, -or use primary data in a cross-sectional retrospective method. +or they use primary data in a cross-sectional retrospective method. Quasi-experimental designs therefore can access a much broader range of questions, and with much less effort in terms of executing an intervention. From 56c0aded46e95fdb3ef39dcb12b362248ab85c5d Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 17:57:13 -0500 Subject: [PATCH 373/854] power calcs --- chapters/research-design.tex | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 64f227e64..e681355f7 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -210,11 +210,10 @@ \subsection{Experimental and quasi-experimental research designs} Additionally, because the population exposed to such events is limited by the scale of the event, quasi-experimental designs are often power-constrained. -There is nothing the research team can do to increase power -by providing treatment to more people or expanding the control group: -instead, power is typically maximized by ensuring -that sampling is carried out effectively -and that attrition from the sampled groups is dealt with effectively. +Since the research team cannot change the population of the study +or the treatment assignment, power is typically maximized by ensuring +that sampling for data collection is carefully powered +and that attrition from the sampled groups is minimized. Sampling noise and survey non-response are therefore analogous to the randomization noise and implementation failures that can be observed in RCT designs, and have similar implications for field work. From 682436bb0784e348fa6bade381afb90117632117 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 17:58:10 -0500 Subject: [PATCH 374/854] Don't get ahead --- chapters/research-design.tex | 3 --- 1 file changed, 3 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index e681355f7..d1768b77b 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -214,9 +214,6 @@ \subsection{Experimental and quasi-experimental research designs} or the treatment assignment, power is typically maximized by ensuring that sampling for data collection is carefully powered and that attrition from the sampled groups is minimized. -Sampling noise and survey non-response are therefore analogous -to the randomization noise and implementation failures -that can be observed in RCT designs, and have similar implications for field work. 
%----------------------------------------------------------------------------------------------- %----------------------------------------------------------------------------------------------- From 997d68366d974667dc79da2256a24d61167bb256 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 18:01:16 -0500 Subject: [PATCH 375/854] balance definition --- chapters/research-design.tex | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index d1768b77b..1d579314e 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -253,7 +253,9 @@ \subsection{Cross-sectional designs} \textbf{Randomization inference} can be used to estimate the underlying variability in the randomization process (more on this in the next chapter). -\textbf{Balance checks} are often reported as evidence of an effective randomization, +\textbf{Balance checks}\sidenote{ + \textbf{Balance checks:} Statistical tests of the similarity of treatment and control groups.} +are often reported as evidence of an effective randomization, and are particularly important when the design is quasi-experimental (since then the randomization process cannot be simulated explicitly). However, controls for balance variables are usually unnecessary in RCTs, From 5bce9f1656adb2b8505703c2c7529569ae6d67bf Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 18:03:17 -0500 Subject: [PATCH 376/854] round --- chapters/research-design.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 1d579314e..2a4ed3bfb 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -254,7 +254,7 @@ \subsection{Cross-sectional designs} to estimate the underlying variability in the randomization process (more on this in the next chapter). \textbf{Balance checks}\sidenote{ - \textbf{Balance checks:} Statistical tests of the similarity of treatment and control groups.} + \textbf{Balance checks:} Statistical tests of the similarity of treatment and control groups.} are often reported as evidence of an effective randomization, and are particularly important when the design is quasi-experimental (since then the randomization process cannot be simulated explicitly). @@ -318,7 +318,7 @@ \subsection{Difference-in-differences} There are two main types of data structures for differences-in-differences: \textbf{repeated cross-sections} and \textbf{panel data}. -In repeated cross-sections, each round contains a random sample +In repeated cross-sections, each successive round of data collection contains a random sample of observations from the treated and untreated groups; as in cross-sectional designs, both the randomization and sampling processes, as well as their execution in the field, From f4bca7eaae979522de660bfdce4a83fc12354c5e Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 18:04:11 -0500 Subject: [PATCH 377/854] no field --- chapters/research-design.tex | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 2a4ed3bfb..a1facbc72 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -320,8 +320,7 @@ \subsection{Difference-in-differences} \textbf{repeated cross-sections} and \textbf{panel data}. 
In repeated cross-sections, each successive round of data collection contains a random sample of observations from the treated and untreated groups; -as in cross-sectional designs, both the randomization and sampling processes, -as well as their execution in the field, +as in cross-sectional designs, both the randomization and sampling processes are critically important to maintain alongside the survey results. In panel data structures, we attempt to observe the exact same units in different points in time, so that we see the same individuals From 081760ed6e67700001e443cdaada562cc4afa076 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 18:05:52 -0500 Subject: [PATCH 378/854] LTFU --- chapters/research-design.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index a1facbc72..2e25ee21d 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -243,7 +243,7 @@ \subsection{Cross-sectional designs} For cross-sectional RCTs, what needs to be carefully maintained in data is the treatment randomization process itself, as well as detailed field data about differences -in data quality and loss-to-follow-up across groups.\cite{athey2017econometrics} +in data quality and loss to follow-up across groups.\cite{athey2017econometrics} Only these details are needed to construct the appropriate estimator: clustering of the estimate is required at the level at which the treatment is assigned to observations, @@ -333,7 +333,7 @@ \subsection{Difference-in-differences} \url{https://blogs.worldbank.org/impactevaluations/another-reason-prefer-ancova-dealing-changes-measurement-between-baseline-and-follow}} When tracking individuals over time for this purpose, maintaining sampling and tracking records is especially important, -because attrition and loss-to-follow-up will remove that unit's information +because attrition and loss to follow-up will remove that unit's information from all points in time, not just the one they are unobserved in. Panel-style experiments therefore require substantially more effort in the field work portion.\sidenote{ From ac03383cac81e3c4030e5d49f435098580068fae Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 18:06:40 -0500 Subject: [PATCH 379/854] field work --- chapters/research-design.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 2e25ee21d..59bcb4c0f 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -335,8 +335,8 @@ \subsection{Difference-in-differences} maintaining sampling and tracking records is especially important, because attrition and loss to follow-up will remove that unit's information from all points in time, not just the one they are unobserved in. 
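A hedged sketch of how the tracking and attrition point above might be handled in code; every variable and file name here (hh_id, round, treatment, endline, village) is hypothetical, and the final regression is only one common way to test for differential loss to follow-up.

    * Two-round panel: difference-in-differences estimate and an attrition check
    use "panel-constructed.dta", clear
    regress outcome i.treatment##i.endline, vce(cluster village)   // interaction term is the DiD estimate
    * Attrition removes a unit's information from every period, so flag and test it explicitly
    bysort hh_id: egen rounds_observed = count(round)
    generate attrited = (rounds_observed < 2)
    regress attrited i.treatment if endline == 0, vce(cluster village)   // differential attrition check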
-Panel-style experiments therefore require substantially more effort -in the field work portion.\sidenote{ +Panel-style experiments therefore require a lot more effort in field work +for studies that use survey data.\sidenote{ \url{https://www.princeton.edu/~otorres/Panel101.pdf}} Since baseline and endline may be far apart in time, it is important to create careful records during the first round From 70093639f8ab479d73883ff2ca4edd7389b2ddf9 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 18:08:37 -0500 Subject: [PATCH 380/854] just above --- chapters/research-design.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 59bcb4c0f..ea5385c86 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -379,7 +379,7 @@ \subsection{Regression discontinuity} and a strict cutoff determines the value of this variable at which eligibility stops.\cite{imbens2008regression} Common examples are test score thresholds, income thresholds, and some types of lotteries.\sidenote{ \url{http://blogs.worldbank.org/impactevaluations/regression-discontinuity-porn}} -The intuition is that individuals who are just on the receiving side of the threshold +The intuition is that individuals who are just above the threshold will be very nearly indistinguishable from those who are just under it, and their post-treatment outcomes are therefore directly comparable.\cite{lee2010regression} The key assumption here is that the running variable cannot be directly manipulated From 7531b12e971b7a60d4b7f27e6bd89e5716a3c2a1 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 21 Jan 2020 18:09:41 -0500 Subject: [PATCH 381/854] paragraph cleanup --- chapters/research-design.tex | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index ea5385c86..51ee18665 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -393,21 +393,20 @@ \subsection{Regression discontinuity} who are narrowly on the inclusion side of the discontinuity, compared against those who are narrowly on the exclusion side.\sidenote{ \url{https://cattaneo.princeton.edu/books/Cattaneo-Idrobo-Titiunik_2019\_CUP-Vol1.pdf}} -The regression model will be identical to the matching research designs -(i.e., contingent on whether data has one or more time periods -and whether the same units are known to be observed repeatedly). +The regression model will be identical to the matching research designs, +i.e., contingent on whether data has one or more time periods +and whether the same units are known to be observed repeatedly. The treatment effect will be identified, however, by the addition of a control for the running variable -- meaning that the treatment effect variable will only be applicable for observations in a small window around the cutoff. 
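The window-around-the-cutoff logic just described can be sketched with the user-written rdrobust and rdplot commands associated with the bandwidth-selection methods discussed here; they are not built into Stata (e.g. ssc install rdrobust), and all variable names and bandwidth values below are hypothetical.

    * Sharp regression discontinuity sketch (illustrative only)
    rdplot   outcome running_var, c(0)          // visual evidence around the cutoff
    rdrobust outcome running_var, c(0)          // data-driven bandwidth and robust inference
    rdrobust outcome running_var, c(0) h(5)     // robustness: manually specified bandwidth
    rdrobust outcome running_var, c(0) p(2)     // robustness: alternative polynomial order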
(Spatial discontinuity designs are handled a bit differently due to their multidimensionality.\sidenote{ \url{https://blogs.worldbank.org/impactevaluations/spatial-jumps}}) -In the RD model, the functional form of that control and the size of that window -(often referred to as the choice of \textbf{bandwidth} for the design) +In the RD model, the functional form of that control and the size of that window, +often referred to as the choice of \textbf{bandwidth} for the design, are the critical parameters for the result.\cite{calonico2019regression} Therefore, RD analysis often includes extensive robustness checking using a variety of both functional forms and bandwidths, -as well as placebo testing for non-realized locations of the cutoff -(conceptually similar to the idea of randomization inference). +as well as placebo testing for non-realized locations of the cutoff. In the analytical stage, regression discontinuity designs often include a large component of visual evidence presentation.\sidenote{ From 388a7d59e5d304bb8554d602b9e10bd6eb79989b Mon Sep 17 00:00:00 2001 From: Maria Date: Wed, 22 Jan 2020 16:18:43 -0500 Subject: [PATCH 382/854] Ch5 re-write - fixed missing bracket on sidenote --- chapters/data-collection.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 3a4bb6d35..cf6e6848b 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -240,7 +240,7 @@ \subsection{Secure data in the field} All mainstream data collection software automatically \textbf{encrypt} \sidenote{\textbf{Encryption:} the process of making information unreadable to anyone without access to a specific deciphering -key. \sidenote{\url{https://dimewiki.worldbank.org/wiki/Encryption}} +key. \url{https://dimewiki.worldbank.org/wiki/Encryption}} all data submitted from the field while in transit (i.e., upload or download). Your data will be encrypted from the time it leaves the device (in tablet-assisted data collation) or your browser (in web data collection), until it reaches the server. Therefore, as long as you are using an established survey software, this step is largely taken care of. However, the research team must ensure that all computers, tablets, and accounts used have a logon password and are never left unlocked. \subsection{Secure data storage} From d09c32dea7f777838c442c9212973423330e4e5d Mon Sep 17 00:00:00 2001 From: Luiza Date: Fri, 24 Jan 2020 14:22:16 -0500 Subject: [PATCH 383/854] [ch6] clarification on encryption --- chapters/data-analysis.tex | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index b46bd20f3..cc4fd76d2 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -143,10 +143,11 @@ \section{Data cleaning} It should contain only materials that are received directly from the field. They will invariably come in a host of file formats and nearly always contain personally-identifying information.\index{personally-identifying information} These files should be retained in the raw data folder \textit{exactly as they were received}. -The folder must be encrypted if it is shared in an insecure fashion,\sidenote{\url{https://dimewiki.worldbank.org/wiki/Encryption}} -and it must be backed up in a secure offsite location. +Be mindful of where this file is stored. +Maintain a backup copy in a secure offsite location. 
Every other file is created from the raw data, and therefore can be recreated. The exception, of course, is the raw data itself, so it should never be edited directly. +Additionally, no one who is not listed in the IRB should be able to access its content, not even the company providing file-sharing services. Check if your organization has guidelines on how to store data securely, as they may offer an institutional solution. If that is not the case, you will need to encrypt the file and make sure that only IRB-listed team members have the encryption key.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Encryption}}. Secure storage of the raw data means access to it will be restricted even inside the research team.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Security}} Loading encrypted data frequently can be disruptive to the workflow. From 0dea9130edfe594872e9fdd495e2fa9c54ee6e58 Mon Sep 17 00:00:00 2001 From: Luiza Date: Fri, 24 Jan 2020 14:30:24 -0500 Subject: [PATCH 384/854] [ch6] examples of secure enviroments --- chapters/data-analysis.tex | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index cc4fd76d2..aa42a0faf 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -163,6 +163,10 @@ \section{Data cleaning} (e.g. GPS coordinates can be translated into distances). However, if sensitive information is strictly needed for analysis, all the tasks described in this chapter must be performed in a secure environment. +What that means for a specific project will depend on IRB conditions, +but a few examples are company-managed machines, +servers accessed through two-factor-authentication, +or even cold rooms. % Unique ID and data entry corrections --------------------------------------------- There are two main cases when the raw data will be modified during data cleaning. From 4c2adc1633d9cbc8acae4615523b4b7a60cd689e Mon Sep 17 00:00:00 2001 From: Luiza Date: Fri, 24 Jan 2020 14:31:02 -0500 Subject: [PATCH 385/854] [ch6] formatting --- chapters/data-analysis.tex | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index aa42a0faf..62dfc20c7 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -147,7 +147,9 @@ \section{Data cleaning} Maintain a backup copy in a secure offsite location. Every other file is created from the raw data, and therefore can be recreated. The exception, of course, is the raw data itself, so it should never be edited directly. -Additionally, no one who is not listed in the IRB should be able to access its content, not even the company providing file-sharing services. Check if your organization has guidelines on how to store data securely, as they may offer an institutional solution. If that is not the case, you will need to encrypt the file and make sure that only IRB-listed team members have the encryption key.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Encryption}}. +Additionally, no one who is not listed in the IRB should be able to access its content, not even the company providing file-sharing services. +Check if your organization has guidelines on how to store data securely, as they may offer an institutional solution. +If that is not the case, you will need to encrypt the file and make sure that only IRB-listed team members have the encryption key.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Encryption}}. 
Secure storage of the raw data means access to it will be restricted even inside the research team.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Security}} Loading encrypted data frequently can be disruptive to the workflow. From 6fa1e03e735f9acab7ae7af3a56d8bfc81417251 Mon Sep 17 00:00:00 2001 From: Luiza Date: Fri, 24 Jan 2020 14:31:45 -0500 Subject: [PATCH 386/854] [ch6] removed comment about merging --- chapters/data-analysis.tex | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 62dfc20c7..07afd2ae5 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -174,8 +174,7 @@ \section{Data cleaning} There are two main cases when the raw data will be modified during data cleaning. The first one is when there are duplicated entries in the data. Ensuring that observations are uniquely and fully identified\sidenote{\url{https://dimewiki.worldbank.org/wiki/ID\_Variable\_Properties}} -is possibly the most important step in data cleaning -(as anyone who ever tried to merge data sets that are not uniquely identified knows). +is possibly the most important step in data cleaning. Modern survey tools create unique observation identifiers. That, however, is not the same as having a unique ID variable for each individual in the sample. You want to make sure the data set has a unique ID variable From 968ba4cc8b500fc1e08136d51c17be098634c083 Mon Sep 17 00:00:00 2001 From: Luiza Date: Fri, 24 Jan 2020 14:36:17 -0500 Subject: [PATCH 387/854] [ch6] secondary data may also need corrections --- chapters/data-analysis.tex | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 07afd2ae5..43a3901a4 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -197,7 +197,8 @@ \section{Data cleaning} and how the correct value was obtained. % Data description ------------------------------------------------------------------ -Note that if you are using secondary data, the tasks described above can likely be skipped. +On average, making corrections to primary data is more time-consuming than when using secondary data. +But you should always check for possible issues in any data you are about to use. The last step of data cleaning, however, will most likely still be necessary. It consists of describing the data, so that its users have all the information needed to interact with it. This is a key step to making the data easy to use, but it can be quite repetitive. From 6de79d5b74763cabaa254cb71efcd0c97e158222 Mon Sep 17 00:00:00 2001 From: Luiza Date: Fri, 24 Jan 2020 15:01:13 -0500 Subject: [PATCH 388/854] [ch6] annotating data --- chapters/data-analysis.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 43a3901a4..288a0110e 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -200,7 +200,7 @@ \section{Data cleaning} On average, making corrections to primary data is more time-consuming than when using secondary data. But you should always check for possible issues in any data you are about to use. The last step of data cleaning, however, will most likely still be necessary. -It consists of describing the data, so that its users have all the information needed to interact with it. +It consists of annotating the data, so that its users have all the information needed to interact with it. 
This is a key step to making the data easy to use, but it can be quite repetitive.
The \texttt{iecodebook} command suite, also part of \texttt{iefieldkit},
is designed to make some of the most tedious components of this process,

From fe5b30977f476ce2a286e42be51a44077f5b76fb Mon Sep 17 00:00:00 2001
From: Luiza Date: Fri, 24 Jan 2020 15:03:57 -0500
Subject: [PATCH 389/854] [ch6] "changes" in the data

---
 chapters/data-analysis.tex | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex
index 288a0110e..16bbb4daf 100644
--- a/chapters/data-analysis.tex
+++ b/chapters/data-analysis.tex
@@ -262,7 +262,7 @@ \section{Data cleaning}
 \section{Indicator construction}
 
 % What is construction -------------------------------------
-Any changes to the original data set happen during construction.
+Data construction is the process of preparing the data points as provided in the raw data to make them suitable for analysis.
 It is at this stage that the raw data is transformed into analysis data.
 This is done by creating derived variables
 (binaries, indices, and interactions, to name a few).

From 5134e94d5e3226aad818e29b5516ada7e1da2a6a Mon Sep 17 00:00:00 2001
From: Luiza Date: Fri, 24 Jan 2020 15:19:47 -0500
Subject: [PATCH 390/854] [ch6] What to do during construction

---
 chapters/data-analysis.tex | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex
index 16bbb4daf..210aec013 100644
--- a/chapters/data-analysis.tex
+++ b/chapters/data-analysis.tex
@@ -300,9 +300,8 @@ \section{Indicator construction}
 
 % What to do during construction -----------------------------------------
 
-Construction is the step where you face the largest risk of making a mistake that will affect your results.
-Keep in mind that details and scales matter.
-It is important to check and double-check the value-assignments of questions and their scales before constructing new variables using them.
+Keep in mind that details matter when constructing variables, and overlooking them may affect your results.
+It is important to check and double-check the value-assignments of questions and their scales before constructing new variables using them.
 Are they in percentages or proportions?
 Are all variables you are combining into an index or average using the same scale?
 Are yes or no questions coded as 0 and 1, or 1 and 2?
@@ -321,6 +320,7 @@ \section{Indicator construction}
 How to treat outliers is a research question, but make sure to note the decision made by the research team, and how you came to it.
 Results can be sensitive to the treatment of outliers, so keeping the original variable in the data set will allow you to test how much it affects the estimates.
 More generally, create derived measures in new variables instead of overwriting the original information.
+Because data construction involves translating concrete data points to more abstract measurements, it is important to document exactly how each variable is derived or calculated.
% Outputs ----------------------------------------------------------------- From 342f6b8823c8c89ec0bd032b9383b16d27c943d2 Mon Sep 17 00:00:00 2001 From: Luiza Date: Fri, 24 Jan 2020 15:32:11 -0500 Subject: [PATCH 391/854] [ch6] explain multiple analysis data sets --- chapters/data-analysis.tex | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 210aec013..4bbc972ea 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -333,10 +333,11 @@ \section{Indicator construction} you may have one or multiple constructed data sets, depending on how your analysis is structured. So don't worry if you cannot create a single, ``canonical'' analysis data set. -It is common to have many purpose-built analysis datasets: -there may be a \texttt{data-wide.dta}, -\texttt{data-wide-children-only.dta}, \texttt{data-long.dta}, -\texttt{data-long-counterfactual.dta}, and many more as needed. +It is common to have many purpose-built analysis datasets. +Think of an agricultural intervention that was randomized across villages and only affected certain plots within each village. +The research team may want to run household-level regressions on income, test for plot-level productivity gains, and check if village characteristics are balanced. +Having three separate datasets for each of these three pieces of analysis will result in much cleaner do files than if they all started from the same file. + One thing all constructed data sets should have in common, though, are functionally-named variables. Constructed variables are called ``constructed'' because they were not present in the survey to start with, so making their names consistent with the survey form is not as crucial. From dc049ac40912fdd1ebbe58ece4e697de86f31e23 Mon Sep 17 00:00:00 2001 From: Luiza Date: Fri, 24 Jan 2020 15:35:08 -0500 Subject: [PATCH 392/854] [ch6] organizing analysis scripts --- chapters/data-analysis.tex | 1 - 1 file changed, 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 4bbc972ea..90cda0e36 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -381,7 +381,6 @@ \section{Writing data analysis code} The way you deal with code and outputs for exploratory and final analysis is different, and this section will discuss how. % Organizing scripts --------------------------------------------------------- During exploratory data analysis, you will be tempted to write lots of analysis into one big, impressive, start-to-finish script. -Although it's fine to write such a script if you are coding in real-time during a long analysis meeting with your PIs, this practice is error-prone. It subtly encourages poor practices such as not clearing the workspace and not reloading the constructed data set before each analysis task. It's important to take the time to organize scripts in a clean manner and to avoid mistakes. 
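One way to picture the organization described above is a small, self-contained analysis script that reloads the constructed data set and exports its own outputs. The sketch below is illustrative: file and variable names are hypothetical, and esttab belongs to the estout package referenced in the chapter's sidenotes.

    * One analysis task per script; always start from a clean workspace
    clear all
    use "data-long.dta", clear
    regress outcome i.treatment i.strata, vce(cluster village)
    esttab using "treatment-effects.tex", se label replace      // esttab is part of the estout package
    histogram outcome, by(treatment)
    graph export "outcome-distributions.png", replace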
From 83a294a9349439fe2bdfa57062e6e626b98953c4 Mon Sep 17 00:00:00 2001
From: Luiza Date: Fri, 24 Jan 2020 15:37:06 -0500
Subject: [PATCH 393/854] [ch6] democratic code names

---
 chapters/data-analysis.tex | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex
index 90cda0e36..e2723effe 100644
--- a/chapters/data-analysis.tex
+++ b/chapters/data-analysis.tex
@@ -403,7 +403,7 @@ \section{Writing data analysis code}
 To accomplish this, you will need to make sure that you have an effective data management system,
 including naming, file organization, and version control.
 Just like you did with each of the analysis datasets, name each of the individual analysis files descriptively.
-Code files such as \path{spatial-diff-in-diff.do}, \path{matching-villages.do}, and \path{summary-statistics.do}
+Code files such as \path{spatial-diff-in-diff.do}, \path{matching-villages.R}, and \path{summary-statistics.py}
 are clear indicators of what each file is doing, and allow you to find code quickly.
 If you intend to numerically order the code as they appear in a paper or report,
 leave this to near publication time.

From 97166a9e54ffbc62c3216188463210ba922afcc2 Mon Sep 17 00:00:00 2001
From: Luiza Date: Fri, 24 Jan 2020 15:39:51 -0500
Subject: [PATCH 394/854] [ch6] exploratory analysis outputs

---
 chapters/data-analysis.tex | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex
index e2723effe..8f7a7bcf0 100644
--- a/chapters/data-analysis.tex
+++ b/chapters/data-analysis.tex
@@ -450,11 +450,7 @@ \section{Writing data analysis code}
 
 \section{Exporting analysis outputs}
 
-% Exploratory analysis
 It's ok to not export each and every table and graph created during exploratory analysis.
-Instead, we suggest running them into markdown files using RMarkdown or the different dynamic document options available in Stata.
-This will allow you to update and present results quickly while maintaining a record of the different analysis tried.
-% Final analysis
 Final analysis scripts, on the other hand, should export final outputs, which are ready to be included to a paper or report.
 No manual edits, including formatting, should be necessary after exporting final outputs.
 Manual edits are difficult to replicate, and you will inevitably need to make changes to the outputs.

From eca7441691e3d16e692ea172617caa25bc94c8bb Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Kristoffer=20Bj=C3=A4rkefur?= Date: Fri, 24 Jan 2020 15:55:05 -0500
Subject: [PATCH 395/854] [ch6] remove sentence with weird negative tone
Co-Authored-By: Luiza Andrade 

---
 chapters/data-analysis.tex | 1 -
 1 file changed, 1 deletion(-)

diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex
index 8f7a7bcf0..6051c2e85 100644
--- a/chapters/data-analysis.tex
+++ b/chapters/data-analysis.tex
@@ -191,7 +191,6 @@ \section{Data cleaning}
 correcting mistakes in data entry.
 During data quality monitoring, you will inevitably encounter
 data entry mistakes, such as typos and inconsistent values.
-If you don't, you are probably not doing a very good job at looking for them.
 These mistakes should be fixed in the cleaned data set,
 and you should keep a careful record of how they were identified,
 and how the correct value was obtained.
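The record-keeping point above is often implemented directly in the cleaning code itself; the lines below are a hedged sketch in which every ID, value, and source is hypothetical.

    * Document each data-entry correction with how it was identified and where the fix came from
    replace plot_area = 2.5 if hh_id == 452    // decimal typo flagged by high-frequency checks; confirmed by supervisor call, 2020-01-15
    replace village   = 17  if hh_id == 783    // inconsistent village code; corrected from the field team's tracking sheet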
From ab0718bd5027143409b4621ca4a8bc011240ac2e Mon Sep 17 00:00:00 2001 From: Luiza Date: Fri, 24 Jan 2020 15:55:29 -0500 Subject: [PATCH 396/854] [ch6] emphasizing no copy-paste rule --- chapters/data-analysis.tex | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 8f7a7bcf0..b95f8225f 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -452,15 +452,18 @@ \section{Exporting analysis outputs} It's ok to not export each and every table and graph created during exploratory analysis. Final analysis scripts, on the other hand, should export final outputs, which are ready to be included to a paper or report. -No manual edits, including formatting, should be necessary after exporting final outputs. +No manual edits, including formatting, should be necessary after exporting final outputs -- +those that require copying and pasting edited outputs, in particular, are absolutely not advisable. Manual edits are difficult to replicate, and you will inevitably need to make changes to the outputs. Automating them will save you time by the end of the process. However, don't spend too much time formatting tables and graphs until you are ready to publish.\sidenote{For a more detailed discussion on this, including different ways to export tables from Stata, see \url{https://github.com/bbdaniels/stata-tables}} Polishing final outputs can be a time-consuming process, and you want to it as few times as possible. -We cannot stress this enough: don't ever set a workflow that requires copying and pasting results from the console. -There are numerous commands to export outputs from both R and Stata.\sidenote{Some examples are \href{ http://repec.sowi.unibe.ch/stata/estout/}{\texttt{estout}}, \href{https://www.princeton.edu/~otorres/Outreg2.pdf}{\texttt{outreg2}}, +We cannot stress this enough: don't ever set a workflow that requires copying and pasting results. +Copying results from excel to word is risk-prone and inefficient. +Copying results from a software console is risk-prone, even more inefficient, and unnecessary. +There are numerous commands to export outputs from both R and Stata to a myriad of formats.\sidenote{Some examples are \href{ http://repec.sowi.unibe.ch/stata/estout/}{\texttt{estout}}, \href{https://www.princeton.edu/~otorres/Outreg2.pdf}{\texttt{outreg2}}, and \href{https://www.benjaminbdaniels.com/stata-code/outwrite/}{\texttt{outwrite}} in Stata, and \href{https://cran.r-project.org/web/packages/stargazer/vignettes/stargazer.pdf}{\texttt{stargazer}} and \href{https://ggplot2.tidyverse.org/reference/ggsave.html}{\texttt{ggsave}} in R.} From be5e8f020c44e4579594fff706d2c0d626f2ad7b Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Fri, 24 Jan 2020 16:08:48 -0500 Subject: [PATCH 397/854] [ch6 ]small edits to safely store and share --- chapters/data-analysis.tex | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index de591721a..623a9f794 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -149,7 +149,9 @@ \section{Data cleaning} The exception, of course, is the raw data itself, so it should never be edited directly. Additionally, no one who is not listed in the IRB should be able to access its content, not even the company providing file-sharing services. Check if your organization has guidelines on how to store data securely, as they may offer an institutional solution. 
-If that is not the case, you will need to encrypt the file and make sure that only IRB-listed team members have the encryption key.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Encryption}}. +If that is not the case, you will need to encrypt the data, especially before +sharing it, and make sure that only IRB-listed team members have the +encryption key.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Encryption}}. Secure storage of the raw data means access to it will be restricted even inside the research team.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Security}} Loading encrypted data frequently can be disruptive to the workflow. From f6c991725f2ed4a1d29349426eabf992e0cfd861 Mon Sep 17 00:00:00 2001 From: Luiza Date: Fri, 24 Jan 2020 16:09:31 -0500 Subject: [PATCH 398/854] [ch6] emphasizing no copy-paste rule (2) --- chapters/data-analysis.tex | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index de591721a..9bc34e787 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -471,9 +471,14 @@ \section{Exporting analysis outputs} In Stata, that would mean always using \texttt{graph export} to save images as \texttt{.jpg}, \texttt{.png}, \texttt{.pdf}, etc., instead of \texttt{graph save}, which creates a \texttt{.gph} file that can only be opened through a Stata installation. Some publications require ``lossless'' TIFF of EPS files, which are created by specifying the desired extension. -For tables, \texttt{.tex} is preferred. -Excel \texttt{.xlsx} and \texttt{.csv} files are also acceptable, but require the extra step of copying the tables into the final output, so it can be cumbersome to ensure that your paper or report is always up-to-date. Whichever format you decide to use, remember to always specify the file extension explicitly. +For tables there are less options and more consideration to be made. +Exporting table to \texttt{.tex} should be preferred. +Excel \texttt{.xlsx} and \texttt{.csv} are also commonly used, +but require the extra step of copying the tables into the final output. +The amount of work needed in a copy-paste workflow increases rapidly with the number of tables and figures included in a research output, +and do the chances of having the wrong version a result in your paper or report. + % Formatting If you need to create a table with a very particular format, that is not automated by any command you know, consider writing the it manually From 414f3b7dbdd0033636d254b0df0c959765b0b160 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Fri, 24 Jan 2020 16:16:22 -0500 Subject: [PATCH 399/854] [ch6] manually edit raw data exception --- chapters/data-analysis.tex | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 623a9f794..c0b36fdc4 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -146,7 +146,13 @@ \section{Data cleaning} Be mindful of where this file is stored. Maintain a backup copy in a secure offsite location. Every other file is created from the raw data, and therefore can be recreated. -The exception, of course, is the raw data itself, so it should never be edited directly. +The exception, of course, is the raw data itself, so it should never be edited +directly. 
+The rare and only exception to this is if the raw data is encoded incorrectly +and some non-English letter is causing rows or columns to break incorrectly +when the data is imported. In this rare scenario you will have to fix this +manually, and then both the broken and the fixed version of the raw data should +be securely backed up. Additionally, no one who is not listed in the IRB should be able to access its content, not even the company providing file-sharing services. Check if your organization has guidelines on how to store data securely, as they may offer an institutional solution. If that is not the case, you will need to encrypt the data, especially before From f744e094cb558a9cf0af6f4442fc156096105947 Mon Sep 17 00:00:00 2001 From: Luiza Date: Fri, 24 Jan 2020 16:16:39 -0500 Subject: [PATCH 400/854] [ch6] iefolder --- chapters/data-analysis.tex | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 9bc34e787..11a2cd343 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -68,16 +68,19 @@ \section{Data management} There are many schemes to organize research data. Our preferred scheme reflects the task breakdown just discussed. \index{data organization} -We created the \texttt{iefolder}\sidenote{ +DIME Analytics created the \texttt{iefolder}\sidenote{ \url{https://dimewiki.worldbank.org/wiki/iefolder}} package (part of \texttt{ietoolkit}\sidenote{ \url{https://dimewiki.worldbank.org/wiki/ietoolkit}}) -based on our experience with primary survey data, -but it can be used for different types of data. -\texttt{iefolder} is designed to standardize folder structures across teams and projects. +to standardize folder structures across teams and projects. This means that PIs and RAs face very small costs when switching between projects, because they are organized in the same way.\sidenote{\url{https://dimewiki.worldbank.org/wiki/DataWork\_Folder}} -At the first level of this folder are what we call survey round folders.\sidenote{\url{https://dimewiki.worldbank.org/wiki/DataWork\_Survey\_Round}} +We created the command based on our experience with primary data, +but it can be used for different types of data. +Whatever you team may need in terms of organization, +the principle of creating one standard remains. + +At the first level of the structure created by \texttt{iefolder} are what we call survey round folders.\sidenote{\url{https://dimewiki.worldbank.org/wiki/DataWork\_Survey\_Round}} You can think of a ``round'' as one source of data, that will be cleaned in the same script. Inside round folders, there are dedicated folders for From 1d1d14a58dda41170d1865d8df6f2b6a645d2918 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Fri, 24 Jan 2020 16:26:17 -0500 Subject: [PATCH 401/854] [ch6] error-prone instead of risk-prone --- chapters/data-analysis.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index c0b36fdc4..90f5e3b98 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -468,7 +468,7 @@ \section{Exporting analysis outputs} and you want to it as few times as possible. We cannot stress this enough: don't ever set a workflow that requires copying and pasting results. -Copying results from excel to word is risk-prone and inefficient. +Copying results from excel to word is error-prone and inefficient. 
Copying results from a software console is risk-prone, even more inefficient, and unnecessary.
 There are numerous commands to export outputs from both R and Stata to a myriad of formats.\sidenote{Some examples are \href{ http://repec.sowi.unibe.ch/stata/estout/}{\texttt{estout}}, \href{https://www.princeton.edu/~otorres/Outreg2.pdf}{\texttt{outreg2}},
and \href{https://www.benjaminbdaniels.com/stata-code/outwrite/}{\texttt{outwrite}} in Stata,
and \href{https://cran.r-project.org/web/packages/stargazer/vignettes/stargazer.pdf}{\texttt{stargazer}}
and \href{https://ggplot2.tidyverse.org/reference/ggsave.html}{\texttt{ggsave}} in R.}

From b9cc5f1e5c5212eab45a839e088599a211a9b1bc Mon Sep 17 00:00:00 2001
From: Luiza Date: Fri, 24 Jan 2020 17:38:50 -0500
Subject: [PATCH 402/854] [ch6] publishing documentation

---
 chapters/data-analysis.tex | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex
index 1d5e91eca..628d50519 100644
--- a/chapters/data-analysis.tex
+++ b/chapters/data-analysis.tex
@@ -255,11 +255,11 @@ \section{Data cleaning}
 Throughout the data cleaning process, you will need inputs from the field,
 including enumerator manuals, survey instruments, supervisor notes,
 and data quality monitoring reports.
-These materials are part of what we call data documentation
-\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Documentation}}
-\index{Documentation},
-and should be stored in the corresponding folder,
-as you will probably need them during analysis and publication.
+These materials are essential for data documentation.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Documentation}}
+\index{Documentation}
+They should be stored in the corresponding ``Documentation'' folder for easy access,
+as you will probably need them during analysis,
+and they must be made available for publication.
 Include in the \texttt{Documentation} folder records of any corrections made to the data,
 including to duplicated entries,
 as well as communications from the field where these issues are reported.

From b595afd67ce53c4c252eaf30159c43dcd843ae81 Mon Sep 17 00:00:00 2001
From: kbjarkefur Date: Fri, 24 Jan 2020 18:01:02 -0500
Subject: [PATCH 403/854] [ch6] labeling and annotating

---
 chapters/data-analysis.tex | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex
index 628d50519..e2d5bb276 100644
--- a/chapters/data-analysis.tex
+++ b/chapters/data-analysis.tex
@@ -210,7 +210,8 @@ \section{Data cleaning}
 On average, making corrections to primary data is more time-consuming than when using secondary data.
 But you should always check for possible issues in any data you are about to use.
 The last step of data cleaning, however, will most likely still be necessary.
-It consists of annotating the data, so that its users have all the
+It consists of labeling and annotating the data, so that its users have all the
+information needed to interact with it.
 This is a key step to making the data easy to use, but it can be quite repetitive.
The \texttt{iecodebook} command suite, also part of \texttt{iefieldkit}, is designed to make some of the most tedious components of this process, From 7527ffaa42912dfa8b198d9b5fef909e5dae9027 Mon Sep 17 00:00:00 2001 From: Luiza Date: Sat, 25 Jan 2020 08:35:38 -0500 Subject: [PATCH 404/854] [ch6] changes to the raw data --- chapters/data-analysis.tex | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index e2d5bb276..804bc86e8 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -151,12 +151,12 @@ \section{Data cleaning} Every other file is created from the raw data, and therefore can be recreated. The exception, of course, is the raw data itself, so it should never be edited directly. -The rare and only exception to this is if the raw data is encoded incorrectly -and some non-English letter is causing rows or columns to break incorrectly -when the data is imported. In this rare scenario you will have to fix this -manually, and then both the broken and the fixed version of the raw data should -be securely backed up. -Additionally, no one who is not listed in the IRB should be able to access its content, not even the company providing file-sharing services. +The rare and only case when the raw data can be edited directly is when it is encoded incorrectly +and some non-English character is causing rows or columns to break at the wrong place +when the data is imported. +In this scenario, you will have to remove the special character manually, save the resulting data set \textit{in a new file} and securely back up \textit{both} the broken and the fixed version of the raw data. + +Note that no one who is not listed in the IRB should be able to access its content, not even the company providing file-sharing services. Check if your organization has guidelines on how to store data securely, as they may offer an institutional solution. If that is not the case, you will need to encrypt the data, especially before sharing it, and make sure that only IRB-listed team members have the From 529aa7340d2d54c48153e68ecd5b16545a6a1e85 Mon Sep 17 00:00:00 2001 From: Luiza Date: Sat, 25 Jan 2020 08:39:32 -0500 Subject: [PATCH 405/854] [ch6] remove 'secure environment' --- chapters/data-analysis.tex | 8 ++------ 1 file changed, 2 insertions(+), 6 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 804bc86e8..904671a80 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -174,12 +174,8 @@ \section{Data cleaning} de-identification should not affect the usability of the data. In fact, most identifying information can be converted into non-identified variables for analysis purposes (e.g. GPS coordinates can be translated into distances). -However, if sensitive information is strictly needed for analysis, -all the tasks described in this chapter must be performed in a secure environment. -What that means for a specific project will depend on IRB conditions, -but a few examples are company-managed machines, -servers accessed through two-factor-authentication, -or even cold rooms. +However, if sensitive information is strictly needed for analysis, +the data must be encrypted while performing the tasks described in this chapter. % Unique ID and data entry corrections --------------------------------------------- There are two main cases when the raw data will be modified during data cleaning. 
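Before handling either of those cases, the uniqueness of the ID variable itself is usually verified in code; below is a minimal sketch using built-in Stata commands, with a hypothetical hh_id variable.

    * Verify that the intended ID uniquely identifies observations
    isid hh_id                                  // errors out if hh_id is not a unique identifier
    duplicates report hh_id                     // summarize any duplicated IDs
    duplicates tag hh_id, generate(dup_flag)    // flag duplicates for follow-up with the field team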
From fff83e44dc471d9a8a5a36f1837671b9a786bb77 Mon Sep 17 00:00:00 2001 From: Luiza Andrade Date: Mon, 27 Jan 2020 13:04:10 -0500 Subject: [PATCH 406/854] Update chapters/research-design.tex Co-Authored-By: Maria --- chapters/research-design.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 51ee18665..3f1d4343e 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -553,7 +553,7 @@ \subsection{Synthetic controls} \textbf{Synthetic control} is a relatively newer method for the case when appropriate counterfactual individuals -do not exist in reality and there are very few (often only one) treatment unit.\cite{abadie2015comparative} +do not exist in reality and there are very few (often only one) treatment units.\cite{abadie2015comparative} \index{synthetic controls} For example, state- or national-level policy changes are typically very difficult to find valid comparators for, From e0ffa7e4610fa96e3982c49d15485221be117142 Mon Sep 17 00:00:00 2001 From: Luiza Andrade Date: Mon, 27 Jan 2020 13:05:00 -0500 Subject: [PATCH 407/854] Update chapters/research-design.tex Co-Authored-By: Maria --- chapters/research-design.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 3f1d4343e..018c90e08 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -17,7 +17,7 @@ the data structures needed to estimate the corresponding effects, and available code tools that will assist you in this process (the list, of course, is not exhaustive). -Thinking throug your design before starting data work is important for several reasons. +Thinking through your design before starting data work is important for several reasons. If you do not know how to calculate the correct estimator for your study, you will not be able to assess the power of your research design. You will also be unable to make decisions in the field From c6f8635970ff8bccc099ea681d7b847ffc7198fd Mon Sep 17 00:00:00 2001 From: Luiza Andrade Date: Mon, 27 Jan 2020 14:07:16 -0500 Subject: [PATCH 408/854] Update chapters/research-design.tex Co-Authored-By: Maria --- chapters/research-design.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 018c90e08..8a0346f9c 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -21,7 +21,7 @@ If you do not know how to calculate the correct estimator for your study, you will not be able to assess the power of your research design. You will also be unable to make decisions in the field -when you inevitable have to allocate scarce resources +when you inevitably have to allocate scarce resources between tasks like maximizing sample size and ensuring follow-up with specific individuals. 
You will save a lot of time by understanding the way From 567034e81d8481b4eb444ad05cfafb9de7a596c7 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 27 Jan 2020 16:56:41 -0500 Subject: [PATCH 409/854] methods and data --- chapters/research-design.tex | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 8a0346f9c..122bb214a 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -1,9 +1,8 @@ %----------------------------------------------------------------------------------------------- \begin{fullwidth} -Research design is the process of structuring field work --- both experimental design and data collection -- -that will answer a specific research question. +Research design is the process of defining the methods and data +that will be used to answer a specific research question. You don't need to be an expert in this, and there are lots of good resources out there that focus on designing interventions and evaluations From d18cd39d9010ac6127197f7c1d3b80d0bb5daf9a Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 27 Jan 2020 16:57:43 -0500 Subject: [PATCH 410/854] no retrospective specifics --- chapters/research-design.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 122bb214a..d9ea53266 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -569,7 +569,7 @@ \subsection{Synthetic controls} can be thought of as balancing by matching the composition of the treated unit. To construct this estimator, the synthetic controls method requires -at least one period of retrospective data on the treatment unit and possible comparators, +retrospective data on the treatment unit and possible comparators, including historical data on the outcome of interest for all units. The counterfactual blend is chosen by optimizing the prediction of past outcomes based on the potential input characteristics, From 11b91f55827c40ba40a012f22d4478109808c867 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 27 Jan 2020 17:02:12 -0500 Subject: [PATCH 411/854] Rewrite motivation --- chapters/research-design.tex | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index d9ea53266..17c79ee29 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -7,14 +7,17 @@ and there are lots of good resources out there that focus on designing interventions and evaluations as well as on econometric approaches. -This section will present a brief overview -of the most common methods that are used in development research. -Specifically, we will introduce you to several ``causal inference'' methods -that are frequently used to estimate the impact of real development programs. -The intent is for you to obtain an understanding of +Therefore, without going into technical detail, +this section will present a brief overview +of the most common methods that are used in development research, +particularly those that are widespread in program evaluation. +These ``causal inference'' methods will turn up in nearly every project, +so you will need to have a broad knowledge of how the methods in your project +are used in order to manage data and code appropriately. 
+The intent of this chapter is for you to obtain an understanding of the way in which each method constructs treatment and control groups, the data structures needed to estimate the corresponding effects, -and available code tools that will assist you in this process (the list, of course, is not exhaustive). +and some available code tools designed for each method (the list, of course, is not exhaustive). Thinking through your design before starting data work is important for several reasons. If you do not know how to calculate the correct estimator for your study, From b62766dd6792136b9572b0cf3285df8ef4d01dc0 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 27 Jan 2020 17:02:46 -0500 Subject: [PATCH 412/854] statistical --- chapters/research-design.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 17c79ee29..009c8b2d3 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -21,7 +21,7 @@ Thinking through your design before starting data work is important for several reasons. If you do not know how to calculate the correct estimator for your study, -you will not be able to assess the power of your research design. +you will not be able to assess the statistical power of your research design. You will also be unable to make decisions in the field when you inevitably have to allocate scarce resources between tasks like maximizing sample size From 8e32d1dcfad3a6336f46eccfd992adc55515be31 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 27 Jan 2020 17:08:45 -0500 Subject: [PATCH 413/854] potential outcomes --- bibliography.bib | 10 ++++++++++ chapters/research-design.tex | 2 +- 2 files changed, 11 insertions(+), 1 deletion(-) diff --git a/bibliography.bib b/bibliography.bib index a23f3748d..987f3227b 100644 --- a/bibliography.bib +++ b/bibliography.bib @@ -9,6 +9,16 @@ @article{king2019propensity publisher={Cambridge University Press} } +@article{athey2017state, + title={The state of applied econometrics: Causality and policy evaluation}, + author={Athey, Susan and Imbens, Guido W}, + journal={Journal of Economic Perspectives}, + volume={31}, + number={2}, + pages={3--32}, + year={2017} +} + @article{abadie2010synthetic, title={Synthetic control methods for comparative case studies: Estimating the effect of California’s tobacco control program}, author={Abadie, Alberto and Diamond, Alexis and Hainmueller, Jens}, diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 009c8b2d3..2f4262d09 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -64,7 +64,7 @@ \section{Causality, inference, and identification} Therefore it is important to understand how exactly your study identifies its estimate of treatment effects, so you can calculate and interpret those estimates appropriately. -All the study designs we discuss here use the \textbf{potential outcomes} framework +All the study designs we discuss here use the potential outcomes framework\cite{athey2017state} to compare a group that received some treatment to another, counterfactual group. 
Each of these approaches can be used in two contexts: \textbf{experimental} designs, in which the research team
From c4feeaa7cedb25667c71d00e5fdf9e0883a3e3ef Mon Sep 17 00:00:00 2001
From: Benjamin Daniels
Date: Mon, 27 Jan 2020 17:09:26 -0500
Subject: [PATCH 414/854] do this

---
 chapters/research-design.tex | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/research-design.tex b/chapters/research-design.tex
index 2f4262d09..11b658eff 100644
--- a/chapters/research-design.tex
+++ b/chapters/research-design.tex
@@ -89,7 +89,7 @@ \subsection{Estimating treatment effects using control groups}
Their goal is to establish a ``counterfactual scenario'' for the treatment group
with which outcomes can be directly compared.
There are several resources that provide more or less mathematically intensive
-approaches to understanding how various methods to his.
+approaches to understanding how various methods do this.
\textit{Impact Evaluation in Practice} is a strong general guide to these methods.\sidenote{
\url{https://www.worldbank.org/en/programs/sief-trust-fund/publication/impact-evaluation-in-practice}}
\textit{Causal Inference} and \textit{Causal Inference: The Mixtape}
From 3fef5ea41bffafbec1e7c823026e7724a78be202 Mon Sep 17 00:00:00 2001
From: Benjamin Daniels
Date: Mon, 27 Jan 2020 17:09:57 -0500
Subject: [PATCH 415/854] to the

---
 chapters/research-design.tex | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/research-design.tex b/chapters/research-design.tex
index 11b658eff..41abfd880 100644
--- a/chapters/research-design.tex
+++ b/chapters/research-design.tex
@@ -93,7 +93,7 @@ \subsection{Estimating treatment effects using control groups}
\textit{Impact Evaluation in Practice} is a strong general guide to these methods.\sidenote{
\url{https://www.worldbank.org/en/programs/sief-trust-fund/publication/impact-evaluation-in-practice}}
\textit{Causal Inference} and \textit{Causal Inference: The Mixtape}
-provides more detailed mathematical approaches fo the tools.\sidenote{
+provide more detailed mathematical approaches to the tools.\sidenote{
\url{https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/} \\ \noindent \url{http://scunning.com/cunningham_mixtape.pdf}}
\textit{Mostly Harmless Econometrics} and \textit{Mastering Metrics}
From 6876047e5d9ad88115446120d2f7ae3446e81459 Mon Sep 17 00:00:00 2001
From: Benjamin Daniels
Date: Mon, 27 Jan 2020 17:13:19 -0500
Subject: [PATCH 416/854] counterfactual

---
 chapters/research-design.tex | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/chapters/research-design.tex b/chapters/research-design.tex
index 41abfd880..d1c76a518 100644
--- a/chapters/research-design.tex
+++ b/chapters/research-design.tex
@@ -86,8 +86,10 @@ \subsection{Estimating treatment effects using control groups}
individual differences across the potentially treated population.
\index{average treatment effect}
This is the parameter that most research designs attempt to estimate.
-Their goal is to establish a ``counterfactual scenario'' for the treatment group
-with which outcomes can be directly compared.
+Their goal is to establish a \textbf{counterfactual}\sidenote{
+ \textbf{Counterfactual:} A statistical description of what would have happened to specific individuals in an alternative scenario.}
+for the treatment group with which outcomes can be directly compared.
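To make the potential outcomes comparison concrete, a minimal simulated sketch in Stata might look like the following; every name and number here is illustrative rather than drawn from any study, and only one of the two potential outcomes is ever observed for a given unit.

    * Illustrative simulation of the potential outcomes framework (all values hypothetical)
    clear
    set obs 1000
    set seed 215597                          // placeholder seed for this illustration only
    generate y0 = rnormal(0, 1)              // potential outcome if untreated
    generate y1 = y0 + 0.25                  // potential outcome if treated (true effect = 0.25)
    generate treatment = (runiform() < 0.5)  // randomized assignment
    generate y = cond(treatment, y1, y0)     // only one potential outcome is observed per unit
    regress y treatment                      // difference in means recovers the average treatment effect

In real data only the observed outcome and the treatment indicator exist; the designs discussed in this chapter differ in how they justify treating the comparison group's outcomes as the missing counterfactual.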
+ \index{counterfactual} There are several resources that provide more or less mathematically intensive approaches to understanding how various methods do this. \textit{Impact Evaluation in Practice} is a strong general guide to these methods.\sidenote{ From 0db327ea875984544955b3b61fefbef0cdb607b9 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 27 Jan 2020 17:13:55 -0500 Subject: [PATCH 417/854] Update chapters/research-design.tex Co-Authored-By: Maria --- chapters/research-design.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index d1c76a518..21e106034 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -127,7 +127,7 @@ \subsection{Estimating treatment effects using control groups} which is designed to maximize their ability to estimate the effect of an average unit being offered the treatment you want to evaluate. The focus on identification of the treatment effect, however, -means there are several essential features to causal identification methods +means there are several essential features of causal identification methods that are not common in other types of statistical and data science work. First, the econometric models and estimating equations used do not attempt to create a predictive or comprehensive model From c75ec81a3ba51bdb999b018c08c608985cd8b3c5 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 27 Jan 2020 17:14:18 -0500 Subject: [PATCH 418/854] Update chapters/research-design.tex Co-Authored-By: Maria --- chapters/research-design.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 21e106034..a8d568930 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -151,7 +151,7 @@ \subsection{Experimental and quasi-experimental research designs} Experimental research designs explicitly allow the research team to change the condition of the populations being studied,\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Experimental_Methods}} -often in the form of NGO programs, government regulations, +often in the form of government programs, NGO projects, new regulations, information campaigns, and many more types of interventions.\cite{banerjee2009experimental} The classic method is the \textbf{randomized control trial (RCT)}.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Randomized_Control_Trials}} From a58fdd55a20547ed9724481a778f80bcc5330976 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 27 Jan 2020 17:14:43 -0500 Subject: [PATCH 419/854] Update chapters/research-design.tex Co-Authored-By: Maria --- chapters/research-design.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index a8d568930..4c697597f 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -153,7 +153,7 @@ \subsection{Experimental and quasi-experimental research designs} \url{https://dimewiki.worldbank.org/wiki/Experimental_Methods}} often in the form of government programs, NGO projects, new regulations, information campaigns, and many more types of interventions.\cite{banerjee2009experimental} -The classic method is the \textbf{randomized control trial (RCT)}.\sidenote{ +The classic experimental method is the \textbf{randomized control trial (RCT)}.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Randomized_Control_Trials}} \index{randomized control 
trials} In randomized control trials, the control group is randomized -- From 54f2bc021d32da7ee457e0427c9335c1b7bdaa43 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 27 Jan 2020 17:23:05 -0500 Subject: [PATCH 420/854] Update chapters/research-design.tex Co-Authored-By: Maria --- chapters/research-design.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 4c697597f..744b21b32 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -166,7 +166,7 @@ \subsection{Experimental and quasi-experimental research designs} as evidenced by its broad credibility in fields ranging from clinical medicine to development. Therefore RCTs are very popular tools for determining the causal impact of specific programs or policy interventions. -However, there are many types of treatments that are impractical or unethical +However, there are many other types of interventions that are impractical or unethical to effectively approach using an experimental strategy, and therefore many limitations to accessing ``big questions'' through RCT approaches.\sidenote{ From 55152e030c3fdc99948862dea5d1a12e0f9126ba Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 27 Jan 2020 17:23:19 -0500 Subject: [PATCH 421/854] Update chapters/research-design.tex Co-Authored-By: Maria --- chapters/research-design.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 744b21b32..346a10bb0 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -168,7 +168,7 @@ \subsection{Experimental and quasi-experimental research designs} of specific programs or policy interventions. However, there are many other types of interventions that are impractical or unethical to effectively approach using an experimental strategy, -and therefore many limitations to accessing ``big questions'' +and therefore there are limitations to accessing ``big questions'' through RCT approaches.\sidenote{ \url{https://www.nber.org/papers/w14690.pdf}} From e704e2ae2784773c6a89b52a376664706dbe7950 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 27 Jan 2020 17:25:50 -0500 Subject: [PATCH 422/854] power --- chapters/research-design.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 346a10bb0..0cef7748a 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -181,7 +181,7 @@ \subsection{Experimental and quasi-experimental research designs} since programs will by definition have no effect if the population intended to be treated does not accept or does not receive the treatment. 
-Unfortunately, the loss of power happens with relatively small implementation gaps and is highly nonlinear: +Loss of statistical power occurs quickly and is highly nonlinear: 70\% takeup or efficacy doubles the required sample, and 50\% quadruples it.\sidenote{ \url{https://blogs.worldbank.org/impactevaluations/power-calculations-101-dealing-with-incomplete-take-up}} Such effects are also very hard to correct ex post, From 8e20d2c95904914d245d7a459e9ea9286e110102 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 27 Jan 2020 17:30:33 -0500 Subject: [PATCH 423/854] randomization noise --- chapters/research-design.tex | 1 + 1 file changed, 1 insertion(+) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 0cef7748a..d524aa9e3 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -177,6 +177,7 @@ \subsection{Experimental and quasi-experimental research designs} by chance, which is not in fact very similar to the treatment group. This feature is called randomization noise, and all RCTs share the need to assess how randomization noise may impact the estimates that are obtained. +(More detail on this later.) Second, takeup and implementation fidelity are extremely important, since programs will by definition have no effect if the population intended to be treated From 4198e355e9afa0d38a4063775a359fb4f505aae2 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 27 Jan 2020 17:31:25 -0500 Subject: [PATCH 424/854] less carefully powered --- chapters/research-design.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index d524aa9e3..52264fca4 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -217,7 +217,7 @@ \subsection{Experimental and quasi-experimental research designs} quasi-experimental designs are often power-constrained. Since the research team cannot change the population of the study or the treatment assignment, power is typically maximized by ensuring -that sampling for data collection is carefully powered +that sampling for data collection is carefully designed to match the study objectives and that attrition from the sampled groups is minimized. %----------------------------------------------------------------------------------------------- From feadab6eb226d8ae425f3bcb41f5eef9ee1ee893 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 27 Jan 2020 17:37:23 -0500 Subject: [PATCH 425/854] cross-sections --- chapters/research-design.tex | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 52264fca4..64c102025 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -228,22 +228,22 @@ \section{Obtaining treatment effects from specific research designs} %----------------------------------------------------------------------------------------------- \subsection{Cross-sectional designs} -In an RCT, the control group is randomly constructed +A cross-sectional research design is any type of study +that collects data in only one time period +and directly compares treatment and control groups. +This type of data is easy to collect and handle because +you do not need track individual across time or across data sets. +If this point in time is after a treatment has been fully delivered, +then the outcome values at that point in time +already reflect the effect of the treatment. 
+If the study is an RCT, the control group is randomly constructed from the population that is eligible to receive each treatment. -In an observational study, we present other evidence that a similar equivalence holds. +If it is a non-randomized observational study, we present other evidence that a similar equivalence holds. Therefore, by construction, each unit's receipt of the treatment is unrelated to any of its other characteristics and the ordinary least squares (OLS) regression of outcome on treatment, without any control variables, is an unbiased estimate of the average treatment effect. -A \textbf{cross-section} is the simplest data structure that can be used. -This type of data is easy to collect and handle because -you do not need track individual across time or across data sets. -A cross-section is simply a representative set of observations -taken at a single point in time. -If this point in time is after a treatment has been fully delivered, -then the outcome values at that point in time -already reflect the effect of the treatment. For cross-sectional RCTs, what needs to be carefully maintained in data is the treatment randomization process itself, From 945f4c7070c0a6ccd678105528b282f3e600c63d Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 27 Jan 2020 17:38:05 -0500 Subject: [PATCH 426/854] Update chapters/research-design.tex Co-Authored-By: Maria --- chapters/research-design.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 64c102025..fa4ae21c2 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -250,7 +250,7 @@ \subsection{Cross-sectional designs} as well as detailed field data about differences in data quality and loss to follow-up across groups.\cite{athey2017econometrics} Only these details are needed to construct the appropriate estimator: -clustering of the estimate is required at the level +clustering of the standard errors is required at the level at which the treatment is assigned to observations, and controls are required for variables which were used to stratify the treatment (in the form of strata fixed effects).\sidenote{ From fa3b30dc02dbadeb500a71614ccea451f5871d82 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 27 Jan 2020 17:40:12 -0500 Subject: [PATCH 427/854] strata FE --- chapters/research-design.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index fa4ae21c2..7e2ada71e 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -252,8 +252,8 @@ \subsection{Cross-sectional designs} Only these details are needed to construct the appropriate estimator: clustering of the standard errors is required at the level at which the treatment is assigned to observations, -and controls are required for variables which -were used to stratify the treatment (in the form of strata fixed effects).\sidenote{ +and variables which were used to stratify the treatment +must be included as controls (in the form of strata fixed effects).\sidenote{ \url{https://blogs.worldbank.org/impactevaluations/impactevaluations/how-randomize-using-many-baseline-variables-guest-post-thomas-barrios}} \textbf{Randomization inference} can be used to estimate the underlying variability in the randomization process From fcf5a1b1710ecbe4ac5413050ea95e03a6db41b7 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 27 Jan 2020 17:47:48 -0500 Subject: [PATCH 
428/854] primary --- chapters/research-design.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 7e2ada71e..de4073cf1 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -326,7 +326,7 @@ \subsection{Difference-in-differences} In repeated cross-sections, each successive round of data collection contains a random sample of observations from the treated and untreated groups; as in cross-sectional designs, both the randomization and sampling processes -are critically important to maintain alongside the survey results. +are critically important to maintain alongside the data. In panel data structures, we attempt to observe the exact same units in different points in time, so that we see the same individuals both before and after they have received treatment (or not).\sidenote{ @@ -341,7 +341,7 @@ \subsection{Difference-in-differences} because attrition and loss to follow-up will remove that unit's information from all points in time, not just the one they are unobserved in. Panel-style experiments therefore require a lot more effort in field work -for studies that use survey data.\sidenote{ +for studies that use primary data.\sidenote{ \url{https://www.princeton.edu/~otorres/Panel101.pdf}} Since baseline and endline may be far apart in time, it is important to create careful records during the first round From 9fcbb61e28a08c969a6484d75c078fa947a163bb Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 27 Jan 2020 17:48:30 -0500 Subject: [PATCH 429/854] did --- chapters/research-design.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index de4073cf1..619ea043c 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -349,7 +349,7 @@ \subsection{Difference-in-differences} and attrition across rounds can be properly taken into account.\sidenote{ \url{http://blogs.worldbank.org/impactevaluations/dealing-attrition-field-experiments}} -As with cross-sectional designs, this set of study designs is widespread. +As with cross-sectional designs, difference-in-differences designs are widespread. Therefore there exist a large number of standardized tools for analysis. Our \texttt{ietoolkit} Stata package includes the \texttt{ieddtab} command which produces standardized tables for reporting results.\sidenote{ From 4250d95e57df49218b937c28809ff23a806be840 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 27 Jan 2020 17:49:06 -0500 Subject: [PATCH 430/854] running --- chapters/research-design.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 619ea043c..248f20ffa 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -381,7 +381,7 @@ \subsection{Regression discontinuity} and is therefore made available only to individuals who meet a certain threshold requirement. 
The intuition of this design is that there is an underlying \textbf{running variable} that serves as the sole determinant of access to the program, -and a strict cutoff determines the value of this variable at which eligibility stops.\cite{imbens2008regression} +and a strict cutoff determines the value of this variable at which eligibility stops.\cite{imbens2008regression}\index{running variable} Common examples are test score thresholds, income thresholds, and some types of lotteries.\sidenote{ \url{http://blogs.worldbank.org/impactevaluations/regression-discontinuity-porn}} The intuition is that individuals who are just above the threshold From fe3f4f2fdc9a42cc751777d80d9713f7e7e82ddc Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 27 Jan 2020 17:49:51 -0500 Subject: [PATCH 431/854] no lotteries --- chapters/research-design.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 248f20ffa..1176e91dc 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -382,7 +382,7 @@ \subsection{Regression discontinuity} The intuition of this design is that there is an underlying \textbf{running variable} that serves as the sole determinant of access to the program, and a strict cutoff determines the value of this variable at which eligibility stops.\cite{imbens2008regression}\index{running variable} -Common examples are test score thresholds, income thresholds, and some types of lotteries.\sidenote{ +Common examples are test score thresholds and income thresholds.\sidenote{ \url{http://blogs.worldbank.org/impactevaluations/regression-discontinuity-porn}} The intuition is that individuals who are just above the threshold will be very nearly indistinguishable from those who are just under it, From 1af32e3de28d7398bd8899b9acbf6851b5b38a91 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 27 Jan 2020 17:50:44 -0500 Subject: [PATCH 432/854] LATE --- chapters/research-design.tex | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 1176e91dc..e449afcfb 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -403,7 +403,8 @@ \subsection{Regression discontinuity} and whether the same units are known to be observed repeatedly. The treatment effect will be identified, however, by the addition of a control for the running variable -- meaning that the treatment effect variable -will only be applicable for observations in a small window around the cutoff. +will only be applicable for observations in a small window around the cutoff +and the treatment effects estimated will be ``local'' rather than ``average''. 
(Spatial discontinuity designs are handled a bit differently due to their multidimensionality.\sidenote{ \url{https://blogs.worldbank.org/impactevaluations/spatial-jumps}}) In the RD model, the functional form of that control and the size of that window, From a04af4c4237cf3811eddc139b04d1741668ce053 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 27 Jan 2020 17:51:17 -0500 Subject: [PATCH 433/854] functional form --- chapters/research-design.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index e449afcfb..a37ec5fcd 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -407,7 +407,7 @@ \subsection{Regression discontinuity} and the treatment effects estimated will be ``local'' rather than ``average''. (Spatial discontinuity designs are handled a bit differently due to their multidimensionality.\sidenote{ \url{https://blogs.worldbank.org/impactevaluations/spatial-jumps}}) -In the RD model, the functional form of that control and the size of that window, +In the RD model, the functional form of the running variable control and the size of that window, often referred to as the choice of \textbf{bandwidth} for the design, are the critical parameters for the result.\cite{calonico2019regression} Therefore, RD analysis often includes extensive robustness checking From 1a0adb619df6f3c13fc9a8c4e3854556ffd97e87 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 27 Jan 2020 17:52:47 -0500 Subject: [PATCH 434/854] event study --- chapters/research-design.tex | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index a37ec5fcd..e5b4c1cc2 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -389,7 +389,8 @@ \subsection{Regression discontinuity} and their post-treatment outcomes are therefore directly comparable.\cite{lee2010regression} The key assumption here is that the running variable cannot be directly manipulated by the potential recipients. -If the running variable is time there are special considerations.\cite{hausman2018regression} +If the running variable is time (what is commonly called an ``event study''), +there are special considerations.\cite{hausman2018regression} Regression discontinuity designs are, once implemented, very similar in analysis to cross-sectional or difference-in-differences designs. From 3bfbe88787430768198ebd409234ce704026d9c6 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 27 Jan 2020 17:57:38 -0500 Subject: [PATCH 435/854] RD guide --- chapters/research-design.tex | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index e5b4c1cc2..c4616d242 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -389,7 +389,7 @@ \subsection{Regression discontinuity} and their post-treatment outcomes are therefore directly comparable.\cite{lee2010regression} The key assumption here is that the running variable cannot be directly manipulated by the potential recipients. 
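As a hedged illustration of this workflow, the community-contributed \texttt{rdrobust} and \texttt{rdplot} commands are one common toolset; the file name, variables, and cutoff below are hypothetical, and the bandwidth and polynomial options shown correspond to the robustness checks discussed in this subsection.

    * Sketch of a regression discontinuity analysis (hypothetical names; cutoff at 50)
    * rdrobust and rdplot are community-contributed (for example, ssc install rdrobust)
    use "data/rd-sample.dta", clear
    rdplot   outcome score, c(50)            // visual evidence around the cutoff
    rdrobust outcome score, c(50)            // local effect with a data-driven bandwidth
    rdrobust outcome score, c(50) h(5)       // robustness check: manually chosen bandwidth
    rdrobust outcome score, c(50) p(2)       // robustness check: higher-order polynomial control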
-If the running variable is time (what is commonly called an ``event study''), +If the running variable is time (what is commonly called an ``event study''), there are special considerations.\cite{hausman2018regression} Regression discontinuity designs are, once implemented, @@ -413,7 +413,8 @@ \subsection{Regression discontinuity} are the critical parameters for the result.\cite{calonico2019regression} Therefore, RD analysis often includes extensive robustness checking using a variety of both functional forms and bandwidths, -as well as placebo testing for non-realized locations of the cutoff. +as well as placebo testing for non-realized locations of the cutoff.\sidenote{ + \url{https://www.mdrc.org/sites/default/files/RDD\%20Guide\_Full\%20rev\%202016\_0.pdf}} In the analytical stage, regression discontinuity designs often include a large component of visual evidence presentation.\sidenote{ From c391d74b640883d270b2769a0cdcc020aeeb4377 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 27 Jan 2020 17:58:35 -0500 Subject: [PATCH 436/854] built-in packages --- chapters/research-design.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index c4616d242..930824e53 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -482,7 +482,7 @@ \subsection{Instrumental variables} In practice, there are a variety of packages that can be used to analyse data and report results from instrumental variables designs. While the built-in Stata command \texttt{ivregress} will often be used -to create the final results, these are not sufficient on their own. +to create the final results, the built-in packages are not sufficient on their own. The \textbf{first stage} of the design should be extensively tested, to demonstrate the strength of the relationship between the instrument and the treatment variable being instrumented.\cite{stock2005weak} From ef3fc1d83a72c3ff57d37ede14d0f1de73496d5b Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 27 Jan 2020 17:59:40 -0500 Subject: [PATCH 437/854] young --- chapters/research-design.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 930824e53..10c86884c 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -490,7 +490,7 @@ \subsection{Instrumental variables} \url{https://www.carolinpflueger.com/WangPfluegerWeakivtest_20141202.pdf}} Additionally, tests should be run that identify and exclude individual observations or clusters that have extreme effects on the estimator, -using customized bootstrap or leave-one-out approaches. 
+using customized bootstrap or leave-one-out approaches.\cite{young2017consistency} Finally, bounds can be constructed allowing for imperfections in the exogeneity of the instrument using loosened assumptions, particularly when the underlying instrument is not directly randomized.\sidenote{ From 9c246ff0972bc5952d781771a1981c506d0862ce Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 27 Jan 2020 18:01:07 -0500 Subject: [PATCH 438/854] strata --- chapters/research-design.tex | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 10c86884c..736bceaa4 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -517,7 +517,8 @@ \subsection{Matching} When matching is performed before a randomization process, it can be done on any observable characteristics, including outcomes, if they are available. -The randomization should then record an indicator for the matching group. +The randomization should then record an indicator for each matching set, +as these become equivalent to randomization strata and require controls in analysis. This approach is stratification taken to its most extreme: it reduces the number of potential randomizations dramatically from the possible number that would be available From 0e678e4d8bd80092f4d42a0d614e60c158df99ce Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 27 Jan 2020 18:03:29 -0500 Subject: [PATCH 439/854] PSM --- chapters/research-design.tex | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 736bceaa4..724561504 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -528,7 +528,9 @@ \subsection{Matching} it is based on the assertion that within the matched groups, the assignment of treatment is as good as random. 
However, since most matching models rely on a specific linear model,
-such as the typical \textbf{propensity score matching} estimator,
+such as \textbf{propensity score matching},\sidenote{
+ \textbf{Propensity Score Matching (PSM):} An estimation method that controls for the likelihood
+ that each unit of observation would receive treatment as predicted by observable characteristics.}
they are open to the criticism of ``specification searching'',
meaning that researchers can try different models of matching
until one, by chance, leads to the final result that was desired;
From 07dd863f739c31f85262e54881e75aa8d5454f72 Mon Sep 17 00:00:00 2001
From: Benjamin Daniels
Date: Mon, 27 Jan 2020 18:03:59 -0500
Subject: [PATCH 440/854] or not to be

---
 chapters/research-design.tex | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/research-design.tex b/chapters/research-design.tex
index 724561504..bd3ee1e5e 100644
--- a/chapters/research-design.tex
+++ b/chapters/research-design.tex
@@ -501,7 +501,7 @@ \subsection{Instrumental variables}
\subsection{Matching}

\textbf{Matching} methods use observable characteristics of individuals
-to directly construct treatment and control groups as similar as possible
+to directly construct treatment and control groups to be as similar as possible
to each other, either before a randomization process
or after the collection of non-randomized data.\sidenote{
\url{https://dimewiki.worldbank.org/wiki/Matching}}
From 1534b33df02db84dd3a6f140b60755994a3d873d Mon Sep 17 00:00:00 2001
From: Benjamin Daniels
Date: Mon, 27 Jan 2020 18:05:18 -0500
Subject: [PATCH 441/854] Update chapters/research-design.tex

Co-Authored-By: Maria
---
 chapters/research-design.tex | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/research-design.tex b/chapters/research-design.tex
index bd3ee1e5e..52082c5c3 100644
--- a/chapters/research-design.tex
+++ b/chapters/research-design.tex
@@ -562,7 +562,7 @@ \subsection{Matching}
%-----------------------------------------------------------------------------------------------
\subsection{Synthetic controls}

-\textbf{Synthetic control} is a relatively newer method
+\textbf{Synthetic control} is a relatively new method
for the case when appropriate counterfactual individuals
do not exist in reality and there are very few (often only one) treatment units.\cite{abadie2015comparative}
\index{synthetic controls}
From fa922de7df148f33f51d430a250f2f57c2fac5ec Mon Sep 17 00:00:00 2001
From: Benjamin Daniels
Date: Mon, 27 Jan 2020 18:05:41 -0500
Subject: [PATCH 442/854] Update chapters/research-design.tex

Co-Authored-By: Maria
---
 chapters/research-design.tex | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/research-design.tex b/chapters/research-design.tex
index 52082c5c3..86e0e3b70 100644
--- a/chapters/research-design.tex
+++ b/chapters/research-design.tex
@@ -540,7 +540,7 @@ \subsection{Matching}
are designed to remove some of the dependence on linearity.
In all ex-post cases, pre-specification of the exact matching model
can prevent some of the potential criticisms on this front,
-but ex-post matching in general is not regarded as a strong approach.
+but ex-post matching in general is not regarded as a strong identification strategy.
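Anticipating the analysis described next, here are two hedged sketches of how matched designs typically appear in code; the variable names are hypothetical, and Stata's built-in \texttt{teffects} suite is only one of several tools for ex-post matching.

    * (1) Matching before randomization: matched sets act like randomization strata,
    *     so include an indicator for each set when estimating the treatment effect
    regress outcome i.treatment i.match_set

    * (2) Ex-post propensity score matching using the built-in teffects command
    teffects psmatch (outcome) (treatment age income education)

For ex-post matching, pre-specifying the exact covariates and matching algorithm, as recommended above, is what guards against the specification-searching critique.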
Analysis of data from matching designs is relatively straightforward;
the simplest design only requires controls (indicator variables) for each group
From 76e04a5aa75e36378a7475825e589da12ae22fd6 Mon Sep 17 00:00:00 2001
From: Benjamin Daniels
Date: Mon, 27 Jan 2020 18:15:07 -0500
Subject: [PATCH 443/854] analytical dimensions

---
 chapters/sampling-randomization-power.tex | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex
index 9e79223be..062f503ea 100644
--- a/chapters/sampling-randomization-power.tex
+++ b/chapters/sampling-randomization-power.tex
@@ -14,7 +14,7 @@
will be able to make meaningful inferences about,
and randomization analyses simulate counterfactual possibilities
if the events being studied had happened differently.
-These needs are particularly important in the initial phases of development studies --
+These analytical dimensions are particularly important in the initial phases of development research --
typically conducted well before any actual fieldwork occurs --
and often have implications for planning and budgeting.
From 24184a6cddaa68546d0ca1dfff0b336508536d48 Mon Sep 17 00:00:00 2001
From: Luiza
Date: Mon, 27 Jan 2020 18:15:21 -0500
Subject: [PATCH 444/854] [ch6] final para

---
 chapters/data-analysis.tex | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex
index 904671a80..b0a7b99c3 100644
--- a/chapters/data-analysis.tex
+++ b/chapters/data-analysis.tex
@@ -499,6 +499,10 @@ \section{Exporting analysis outputs}
This means it should be easy to read and understand them with only the information they contain.
Make sure labels and notes cover all relevant information, such as sample,
unit of observation, unit of measurement and variable definition.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Checklist:\_Reviewing\_Graphs} \\ \url{https://dimewiki.worldbank.org/wiki/Checklist:\_Submit\_Table}}
-
+If you follow the steps outlined in this chapter, most of the data work involved in the last step of the research process -- publication -- will already be done.
+If you used de-identified data for analysis, publishing the cleaned data set in a trusted repository will allow you to cite your data.
+Some of the documentation produced during cleaning and construction can be published even if the data itself is too sensitive to be released.
+Your analysis code will be organized in a reproducible way, so all that will be needed to release a replication package is a last round of code review.
+This will allow you to focus on what matters: writing up your results into a compelling story.

%------------------------------------------------
From caf0dc088f390e5d2b74011d1c91d953d3bf5624 Mon Sep 17 00:00:00 2001
From: Benjamin Daniels
Date: Mon, 27 Jan 2020 18:15:36 -0500
Subject: [PATCH 445/854] feasibility

---
 chapters/sampling-randomization-power.tex | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex
index 062f503ea..0bc2174e8 100644
--- a/chapters/sampling-randomization-power.tex
+++ b/chapters/sampling-randomization-power.tex
@@ -16,7 +16,7 @@
if the events being studied had happened differently.
These analytical dimensions are particularly important in the initial phases of development research --
typically conducted well before any actual fieldwork occurs --
-and often have implications for planning and budgeting.
+and often have implications for feasibility, planning, and budgeting. Power calculations and randomization inference methods give us the tools to critically and quantitatively assess different From c82b8b110705a550a230c40cdcbe2b1c5efdbde5 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 27 Jan 2020 18:17:14 -0500 Subject: [PATCH 446/854] statistical noise Co-Authored-By: Maria --- chapters/sampling-randomization-power.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 0bc2174e8..f6536b9de 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -30,7 +30,7 @@ creating groups that are not good counterfactuals for each other. Power calculation and randomization inference are the main methods by which these probabilities of error are assessed. -Good experimental design has high \textbf{power} -- a low likelihood that this noise +Good experimental design has high \textbf{power} -- a low likelihood that statistical noise will substantially affect estimates of treatment effects. Not all studies are capable of achieving traditionally high power: From 03c6502cb098b7aeed1a3fa25894f464349bf840 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 27 Jan 2020 18:19:04 -0500 Subject: [PATCH 447/854] correctly Co-Authored-By: Maria --- chapters/sampling-randomization-power.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index f6536b9de..c9450032e 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -54,7 +54,7 @@ \section{Random processes in Stata} The fundamental econometrics behind impact evaluation depends on establishing that the observations in the sample and any experimental treatment assignment processes are truly random. -Therefore, understanding and programming for sampling and randomization +Therefore, understanding and correctly programming for sampling and randomization is essential to ensuring that planned experiments are correctly implemented in the field, so that the results can be interpreted according to the experimental design. From 0c2c157738e8b9b2220cff8d57c7626728fad3e3 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 27 Jan 2020 18:20:11 -0500 Subject: [PATCH 448/854] random processes --- chapters/sampling-randomization-power.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 0bc2174e8..ffd34ecaa 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -58,7 +58,7 @@ \section{Random processes in Stata} is essential to ensuring that planned experiments are correctly implemented in the field, so that the results can be interpreted according to the experimental design. 
-(Note that there are two distinct concepts referred to here by ``randomization'': +(Note that there are two distinct random processes referred to here: the conceptual process of assigning units to treatment arms, and the technical process of assigning random numbers in statistical software, which is a part of all tasks that include a random component.\sidenote{ From 3030909a90cfe0e5fadf4c5c3805135d3b3293b5 Mon Sep 17 00:00:00 2001 From: Luiza Date: Mon, 27 Jan 2020 18:22:38 -0500 Subject: [PATCH 449/854] [ch6] explain the 'explanatory guide' --- chapters/data-analysis.tex | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index b0a7b99c3..748acd52f 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -353,10 +353,11 @@ \section{Indicator construction} Remember to consider keeping related variables together and adding notes to each as necessary. % Documentation -It is wise to start an explanatory guide as soon as you start making changes to the data. +It is wise to start writing a variable dictionary as soon as you begin making changes to the data. Carefully record how specific variables have been combined, recoded, and scaled. -This can be part of a wider discussion with your team about creating protocols for variable definition. -That will guarantee that indicators are defined consistently across projects. +This can be part of a wider discussion with your team about creating protocols for variable definition, which will guarantee that indicators are defined consistently across projects. +When all your final variables have been created, you can use \texttt{iecodebook}'s \texttt{export} subcommand to list all variables in the data set, +and complement it with the variable definitions you wrote during construction to create a concise meta data document. Documentation is an output of construction as relevant as the code and the data. Someone unfamiliar with the project should be able to understand the contents of the analysis data sets, the steps taken to create them, and the decision-making process through your documentation. The construction documentation will complement the reports and notes created during data cleaning. From 3aab9a88b7a75d3b5c2e2e37a0b35939223f5890 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 27 Jan 2020 18:25:28 -0500 Subject: [PATCH 450/854] field realities Co-Authored-By: Maria --- chapters/sampling-randomization-power.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 535a9489d..16a06a76b 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -77,7 +77,7 @@ \section{Random processes in Stata} and the third introduces common varieties encountered in the field. The fourth section discusses more advanced topics that are used to analyze the random processes directly in order to understand their properties. -However, the needs you will encounter in the field will inevitably +However, field realities will inevitably be more complex than anything we present here, and you will need to recombine these lessons to match your project's needs. 
From b1b9396b6b44db03b7842a89582909a56766fa79 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 27 Jan 2020 18:25:44 -0500 Subject: [PATCH 451/854] must Co-Authored-By: Maria --- chapters/sampling-randomization-power.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 16a06a76b..262594476 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -85,7 +85,7 @@ \subsection{Reproducibility in random Stata processes} Reproducibility in statistical programming means that random results can be re-obtained at a future time. -All random methods should be reproducible.\cite{orozco2018make} +All random methods must be reproducible.\cite{orozco2018make} Stata, like most statistical software, uses a \textbf{pseudo-random number generator}. Basically, it has a really long ordered list of numbers with the property that knowing the previous one gives you precisely zero information about the next one. From 4cec68372f89b69ae8e6aee632a6929fcede235e Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 27 Jan 2020 18:26:06 -0500 Subject: [PATCH 452/854] each time Co-Authored-By: Maria --- chapters/sampling-randomization-power.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 262594476..0027c9fe9 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -100,7 +100,7 @@ \subsection{Reproducibility in random Stata processes} In Stata, this is accomplished through three command concepts: \textbf{versioning}, \textbf{sorting}, and \textbf{seeding}. -\textbf{Versioning} means using the same version of the software. +\textbf{Versioning} means using the same version of the software each time you run the random process. If anything is different, the underlying randomization algorithms may have changed, and it will be impossible to recover the original result. In Stata, the \texttt{version} command ensures that the software algorithm is fixed. From eedd8798fb25561988ae1f522ca9a6766864ee96 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 27 Jan 2020 18:28:17 -0500 Subject: [PATCH 453/854] Versioning --- chapters/sampling-randomization-power.tex | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 535a9489d..03f8e4d8b 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -103,10 +103,11 @@ \subsection{Reproducibility in random Stata processes} \textbf{Versioning} means using the same version of the software. If anything is different, the underlying randomization algorithms may have changed, and it will be impossible to recover the original result. -In Stata, the \texttt{version} command ensures that the software algorithm is fixed. -We recommend using \texttt{version 13.1} for backward compatibility; -the algorithm was changed after Stata 14 but its improvements do not matter in practice. -(Note that you will \textit{never} be able to transfer a randomization to another software such as R.) 
+In Stata, the \texttt{version} command ensures that the software algorithm is fixed.\sidenote{ +At the time of writing we recommend using \texttt{version 13.1} for backward compatibility; +the algorithm was changed after Stata 14 but its improvements do not matter in practice.} +Note that you will \textit{never} be able to reproduce a randomization in a different software, +such as moving from Stata to R or vice versa.} The \texttt{ieboilstart} command in \texttt{ietoolkit} provides functionality to support this requirement.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/ieboilstart}} We recommend you use \texttt{ieboilstart} at the beginning of your master do-file.\sidenote{ From 56a618ccc4225d2882f551240899f08d59569eb6 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 27 Jan 2020 18:30:04 -0500 Subject: [PATCH 454/854] seeds --- chapters/sampling-randomization-power.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 640141f56..75ab51d7f 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -132,9 +132,9 @@ \subsection{Reproducibility in random Stata processes} (This link is a just shortcut to request such a random seed on \url{https://www.random.org}.) There are many more seeds possible but this is a large enough set for most purposes. In Stata, \texttt{set seed [seed]} will set the generator to that state. -You should use exactly one seed per randomization process. +You should use exactly one unique, different, and randomly created seed per randomization process. To be clear: you should not set a single seed once in the master do-file, -but instead you should set one in code right before each random process. +but instead you should set a new seed in code right before each random process. The most important thing is that each of these seeds is truly random, so do not use shortcuts such as the current date or a seed you have used before. You will see in the code below that we include the source and timestamp for verification. From d1344c796439cec5b41a9cc278d1067987ad4469 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 27 Jan 2020 18:31:51 -0500 Subject: [PATCH 455/854] random processes --- chapters/sampling-randomization-power.tex | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 75ab51d7f..fe8577862 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -83,7 +83,9 @@ \section{Random processes in Stata} \subsection{Reproducibility in random Stata processes} -Reproducibility in statistical programming means that random results +Any process that includes a random component is a random process, +including sampling, randomization, power calculation simulations, and algorithms like bootstrapping. +Reproducibility in statistical programming means that the outputs of random processes can be re-obtained at a future time. All random methods must be reproducible.\cite{orozco2018make} Stata, like most statistical software, uses a \textbf{pseudo-random number generator}. @@ -138,8 +140,6 @@ \subsection{Reproducibility in random Stata processes} The most important thing is that each of these seeds is truly random, so do not use shortcuts such as the current date or a seed you have used before. 
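Pulling the versioning, sorting, and seeding requirements together, a minimal reproducible assignment might be sketched as follows; the file name, ID variable, and seed are placeholders only.

    * Sketch of a reproducible random assignment (file, ID, and seed are placeholders)
    ieboilstart, version(13.1)          // ietoolkit command that prepares the version setting
    `r(version)'                        // apply it
    use "data/sampling-frame.dta", clear
    isid hhid, sort                     // confirm the ID is unique and enforce a stable sort order
    set seed 287608                     // example only: record the true source and retrieval date here
    generate random    = runiform()
    sort random
    generate treatment = (_n <= _N/2)   // first half of the random order assigned to treatment

A quick way to confirm reproducibility before finalizing such a process is to run it twice from scratch and verify that the two resulting data sets are identical.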
You will see in the code below that we include the source and timestamp for verification. -Any process that includes a random component is a random process, -including sampling, randomization, power calculation simulations, and algorithms like bootstrapping. Other commands may induce randomness in the data or alter the seed without you realizing it, so carefully confirm exactly how your code runs before finalizing it.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Randomization_in_Stata}} From 17174e8ea6ae965b15f0e243e7cee6760ee35c5a Mon Sep 17 00:00:00 2001 From: Luiza Date: Mon, 27 Jan 2020 18:32:57 -0500 Subject: [PATCH 456/854] [ch6] construction vs analysis --- chapters/data-analysis.tex | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 748acd52f..ef9ab7323 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -297,13 +297,14 @@ \section{Indicator construction} So you want to construct indicators for both rounds in the same code, after merging them. % From analysis -Data construction is never a finished process. -It comes ``before'' data analysis only in a limited sense: the construction code must be run before the analysis code. -Typically, however, construction and analysis code are written concurrently. +Ideally, indicator construction is done right after data cleaning, following the pre-analysis plan. +In practice, it's almost never that easy. As you write the analysis, different constructed variables will become necessary, as well as subsets and other alterations to the data. Still, constructing variables in a separate script from the analysis will help you ensure consistency across different outputs. If every script that creates a table starts by loading a data set, subsetting it and manipulating variables, any edits to construction need to be replicated in all scripts. This increases the chances that at least one of them will have a different sample or variable definition. +Therefore, even if construction ends up coming before analysis only in the order the code is run, +it's important to think of them as different steps. % What to do during construction ----------------------------------------- From 82395de8baf1ece3de76b315d191cb10003e0ba9 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 27 Jan 2020 18:33:23 -0500 Subject: [PATCH 457/854] stop noting that --- chapters/sampling-randomization-power.tex | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index fe8577862..1bd861d64 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -58,11 +58,11 @@ \section{Random processes in Stata} is essential to ensuring that planned experiments are correctly implemented in the field, so that the results can be interpreted according to the experimental design. -(Note that there are two distinct random processes referred to here: +There are two distinct random processes referred to here: the conceptual process of assigning units to treatment arms, and the technical process of assigning random numbers in statistical software, which is a part of all tasks that include a random component.\sidenote{ - \url{https://blog.stata.com/2016/03/10/how-to-generate-random-numbers-in-stata/}}) + \url{https://blog.stata.com/2016/03/10/how-to-generate-random-numbers-in-stata/}} Randomization is challenging. 
It is deeply unintuitive for the human brain. ``True'' randomization is also nearly impossible to achieve for computers, @@ -108,13 +108,13 @@ \subsection{Reproducibility in random Stata processes} In Stata, the \texttt{version} command ensures that the software algorithm is fixed.\sidenote{ At the time of writing we recommend using \texttt{version 13.1} for backward compatibility; the algorithm was changed after Stata 14 but its improvements do not matter in practice.} -Note that you will \textit{never} be able to reproduce a randomization in a different software, +You will \textit{never} be able to reproduce a randomization in a different software, such as moving from Stata to R or vice versa.} The \texttt{ieboilstart} command in \texttt{ietoolkit} provides functionality to support this requirement.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/ieboilstart}} We recommend you use \texttt{ieboilstart} at the beginning of your master do-file.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Master_Do-files}} -However, note that testing your do-files without running them +However, testing your do-files without running them via the master do-file may produce different results, since Stata's \texttt{version} setting expires after each time you run your do-files. @@ -159,9 +159,9 @@ \section{Sampling and randomization} play an important role in determining the size of the confidence intervals for any estimates generated from that sample, and therefore our ability to draw conclusions. -(Note that random sampling and random assignment serve different purposes. -Random sampling ensures that you have unbiased population estimates, -and random assignment ensures that you have unbiased treatment estimates.) +Random sampling and random assignment serve different purposes. +Random \textit{sampling} ensures that you have unbiased population estimates, +and random \textit{assignment} ensures that you have unbiased treatment estimates. If you randomly sample or assign a set number of observations from a set frame, there are a large -- but fixed -- number of permutations which you may draw. From bfe3d7021dfda69a039c96446d2b5b0e715c5de6 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 27 Jan 2020 18:52:51 -0500 Subject: [PATCH 458/854] population of interest Co-Authored-By: Maria --- chapters/sampling-randomization-power.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 1bd861d64..e0794b10d 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -432,7 +432,7 @@ \subsection{Power calculations} sampling and randomization, clustering, stratification, and treatment arms quickly becomes very complex. -Furthermore, you should use real data whenever it is available, +Furthermore, you should use real data on the population of interest whenever it is available, or you will have to make assumptions about the distribution of outcomes. 
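For instance, when only distributional assumptions are available, a first-pass parametric calculation with Stata's built-in \texttt{power} command might look like this; every number below is an assumption, not a recommendation.

    * Illustrative parametric power calculations (all values are assumptions)
    * Sample size needed to detect a 0.2 standard deviation effect at 80% power
    power twomeans 0 0.2, sd(1) power(0.8)
    * With 70% takeup the detectable effect is diluted to 0.7 * 0.2 = 0.14,
    * which roughly doubles the required sample
    power twomeans 0 0.14, sd(1) power(0.8)

Simulation-based calculations that draw from real data on the population of interest follow the same logic, replacing the parametric assumptions with resampled draws.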
Together, the concepts of minimum detectable effect From 429717e6c234aa588881ba99a33f4b22e9fc14a6 Mon Sep 17 00:00:00 2001 From: Luiza Date: Mon, 27 Jan 2020 20:00:41 -0500 Subject: [PATCH 459/854] [ch6] reorganize data management section --- chapters/data-analysis.tex | 51 +++++++++++++++++++------------------- 1 file changed, 25 insertions(+), 26 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index ef9ab7323..d6d8e60ae 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -42,29 +42,7 @@ \section{Data management} Smart use of version control also allows you to track how each edit affects other files in the project. -% Task breakdown -We divide the process of turning raw data into analysis data into three stages: -data cleaning, variable construction, and data analysis. -Though they are frequently implemented at the same time, -we find that creating separate scripts and data sets prevents mistakes. -It will be easier to understand this division as we discuss what each stage comprises. -What you should know by now is that each of these stages has well-defined inputs and outputs. -This makes it easier to track tasks across scripts, -and avoids duplication of code that could lead to inconsistent results. -For each stage, there should be a code folder and a corresponding data set. -The names of codes, data sets and outputs for each stage should be consistent, -making clear how they relate to one another. -So, for example, a script called \texttt{clean-section-1} would create -a data set called \texttt{cleaned-section-1}. - -The division of a project in stages also helps the review workflow inside your team. -The code, data and outputs of each of these stages should go through at least one round of code review. -During the code review process, team members should read and run each other's codes. -Doing this at the end of each stage helps prevent the amount of work to be reviewed to become too overwhelming. -Code review is a common quality assurance practice among data scientists. -It helps to keep the level of the outputs high, and is also a great way to learn and improve your code. - -% Folder structure +\subsection{Folder structure} There are many schemes to organize research data. Our preferred scheme reflects the task breakdown just discussed. \index{data organization} @@ -91,7 +69,29 @@ \section{Data management} Additionally, \texttt{iefolder} creates \textbf{master do-files}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Master\_Do-files}} so all project code is reflected in a top-level script. -% Master scripts +\subsection{Task breakdown} +We divide the process of turning raw data into analysis data into three stages: +data cleaning, variable construction, and data analysis. +Though they are frequently implemented at the same time, +we find that creating separate scripts and data sets prevents mistakes. +It will be easier to understand this division as we discuss what each stage comprises. +What you should know by now is that each of these stages has well-defined inputs and outputs. +This makes it easier to track tasks across scripts, +and avoids duplication of code that could lead to inconsistent results. +For each stage, there should be a code folder and a corresponding data set. +The names of codes, data sets and outputs for each stage should be consistent, +making clear how they relate to one another. +So, for example, a script called \texttt{clean-section-1} would create +a data set called \texttt{cleaned-section-1}. 
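A master do-file is what ties this folder structure and naming scheme together in practice. The sketch below is only an outline, and every folder and file name in it is a hypothetical placeholder.

    * Sketch of a master do-file (all folder and file names are illustrative)

    * Root folder: the only line each team member needs to edit
    global projectfolder "C:/Users/yourname/Dropbox/project"

    * One folder per stage
    global cleaning     "${projectfolder}/DataWork/cleaning"
    global construction "${projectfolder}/DataWork/construction"
    global analysis     "${projectfolder}/DataWork/analysis"

    * Switches to run each stage
    local run_cleaning     1
    local run_construction 1
    local run_analysis     1

    if `run_cleaning'     do "${cleaning}/clean-section-1.do"         // creates cleaned-section-1.dta
    if `run_construction' do "${construction}/construct-household.do" // creates constructed-household.dta
    if `run_analysis'     do "${analysis}/main-regressions.do"        // exports the final tables

Because every script is called from one place, anyone on the team can run a single stage, or the full project, without guessing which files it depends on.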
+ +The division of a project in stages also helps the review workflow inside your team. +The code, data and outputs of each of these stages should go through at least one round of code review. +During the code review process, team members should read and run each other's codes. +Doing this at the end of each stage helps prevent the amount of work to be reviewed to become too overwhelming. +Code review is a common quality assurance practice among data scientists. +It helps to keep the level of the outputs high, and is also a great way to learn and improve your code. + +\subsection{Master scripts} Master scripts allow users to execute all the project code from a single file. They briefly describes what each code, and maps the files they require and create. @@ -104,7 +104,7 @@ \section{Data management} and where different files can be found in the project folder. That is, it should contain all the information needed to interact with a project's data work. -% Version control +\subsection{Version control} Finally, everything that can be version-controlled should be. Version control allows you to effectively track code edits, including the addition and deletion of files. @@ -126,7 +126,6 @@ \section{Data management} \section{Data cleaning} -% intro: what is data cleaning ------------------------------------------------- Data cleaning is the first stage of transforming the data you received from the field into data that you can analyze.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning}} The cleaning process involves (1) making the data set easily usable and understandable, and (2) documenting individual data points and patterns that may bias the analysis. From 4c62e83e4fdb44b315f653ed4005e9a6b5cd8beb Mon Sep 17 00:00:00 2001 From: Luiza Date: Mon, 27 Jan 2020 20:48:51 -0500 Subject: [PATCH 460/854] [ch6] subsections to data cleaning --- chapters/data-analysis.tex | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index d6d8e60ae..eec78e8b4 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -139,8 +139,8 @@ \section{Data cleaning} You should use this time to understand the types of responses collected, both within each survey question and across respondents. Knowing your data set well will make it possible to do analysis. +\subsection{De-identification} -% Deidentification ------------------------------------------------------------------ The initial input for data cleaning is the raw data. It should contain only materials that are received directly from the field. They will invariably come in a host of file formats and nearly always contain personally-identifying information.\index{personally-identifying information} @@ -176,7 +176,8 @@ \section{Data cleaning} However, if sensitive information is strictly needed for analysis, the data must be encrypted while performing the tasks described in this chapter. -% Unique ID and data entry corrections --------------------------------------------- +\subsection{Correction of data entry errors} + There are two main cases when the raw data will be modified during data cleaning. The first one is when there are duplicated entries in the data. Ensuring that observations are uniquely and fully identified\sidenote{\url{https://dimewiki.worldbank.org/wiki/ID\_Variable\_Properties}} @@ -201,7 +202,8 @@ \section{Data cleaning} and you should keep a careful record of how they were identified, and how the correct value was obtained. 
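As a sketch of how such corrections can be recorded directly in code, the lines below flag duplicated IDs, apply corrections confirmed with the field team, and then assert that the ID fully and uniquely identifies the data. All IDs, variable names, and dates here are hypothetical; the \texttt{ieduplicates} command in \texttt{iefieldkit} is designed to automate much of this workflow.

    * Sketch of duplicate handling with documented corrections (all values are illustrative)
    use "raw-household-survey.dta", clear

    * Flag entries that share an ID and review them against field reports
    duplicates tag hh_id, gen(duplicated)
    list hh_id submissiondate if duplicated > 0

    * Corrections confirmed with the field team, noting how the correct value was obtained
    replace hh_id = 10452 if key == "uuid:1a2b3c"  // double submission; corrected per supervisor note, 2019-11-02
    drop if key == "uuid:4d5e6f"                   // practice interview reported by the enumerator

    * The script should not run past this point unless each observation is fully identified
    isid hh_id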
-% Data description ------------------------------------------------------------------ +\subsection{Labeling and annotating the raw data} + On average, making corrections to primary data is more time-consuming than when using secondary data. But you should always check for possible issues in any data you are about to use. The last step of data cleaning, however, will most likely still be necessary. @@ -225,9 +227,8 @@ \section{Data cleaning} Finally, any additional information collected only for quality monitoring purposes, such as notes and duration fields, can also be dropped. -% Outputs ----------------------------------------------------------------- +\subsection{Outputs from data cleaning} -% Data set The most important output of data cleaning is the cleaned data set. It should contain the same information as the raw data set, with no changes to data points. From 66e2227de76468ba01a54db0f0582d5be0dd05a5 Mon Sep 17 00:00:00 2001 From: Luiza Date: Mon, 27 Jan 2020 21:12:04 -0500 Subject: [PATCH 461/854] [ch6] documentation --- chapters/data-analysis.tex | 18 +++++++++++++----- 1 file changed, 13 insertions(+), 5 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index eec78e8b4..d55422588 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -248,7 +248,7 @@ \subsection{Outputs from data cleaning} use commands such as \texttt{compress} in Stata to make sure the data is always stored in the most efficient format. -% Documentation +\subsection{Documenting data cleaning} Throughout the data cleaning process, you will need inputs from the field, including enumerator manuals, survey instruments, supervisor notes, and data quality monitoring reports. @@ -260,12 +260,20 @@ \subsection{Outputs from data cleaning} Include in the \texttt{Documentation} folder records of any corrections made to the data, including to duplicated entries, as well as communications from the field where theses issues are reported. -Make sure to also have a record of potentially problematic patterns you noticed -while exploring the data, such as outliers and variables with many missing values. -Be very careful not to include sensitive information in -documentation that is not securely stored, +Be very careful not to include sensitive information in documentation that is not securely stored, or that you intend to release as part of a replication package or data publication. +Another important component of data cleaning documentation are the results of +As clean your data set, take the time to explore the variables in it. +Use tabulations, histograms and density plots to understand the structure of data, +and look for potentially problematic patterns such as outliers, +missing values and distributions that may be caused by data entry errors. +Don't spend time trying to correct data points that were not flagged during data quality monitoring. +Instead, create a record of what you observe, +then use it as a basis to discuss with your team how to address potential issues during data construction. +This material will also be valuable during exploratory data analysis. 
+ + \section{Indicator construction} % What is construction ------------------------------------- From 3d73788098391dc1836653f8b020211e521cfd5b Mon Sep 17 00:00:00 2001 From: Luiza Date: Mon, 27 Jan 2020 21:22:11 -0500 Subject: [PATCH 462/854] [ch6] publishable data --- chapters/data-analysis.tex | 50 ++++++++++++++++++++++---------------- 1 file changed, 29 insertions(+), 21 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index d55422588..d36e966bb 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -227,27 +227,6 @@ \subsection{Labeling and annotating the raw data} Finally, any additional information collected only for quality monitoring purposes, such as notes and duration fields, can also be dropped. -\subsection{Outputs from data cleaning} - -The most important output of data cleaning is the cleaned data set. -It should contain the same information as the raw data set, -with no changes to data points. -It should also be easily traced back to the survey instrument, -and be accompanied by a dictionary or codebook. -Typically, one cleaned data set will be created for each data source, -i.e. per survey instrument. -Each row in the cleaned data set represents one survey entry or unit of observation.\sidenote{\cite{tidy-data}} -If the raw data set is very large, or the survey instrument is very complex, -you may want to break the data cleaning into sub-steps, -and create intermediate cleaned data sets -(for example, one per survey module). -Breaking cleaned data sets into the smallest unit of observation inside a roster -make the cleaning faster and the data easier to handle during construction. -But having a single cleaned data set will help you with sharing and publishing the data. -To make sure this file doesn't get too big to be handled, -use commands such as \texttt{compress} in Stata to make sure the data -is always stored in the most efficient format. - \subsection{Documenting data cleaning} Throughout the data cleaning process, you will need inputs from the field, including enumerator manuals, survey instruments, @@ -274,6 +253,35 @@ \subsection{Documenting data cleaning} This material will also be valuable during exploratory data analysis. +\subsection{The cleaned data set} + +The main output of data cleaning is the cleaned data set. +It should contain the same information as the raw data set, +with no changes to data points. +It should also be easily traced back to the survey instrument, +and be accompanied by a dictionary or codebook. +Typically, one cleaned data set will be created for each data source, +i.e. per survey instrument. +Each row in the cleaned data set represents one survey entry or unit of observation.\sidenote{\cite{tidy-data}} +If the raw data set is very large, or the survey instrument is very complex, +you may want to break the data cleaning into sub-steps, +and create intermediate cleaned data sets +(for example, one per survey module). +Breaking cleaned data sets into the smallest unit of observation inside a roster +make the cleaning faster and the data easier to handle during construction. +But having a single cleaned data set will help you with sharing and publishing the data. +To make sure this file doesn't get too big to be handled, +use commands such as \texttt{compress} in Stata to make sure the data +is always stored in the most efficient format. 
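To make this concrete, the closing lines of a cleaning script might look like the following sketch; the file names and ID variable are hypothetical, and the exact options of \texttt{iecodebook} should be checked against its help file.

    * Sketch of the final lines of a cleaning script (names are illustrative)
    isid hh_id                           // one row per unit of observation
    order hh_id village_id district_id   // put identifying variables first
    compress                             // store each variable in its most efficient type
    label data "Household survey, baseline round (cleaned)"
    save "cleaned-household-baseline.dta", replace

    * Export a codebook listing every variable, its label, and its value labels
    iecodebook export using "cleaned-household-baseline.xlsx"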
+Once you have a cleaned, de-identified data set, and documentation to support it, +you have created the first data output of your project: +a publishable data set. +The next chapter will get into the details of data publication. +For now, all you need to know is that your team should consider submitting the data set for publication at this point, +even if it will remain embargoed for some time. +This will help you organize your files and create a back up of the data, +and some donors require that the data be filed as an intermediate step of the project. + \section{Indicator construction} % What is construction ------------------------------------- From d32c7a5326e9712328b5becc769962e888b89b3a Mon Sep 17 00:00:00 2001 From: Luiza Date: Mon, 27 Jan 2020 21:29:01 -0500 Subject: [PATCH 463/854] [ch6] construction and PAP --- chapters/data-analysis.tex | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index d36e966bb..6cf3d9013 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -228,6 +228,7 @@ \subsection{Labeling and annotating the raw data} such as notes and duration fields, can also be dropped. \subsection{Documenting data cleaning} + Throughout the data cleaning process, you will need inputs from the field, including enumerator manuals, survey instruments, supervisor notes, and data quality monitoring reports. @@ -252,7 +253,6 @@ \subsection{Documenting data cleaning} then use it as a basis to discuss with your team how to address potential issues during data construction. This material will also be valuable during exploratory data analysis. - \subsection{The cleaned data set} The main output of data cleaning is the cleaned data set. @@ -285,10 +285,11 @@ \subsection{The cleaned data set} \section{Indicator construction} % What is construction ------------------------------------- -Data construction is the process of processing the data points as provided in the raw data to make them suitable for analysis. +The second stage in the creation of analysis data is construction. +Constructing variables means processing the data points as provided in the raw data to make them suitable for analysis. It is at this stage that the raw data is transformed into analysis data. -This is done by creating derived variables -(binaries, indices, and interactions, to name a few). +This is done by creating derived variables (binaries, indices, and interactions, to name a few), +as planned during research design, and using the pre-analysis plan as a guide. To understand why construction is necessary, let's take the example of a household survey's consumption module. It will result in separate variables indicating the From a4da1825f04fef1fb98af2dcf6740d4d80a138d2 Mon Sep 17 00:00:00 2001 From: Luiza Date: Mon, 27 Jan 2020 21:34:45 -0500 Subject: [PATCH 464/854] [ch6] construction subsections --- chapters/data-analysis.tex | 32 ++++++++++++++++---------------- 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 6cf3d9013..73d5e8c7c 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -323,7 +323,7 @@ \section{Indicator construction} Therefore, even if construction ends up coming before analysis only in the order the code is run, it's important to think of them as different steps. 
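As a sketch of what a small construction script can look like for the consumption example above, the lines below turn item-level records into household-level totals. Every file name, variable name, and conversion factor is a hypothetical placeholder.

    * Sketch of a construction script (all names and factors are illustrative)
    use "cleaned-consumption-module.dta", clear   // one row per household and item

    * Convert reported quantities to a common unit before aggregating
    gen kcal_item = quantity * unit_to_kg * kcal_per_kg

    * Aggregate from the unit of observation (item) to the unit of analysis (household)
    collapse (sum) kcal_total = kcal_item (sum) food_expenditure = expenditure, by(hh_id)

    label variable kcal_total       "Total weekly caloric intake (kcal)"
    label variable food_expenditure "Total weekly food expenditure"

    save "constructed-household-consumption.dta", replace

Keeping this step in its own script, with its own output, is what makes it possible to treat construction and analysis as separate stages.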
- +\subsection{Construction tasks} % What to do during construction ----------------------------------------- Keep in mind that details matter when constructing variables, and overlooking them may affect your results. It is important to check and double-check the value-assignments of questions and their scales before constructing new variables using them. @@ -349,7 +349,20 @@ \section{Indicator construction} % Outputs ----------------------------------------------------------------- -% Data set +\subsection{Documenting indicators construction} + +It is wise to start writing a variable dictionary as soon as you begin making changes to the data. +Carefully record how specific variables have been combined, recoded, and scaled. +This can be part of a wider discussion with your team about creating protocols for variable definition, which will guarantee that indicators are defined consistently across projects. +When all your final variables have been created, you can use \texttt{iecodebook}'s \texttt{export} subcommand to list all variables in the data set, +and complement it with the variable definitions you wrote during construction to create a concise meta data document. +Documentation is an output of construction as relevant as the code and the data. +Someone unfamiliar with the project should be able to understand the contents of the analysis data sets, the steps taken to create them, and the decision-making process through your documentation. +The construction documentation will complement the reports and notes created during data cleaning. +Together, they will form a detailed account of the data processing. + +\subsection{Constructed data sets} + The outputs of construction are the data sets that will be used for analysis. The level of observation of a constructed data set is the unit analysis. Each data set is purpose-built to answer an analysis question. @@ -362,7 +375,7 @@ \section{Indicator construction} Think of an agricultural intervention that was randomized across villages and only affected certain plots within each village. The research team may want to run household-level regressions on income, test for plot-level productivity gains, and check if village characteristics are balanced. Having three separate datasets for each of these three pieces of analysis will result in much cleaner do files than if they all started from the same file. - + One thing all constructed data sets should have in common, though, are functionally-named variables. Constructed variables are called ``constructed'' because they were not present in the survey to start with, so making their names consistent with the survey form is not as crucial. @@ -370,19 +383,6 @@ \section{Indicator construction} However, functionality should be prioritized here. Remember to consider keeping related variables together and adding notes to each as necessary. -% Documentation -It is wise to start writing a variable dictionary as soon as you begin making changes to the data. -Carefully record how specific variables have been combined, recoded, and scaled. -This can be part of a wider discussion with your team about creating protocols for variable definition, which will guarantee that indicators are defined consistently across projects. -When all your final variables have been created, you can use \texttt{iecodebook}'s \texttt{export} subcommand to list all variables in the data set, -and complement it with the variable definitions you wrote during construction to create a concise meta data document. 
-Documentation is an output of construction as relevant as the code and the data. -Someone unfamiliar with the project should be able to understand the contents of the analysis data sets, the steps taken to create them, and the decision-making process through your documentation. -The construction documentation will complement the reports and notes created during data cleaning. -Together, they will form a detailed account of the data processing. - - - %------------------------------------------------ \section{Writing data analysis code} From 0602589e915172695ec1728659cbe003ad446aa1 Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 28 Jan 2020 10:04:48 -0500 Subject: [PATCH 465/854] [ch6] construction with primary panel adta --- chapters/data-analysis.tex | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 73d5e8c7c..6fc619b48 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -348,6 +348,15 @@ \subsection{Construction tasks} Because data construction involves translating concrete data points to more abstract measurements, it is important to document exactly how each variable is derived or calculated. % Outputs ----------------------------------------------------------------- +When dealing with primary panel data, is it common to construct indicators soon after receiving data from a new survey round. +However, creating indicators for each round separately increases the risk of using different definitions every time. +Having a well-established definition for each constructed variable helps prevent that mistake, +but the best way to guarantee it won't happen is to create the indicators for all rounds in the same script. +Say you constructed variables after baseline, and are now receiving midline data. +Then the first thing you should do is create a panel data set +-- \{texttt{iecodebook}'s \texttt{append} subcommand will help you reconcile and append survey rounds. +After that, adapt the construction code so it can be used on the panel data set. +Apart from preventing inconsistencies, this process will also save you time and give you an opportunity to review your original code. \subsection{Documenting indicators construction} From daf5e9196581dc85f1798c61f4d6c6abf9202b38 Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 28 Jan 2020 10:09:29 -0500 Subject: [PATCH 466/854] [ch6] construction with primary panel adta (2) --- chapters/data-analysis.tex | 2 ++ 1 file changed, 2 insertions(+) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 6fc619b48..3be7718a7 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -349,6 +349,8 @@ \subsection{Construction tasks} % Outputs ----------------------------------------------------------------- When dealing with primary panel data, is it common to construct indicators soon after receiving data from a new survey round. +Dealing with primary panel data also has its own complexities. +It is common to construct indicators soon after receiving data from a new survey round. However, creating indicators for each round separately increases the risk of using different definitions every time. Having a well-established definition for each constructed variable helps prevent that mistake, but the best way to guarantee it won't happen is to create the indicators for all rounds in the same script. 
From 092fe91b7e166ef6dfe82618d45fb5a6c1fa2de2 Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 28 Jan 2020 10:45:23 -0500 Subject: [PATCH 467/854] [ch6] construction example --- chapters/data-analysis.tex | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 3be7718a7..43f8f88e8 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -288,18 +288,17 @@ \section{Indicator construction} The second stage in the creation of analysis data is construction. Constructing variables means processing the data points as provided in the raw data to make them suitable for analysis. It is at this stage that the raw data is transformed into analysis data. -This is done by creating derived variables (binaries, indices, and interactions, to name a few), +This is done by creating derived variables (dummies, indices, and interactions, to name a few), as planned during research design, and using the pre-analysis plan as a guide. To understand why construction is necessary, let's take the example of a household survey's consumption module. -It will result in separate variables indicating the -amount of each item in the bundle that was consumed. -There may be variables indicating the cost of these items. -You cannot run a meaningful regression on these variables. -You need to manipulate them into something that has \textit{economic} meaning. +For each item in a context-specific bundle, it will ask whether the household consumed any of it over a certain period of time. +If they did, it will then ask about quantities, units and expenditure for each item. +However, it is difficult to run a meaningful regression on the number of cups of milk and handfuls of beans that a household consumed over a week. +You need to manipulate them into something that has \textit{economic} meaning, +such as caloric input or food expenditure per adult equivalent. During this process, the data points will typically be reshaped and aggregated -so that level of the data set goes from the unit of observation in the survey to the unit of analysis.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Unit\_of\_Observation}} -To use the same example, the data on quantity consumed was collect for each item, and needs to be aggregated to the household level before analysis. +so that level of the data set goes from the unit of observation (one item in the bundle) in the survey to the unit of analysis (the household).\sidenote{\url{https://dimewiki.worldbank.org/wiki/Unit\_of\_Observation}} % Why it is a separate process ------------------------------- From a74346f587f6762440839f3852d584c1eb987ec7 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 28 Jan 2020 10:57:14 -0500 Subject: [PATCH 468/854] types --- chapters/research-design.tex | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 86e0e3b70..350ae7c4f 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -66,13 +66,13 @@ \section{Causality, inference, and identification} so you can calculate and interpret those estimates appropriately. All the study designs we discuss here use the potential outcomes framework\cite{athey2017state} to compare a group that received some treatment to another, counterfactual group. 
-Each of these approaches can be used in two contexts: +Each of these approaches can be used in two types of designs: \textbf{experimental} designs, in which the research team is directly responsible for creating the variation in treatment, and \textbf{quasi-experimental} designs, in which the team identifies a ``natural'' source of variation and uses it for identification. -Neither approach is implicitly better or worse, -and both are capable of achieving effect identification under different contexts. +Neither type is implicitly better or worse, +and both types are capable of achieving effect identification under different contexts. %----------------------------------------------------------------------------------------------- \subsection{Estimating treatment effects using control groups} From 6c3596a73c198fd3dbef7dc8425eb84e30e0d5fa Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 28 Jan 2020 10:58:16 -0500 Subject: [PATCH 469/854] causal inference --- chapters/research-design.tex | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 350ae7c4f..566a15fa3 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -153,7 +153,8 @@ \subsection{Experimental and quasi-experimental research designs} \url{https://dimewiki.worldbank.org/wiki/Experimental_Methods}} often in the form of government programs, NGO projects, new regulations, information campaigns, and many more types of interventions.\cite{banerjee2009experimental} -The classic experimental method is the \textbf{randomized control trial (RCT)}.\sidenote{ +The classic experimental causal inference method +is the \textbf{randomized control trial (RCT)}.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Randomized_Control_Trials}} \index{randomized control trials} In randomized control trials, the control group is randomized -- From 5b30a1ec618a8c03b819100cb3bf97cbb5a47285 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 28 Jan 2020 11:03:08 -0500 Subject: [PATCH 470/854] time and space --- chapters/research-design.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 566a15fa3..a87ac38af 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -392,6 +392,8 @@ \subsection{Regression discontinuity} by the potential recipients. If the running variable is time (what is commonly called an ``event study''), there are special considerations.\cite{hausman2018regression} +Similarly, spatial discontinuity designs are handled a bit differently due to their multidimensionality.\sidenote{ + \url{https://blogs.worldbank.org/impactevaluations/spatial-jumps}} Regression discontinuity designs are, once implemented, very similar in analysis to cross-sectional or difference-in-differences designs. @@ -407,8 +409,6 @@ \subsection{Regression discontinuity} for the running variable -- meaning that the treatment effect variable will only be applicable for observations in a small window around the cutoff and the treatment effects estimated will be ``local'' rather than ``average''. 
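As an illustration, a simple discontinuity estimate might be sketched as below; the variable names, the cutoff of 50, and the ten-point window are hypothetical, and the community-contributed \texttt{rdrobust} command is one common way to choose these parameters in a data-driven way.

    * Sketch of a regression discontinuity estimate (names, cutoff, and window are illustrative)
    use "constructed-rd-sample.dta", clear

    gen eligible = (score < 50)          // assignment below a hypothetical cutoff of 50
    gen score_c  = score - 50            // running variable centered at the cutoff

    * Local linear regression within a ten-point window, slopes allowed to differ by side
    regress outcome i.eligible##c.score_c if abs(score_c) <= 10, vce(robust)

    * Data-driven bandwidth and polynomial choice
    rdrobust outcome score, c(50)

The width of that window matters a great deal, as discussed next.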
-(Spatial discontinuity designs are handled a bit differently due to their multidimensionality.\sidenote{ - \url{https://blogs.worldbank.org/impactevaluations/spatial-jumps}}) In the RD model, the functional form of the running variable control and the size of that window, often referred to as the choice of \textbf{bandwidth} for the design, are the critical parameters for the result.\cite{calonico2019regression} From 04b2122c31534843121d4ee7b62e6fb7bdb36da5 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 28 Jan 2020 11:04:33 -0500 Subject: [PATCH 471/854] LATE 2: THE LATENING --- chapters/research-design.tex | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index a87ac38af..9d2207b18 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -406,9 +406,9 @@ \subsection{Regression discontinuity} i.e., contingent on whether data has one or more time periods and whether the same units are known to be observed repeatedly. The treatment effect will be identified, however, by the addition of a control -for the running variable -- meaning that the treatment effect variable -will only be applicable for observations in a small window around the cutoff -and the treatment effects estimated will be ``local'' rather than ``average''. +for the running variable -- meaning that the treatment effect estimate +will only be applicable for observations in a small window around the cutoff: +in the lingo, the treatment effects estimated will be ``local'' rather than ``average''. In the RD model, the functional form of the running variable control and the size of that window, often referred to as the choice of \textbf{bandwidth} for the design, are the critical parameters for the result.\cite{calonico2019regression} From a4c757d26357323a9b3c7f540da07858e1457cf9 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 28 Jan 2020 11:19:50 -0500 Subject: [PATCH 472/854] replication --- chapters/publication.tex | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 57c934977..747f187dc 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -21,10 +21,10 @@ These represent an intellectual contribution in their own right, because they enable others to learn from your process and better understand the results you have obtained. -Holding code and data to the same standards as written work -is a new discipline for many researchers, -and here we provide some basic guidelines and responsibilities for that process -that will help you prepare a functioning and informative replication package. +Holding code and data to the same standards a written work +is a new practice for many researchers. +In this chapter, we provide guidelines that will help you +prepare a functioning and informative replication package. In all cases, we note that technology is rapidly evolving and that the specific tools noted here may not remain cutting-edge, but the core principles involved in publication and transparency will endure. 
From 6fb890a2b810471425df0c32e5212a8f30ccca1d Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 28 Jan 2020 11:21:02 -0500 Subject: [PATCH 473/854] shameless self-promotion --- chapters/publication.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 747f187dc..0fb9f1e67 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -101,7 +101,7 @@ \subsection{Dynamic documents} However, the most widely utilized software for dynamically managing both text and results is \LaTeX\ (pronounced ``lah-tek'').\sidenote{ - \url{https://www.maths.tcd.ie/~dwilkins/LaTeXPrimer/GSWLaTeX.pdf}} + \url{https://github.com/worldbank/DIME-LaTeX-Templates}} \index{\LaTeX} Rather than using a coding language that is built for another purpose or trying to hide the code entirely, From 98191aa9185fed90c7238504b0549f77e8ba7b8b Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 28 Jan 2020 11:22:12 -0500 Subject: [PATCH 474/854] [ch6] construction reorg --- chapters/data-analysis.tex | 129 ++++++++++++++++++++----------------- 1 file changed, 70 insertions(+), 59 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 43f8f88e8..a0ca307d9 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -300,55 +300,70 @@ \section{Indicator construction} During this process, the data points will typically be reshaped and aggregated so that level of the data set goes from the unit of observation (one item in the bundle) in the survey to the unit of analysis (the household).\sidenote{\url{https://dimewiki.worldbank.org/wiki/Unit\_of\_Observation}} -% Why it is a separate process ------------------------------- +\subsection{Why construction?} % From cleaning Construction is done separately from data cleaning for two reasons. -The first one is to clear differentiation between the data originally collected and the result of data processing decisions. +The first one is to clearly differentiate the data originally collected from the result of data processing decisions. The second is to ensure that variable definition is consistent across data sources. Unlike cleaning, construction can create many outputs from many inputs. Let's take the example of a project that has a baseline and an endline survey. Unless the two instruments are exactly the same, which is preferable but often not the case, the data cleaning for them will require different steps, and therefore will be done separately. However, you still want the constructed variables to be calculated in the same way, so they are comparable. -So you want to construct indicators for both rounds in the same code, after merging them. +To do this, you will at least two cleaning scripts, and a single one for construction -- +we will discuss how to do this in practice in a bit. % From analysis -Ideally, indicator construction is done right after data cleaning, following the pre-analysis plan. -In practice, it's almost never that easy. -As you write the analysis, different constructed variables will become necessary, as well as subsets and other alterations to the data. +Ideally, indicator construction should be done right after data cleaning, according to the pre-analysis plan. +In practice, however, following this principle is not always easy. +As you analyze the data, different constructed variables will become necessary, as well as subsets and other alterations to the data. 
Still, constructing variables in a separate script from the analysis will help you ensure consistency across different outputs. If every script that creates a table starts by loading a data set, subsetting it and manipulating variables, any edits to construction need to be replicated in all scripts. This increases the chances that at least one of them will have a different sample or variable definition. Therefore, even if construction ends up coming before analysis only in the order the code is run, it's important to think of them as different steps. -\subsection{Construction tasks} -% What to do during construction ----------------------------------------- -Keep in mind that details matter when constructing variables, and overlooking them may affect your results. -It is important to check and double-check the value-assignments of questions and their scales before constructing new variables using them. -Are they in percentages or proportions? -Are all variables you are combining into an index or average using the same scale? -Are yes or no questions coded as 0 and 1, or 1 and 2? +\subsection{Construction tasks and how to approach them} + +The first thing that comes to mind when we talk about variable construction is, of course, creating new variables. +Do this by adding new variables to the data set instead of overwriting the original information, and assign functional names to them. +During cleaning, you want to keep all variables consistent with the survey instrument. +But constructed variables were not present in the survey to start with, +so making their names consistent with the survey form is not as crucial. +Of course, whenever possible, having variable names that are both intuitive \textit{and} can be linked to the survey is ideal, but if you need to choose, prioritize functionality. +Ordering the data set so that related variables are together and adding notes to each of them as necessary will also make your data set more user-friendly. + +The most simple case of new variables to be created are aggregate indicators. +For example, you may want to add a household's income from different sources into a single total income variable, or create a dummy for having at least one child in school. +Jumping to the step where you actually create this variables seems intuitive, +but it can also cause you a lot of problems, as overlooking details may affect your results. +It is important to check and double-check the value-assignments of questions and their scales before constructing new variables based on them. This is when you will use the knowledge of the data you acquired and the documentation you created during the cleaning step the most. -It is often useful to start looking at comparisons and other documentation -outside the code editor. +It is often useful to start looking at comparisons and other documentation outside the code editor. -Adding comments to the code explaining what you are doing and why is crucial here. -There are always ways for things to go wrong that you never anticipated, but two issues to pay extra attention to are missing values and dropped observations. -If you are subsetting a data set, you should drop observations explicitly, indicating why you are doing that and how the data set changed. -Merging, reshaping and aggregating data sets can change both the total number of observations and the number of observations with missing values. 
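For example, a handful of \texttt{assert} statements around a merge can catch these problems automatically; the file and variable names below are hypothetical.

    * Sketch of automated checks around a merge (names are illustrative)
    use "constructed-household.dta", clear
    count
    local n_households = r(N)

    merge m:1 village_id using "village-treatment.dta"
    assert _merge != 1                 // every household must match a village record
    drop if _merge == 2                // villages with no surveyed households, dropped explicitly
    drop _merge

    count
    assert r(N) == `n_households'      // the household count should be unchanged

    count if missing(treatment)
    assert r(N) == 0                   // treatment status must be defined for every household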
-Make sure to read about how each command treats missing observations and, whenever possible, add automated checks in the script that throw an error message if the result is changing. +Make sure to standardize units and recode categorical variables so their values are consistent. +It's possible that your questionnaire asked respondents to report some answers as percentages and others as proportions, +or that in one variable 0 means "no" and 1 means "yes", while in another one the same answers were coded are 1 and 2. +We recommend coding yes/no questions as either 1/0 or TRUE/FALSE, so they can be used numerically as frequencies in means and as dummies in regressions. +Check that non-binary categorical variables have the same value-assignment, i.e., +that labels and levels have the same correspondence across variables that use the same options. +Finally, make sure that any numeric variables you are comparing are converted to the same scale or unit of measure. You cannot add one hectare and twos acres into a meaningful number. -At this point, you will also need to address some of the issues in the data that you identified during data cleaning. +During construction, you will also need to address some of the issues you identified in the data during data cleaning. The most common of them is the presence of outliers. How to treat outliers is a research question, but make sure to note what we the decision made by the research team, and how you came to it. Results can be sensitive to the treatment of outliers, so keeping the original variable in the data set will allow you to test how much it affects the estimates. -More generally, create derived measures in new variables instead of overwriting the original information. -Because data construction involves translating concrete data points to more abstract measurements, it is important to document exactly how each variable is derived or calculated. +All these points also apply to imputation of missing values and other distributional patterns. -% Outputs ----------------------------------------------------------------- -When dealing with primary panel data, is it common to construct indicators soon after receiving data from a new survey round. -Dealing with primary panel data also has its own complexities. +The more complex construction tasks involve changing the structure of the data: +adding new observations or variables by merging data sets, +and changing the unit of observation through collapses or reshapes. +There are always ways for things to go wrong that we never anticipated, but two issues to pay extra attention to are missing values and dropped observations. +Merging, reshaping and aggregating data sets can change both the total number of observations and the number of observations with missing values. +Make sure to read about how each command treats missing observations and, whenever possible, add automated checks in the script that throw an error message if the result is changing. +If you are subsetting your data, drop observations explicitly, indicating why you are doing that and how the data set changed. + +Finally, primary panel data involves additional timing complexities. It is common to construct indicators soon after receiving data from a new survey round. However, creating indicators for each round separately increases the risk of using different definitions every time. 
Having a well-established definition for each constructed variable helps prevent that mistake, @@ -361,8 +376,10 @@ \subsection{Construction tasks} \subsection{Documenting indicators construction} -It is wise to start writing a variable dictionary as soon as you begin making changes to the data. -Carefully record how specific variables have been combined, recoded, and scaled. +Because data construction involves translating concrete data points to more abstract measurements, it is important to document exactly how each variable is derived or calculated. +Adding comments to the code explaining what you are doing and why is a crucial step both to prevent mistakes and to guarantee transparency. +To make sure that these comments can be more easily navigated, it is wise to start writing a variable dictionary as soon as you begin making changes to the data. +Carefully record how specific variables have been combined, recoded, and scaled, and refer to those records in the code. This can be part of a wider discussion with your team about creating protocols for variable definition, which will guarantee that indicators are defined consistently across projects. When all your final variables have been created, you can use \texttt{iecodebook}'s \texttt{export} subcommand to list all variables in the data set, and complement it with the variable definitions you wrote during construction to create a concise meta data document. @@ -373,26 +390,16 @@ \subsection{Documenting indicators construction} \subsection{Constructed data sets} -The outputs of construction are the data sets that will be used for analysis. -The level of observation of a constructed data set is the unit analysis. -Each data set is purpose-built to answer an analysis question. -Since different pieces of analysis may require different samples, -or even different units of observation, -you may have one or multiple constructed data sets, -depending on how your analysis is structured. +The other set of construction outputs, as expected, consists of the data sets that will be used for analysis. +A constructed data set is built to answer an analysis question. +Since different pieces of analysis may require different samples, or even different units of observation, +you may have one or multiple constructed data sets, depending on how your analysis is structured. So don't worry if you cannot create a single, ``canonical'' analysis data set. It is common to have many purpose-built analysis datasets. Think of an agricultural intervention that was randomized across villages and only affected certain plots within each village. The research team may want to run household-level regressions on income, test for plot-level productivity gains, and check if village characteristics are balanced. Having three separate datasets for each of these three pieces of analysis will result in much cleaner do files than if they all started from the same file. -One thing all constructed data sets should have in common, though, are functionally-named variables. -Constructed variables are called ``constructed'' because they were not present in the survey to start with, -so making their names consistent with the survey form is not as crucial. -Of course, whenever possible, having variables names that are both intuitive and can be linked to the survey is ideal. -However, functionality should be prioritized here. -Remember to consider keeping related variables together and adding notes to each as necessary. 
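For the agricultural example above, the three purpose-built data sets might be created along the lines of the following sketch, in which every file and variable name is a hypothetical placeholder.

    * Sketch of purpose-built constructed data sets (all names are illustrative)
    use "cleaned-plot-level.dta", clear

    * Plot-level data: productivity outcomes with village treatment status attached
    merge m:1 village_id using "village-treatment.dta", keep(match) nogenerate
    save "constructed-plot-level.dta", replace

    * Household-level data: plot income aggregated to the household
    collapse (sum) plot_income, by(hh_id village_id)
    merge m:1 village_id using "village-treatment.dta", keep(match) nogenerate
    label variable plot_income "Total household income from plots"
    save "constructed-household-level.dta", replace

    * Village-level data: average household income, for balance checks
    collapse (mean) mean_hh_income = plot_income, by(village_id)
    merge 1:1 village_id using "village-treatment.dta", keep(match) nogenerate
    save "constructed-village-level.dta", replace

Each analysis script then loads exactly one of these files, which is what keeps the do-files in the next section short and legible.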
- %------------------------------------------------ \section{Writing data analysis code} @@ -410,12 +417,12 @@ \section{Writing data analysis code} Instead, we will outline the structure of writing analysis code, assuming you have completed the process of data cleaning and construction. -% Exploratory and final data analysis ----------------------------------------- +\subsection{Organizing analysis code} + The analysis stage usually starts with a process we call exploratory data analysis. This is when you are trying different things and looking for patterns in your data. It progresses into final analysis when your team starts to decide what are the main results, those that will make it into the research output. The way you deal with code and outputs for exploratory and final analysis is different, and this section will discuss how. -% Organizing scripts --------------------------------------------------------- During exploratory data analysis, you will be tempted to write lots of analysis into one big, impressive, start-to-finish script. It subtly encourages poor practices such as not clearing the workspace and not reloading the constructed data set before each analysis task. It's important to take the time to organize scripts in a clean manner and to avoid mistakes. @@ -435,7 +442,9 @@ \section{Writing data analysis code} This is a good way to make sure specifications are consistent throughout the analysis. Using pre-specified globals or objects also makes your code more dynamic, so it is easy to update specifications and results without changing every script. It is completely acceptable to have folders for each task, and compartmentalize each analysis as much as needed. -It is always better to have more code files open than to keep scrolling inside a given file. +It is better to have more code files open than to keep scrolling inside a given file. + +\subsection{Exporting outputs} To accomplish this, you will need to make sure that you have an effective data management system, including naming, file organization, and version control. Just like you did with each of the analysis datasets, name each of the individual analysis files descriptively. @@ -445,20 +454,7 @@ \section{Writing data analysis code} leave this to near publication time. % Self-promotion ------------------------------------------------ -Our team has created a few products to automate common outputs and save you -precious research time. -The \texttt{ietoolkit} package includes two commands to export nicely formatted tables. -\texttt{iebaltab}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Iebaltab}} creates and exports balance tables to excel or {\LaTeX}. -\texttt{ieddtab}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Ieddtab}} does the same for difference-in-differences regressions. -The \textbf{Stata Visual Library}\sidenote{ \url{https://worldbank.github.io/Stata-IE-Visual-Library/}} -has examples of graphs created in Stata and curated by us.\sidenote{A similar resource for R is \textit{The R Graph Gallery}. \\\url{https://www.r-graph-gallery.com/}} -\textbf{Data visualization}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_visualization}} \index{data visualization} -is increasingly popular, but a great deal lacks in quality.\cite{healy2018data,wilke2019fundamentals} -We attribute some of this to the difficulty of writing code to create them. -Making a visually compelling graph would already be hard enough if you didn't have to go through many rounds of googling to understand a command. 
-The trickiest part of using plot commands is to get the data in the right format. -This is why the \textbf{Stata Visual Library} includes example data sets to use -with each do-file. + Whole books have been written on how to create good data visualizations, so we will not attempt to give you advice on it. @@ -486,6 +482,21 @@ \section{Writing data analysis code} \section{Exporting analysis outputs} +Our team has created a few products to automate common outputs and save you +precious research time. +The \texttt{ietoolkit} package includes two commands to export nicely formatted tables. +\texttt{iebaltab}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Iebaltab}} creates and exports balance tables to excel or {\LaTeX}. +\texttt{ieddtab}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Ieddtab}} does the same for difference-in-differences regressions. +The \textbf{Stata Visual Library}\sidenote{ \url{https://worldbank.github.io/Stata-IE-Visual-Library/}} +has examples of graphs created in Stata and curated by us.\sidenote{A similar resource for R is \textit{The R Graph Gallery}. \\\url{https://www.r-graph-gallery.com/}} +\textbf{Data visualization}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_visualization}} \index{data visualization} +is increasingly popular, but a great deal lacks in quality.\cite{healy2018data,wilke2019fundamentals} +We attribute some of this to the difficulty of writing code to create them. +Making a visually compelling graph would already be hard enough if you didn't have to go through many rounds of googling to understand a command. +The trickiest part of using plot commands is to get the data in the right format. +This is why the \textbf{Stata Visual Library} includes example data sets to use +with each do-file. + It's ok to not export each and every table and graph created during exploratory analysis. Final analysis scripts, on the other hand, should export final outputs, which are ready to be included to a paper or report. No manual edits, including formatting, should be necessary after exporting final outputs -- From 59c0b527ba910f600326cde8bbec40677612b93e Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 28 Jan 2020 11:23:50 -0500 Subject: [PATCH 475/854] bibtex scholar --- chapters/publication.tex | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 0fb9f1e67..8b5c199d8 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -149,10 +149,11 @@ \subsection{Technical writing with \LaTeX} (such as superscripts, inline references, etc.) as well as how the bibliography should be styled and in what order (such as Chicago, MLA, Harvard, or other common styles). -To obtain the references for the \texttt{.bib} file, -you can copy the specification directly from Google Scholar -by clicking ``BibTeX'' at the bottom of the Cite window. -When pasted into the \texttt{.bib} file they look like the following: +This tool is so widely used that it is natively integrated in Google Scholar. +To obtain a reference in the \texttt{.bib} format for any paper you find, +click ``BibTeX'' at the bottom of the Cite window (below the preformatted options). +Then, copy the code directly from Google Scholar into your \texttt{.bib} file. 
+They will look like the following: \codeexample{sample.bib}{./code/sample.bib} From 80ad7f6559caeea43f26c66cdba5499cd3712dc4 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 28 Jan 2020 11:26:43 -0500 Subject: [PATCH 476/854] data ownership --- chapters/publication.tex | 18 ++++++++++++------ 1 file changed, 12 insertions(+), 6 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 8b5c199d8..e906b7b05 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -276,13 +276,19 @@ \subsection{Publishing data for replication} and test alternative approaches to other questions. Therefore you should make clear in your study where and how data are stored, and how and under what circumstances it might be accessed. -You do not always have to complete the data publication data yourself, -as long as you cite or otherwise directly reference data that you cannot release. -Even if you think your raw data is owned by someone else, +You do not always have to publish the data yourself, +and in some cases you are legally not allowed to, +but what matters is that the data is published +(with or without access restrictions) +and that you cite or otherwise directly reference all data, +even data that you cannot release. +When your raw data is owned by someone else, +or for any other reason you are not able to publish it, in many cases you will have the right to release -at least some subset of your analytical dataset or the indicators you constructed. -Check with the data supplier or other professional about licensing questions, -particularly your right to publish derivative materials. +at least some subset of your constructed data set, +even if it is just the derived indicators you constructed. +If you have questions about your rights over original or derived materials, +check with the legal team at your organization or at the data provider's. You should only directly publish data which is fully de-identified and, to the extent required to ensure reasonable privacy, potentially identifying characteristics are further masked or removed. From 177316c4719c315084f2e95de24af00e1c4ff69b Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 28 Jan 2020 11:30:39 -0500 Subject: [PATCH 477/854] repositories --- chapters/publication.tex | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index e906b7b05..54171fac3 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -417,14 +417,15 @@ \subsection{Releasing a replication package} you can change or remove the contents at any time. A repository such as the Harvard Dataverse\sidenote{ \url{https://dataverse.harvard.edu}} -addresses these issues, as it is designed to be a citable code repository. +addresses these issues, as it is designed to be a citable data repository; +the IPA/J-PAL field experiment repository is especially relevant.\sidenote{ + \url{https://www.povertyactionlab.org/blog/9-11-19/new-hub-data-randomized-evaluations}} The Open Science Framework\sidenote{ \url{https://osf.io}} -also provides a balanced implementation -that holds both code and data (as well as simple version histories), -as does ResearchGate\sidenote{ +can also hold both code and data, +as can ResearchGate.\sidenote{ \url{https://https://www.researchgate.net}} -(both of which can also assign a permanent digital object identifier link for your work). 
+Some of these will also assign a permanent digital object identifier (DOI) link for your work. Any of these locations is acceptable -- the main requirement is that the system can handle the structured directory that you are submitting, @@ -432,13 +433,12 @@ \subsection{Releasing a replication package} and report exactly what, if any, modifications you have made since initial publication. You can even combine more than one tool if you prefer, as long as they clearly point to each other. -Emerging technologies such as CodeOcean\sidenote{ +Emerging technologies such as the ``containerization'' approach of Docker or CodeOcean\sidenote{ \url{https://codeocean.com}} offer to store both code and data, -and also provide an online workspace in which others -can execute and modify your code +and also provide an online workspace in which others can execute and modify your code without having to download your tools and match your local environment -when packages and other underlying softwares may have changed since publication. +when packages and other underlying software may have changed since publication. In addition to code and data, you may also want to release an author's copy or preprint From 43b19d12da22e76d861f6eb3c004a4decf6f60d0 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 28 Jan 2020 11:58:52 -0500 Subject: [PATCH 478/854] =?UTF-8?q?=F0=9F=8F=85?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- chapters/research-design.tex | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 9d2207b18..e087c6124 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -166,7 +166,8 @@ \subsection{Experimental and quasi-experimental research designs} if they had not been treated, and it is particularly effective at doing so as evidenced by its broad credibility in fields ranging from clinical medicine to development. Therefore RCTs are very popular tools for determining the causal impact -of specific programs or policy interventions. +of specific programs or policy interventions.\sidenote{ + \url{https://www.nobelprize.org/prizes/economic-sciences/2019/ceremony-speech/}} However, there are many other types of interventions that are impractical or unethical to effectively approach using an experimental strategy, and therefore there are limitations to accessing ``big questions'' From ebec7dc64aa11df443b792e40808f8affbcf1db3 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 28 Jan 2020 12:10:29 -0500 Subject: [PATCH 479/854] analysis rewrite --- chapters/research-design.tex | 21 +++++++++++---------- 1 file changed, 11 insertions(+), 10 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index e087c6124..27a2609ef 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -271,20 +271,21 @@ \subsection{Cross-sectional designs} \url{https://blogs.worldbank.org/impactevaluations/should-we-require-balance-t-tests-baseline-observables-randomized-experiments}} Analysis is typically straightforward \textit{once you have a strong understanding of the randomization}. 
-A typical analysis will include a description of the sampling and randomization process, -summary statistics for the eligible population, -balance checks for randomization and sample selection, -a primary regression specification (with multiple hypotheses appropriately adjusted), -additional specifications with adjustments for non-response, balance, and other potential contamination, -and randomization-inference analysis or other placebo regression approaches. -There are a number of tools that are also available -to help with the complete process of data collection,\sidenote{ +A typical analysis will include a description of the sampling and randomization results, +with analyses such as summary statistics for the eligible population, +and balance checks for randomization and sample selection. +The main results will usually be a primary regression specification +(with multiple hypotheses appropriately adjusted for), +and additional specifications with adjustments for non-response, balance, and other potential contamination. +Robustness checks might include randomization-inference analysis or other placebo regression approaches. +There are a number of user-written code tools that are also available +to help with the complete process of data analysis,\sidenote{ \url{https://toolkit.povertyactionlab.org/resource/coding-resources-randomized-evaluations}} -to analyze balance,\sidenote{ +including to analyze balance\sidenote{ \url{https://dimewiki.worldbank.org/wiki/iebaltab}} and to visualize treatment effects.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/iegraph}} -Tools and methods for analyzing selective non-response are available.\sidenote{ +Extensive tools and methods for analyzing selective non-response are available.\sidenote{ \url{https://blogs.worldbank.org/impactevaluations/dealing-attrition-field-experiments}} %----------------------------------------------------------------------------------------------- From 7fdc00ca97007901446832de89a40784a3c351b3 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 28 Jan 2020 12:13:18 -0500 Subject: [PATCH 480/854] pre-trends --- chapters/research-design.tex | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 27a2609ef..666c9aa42 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -322,7 +322,8 @@ \subsection{Difference-in-differences} \url{https://blogs.worldbank.org/impactevaluations/often-unspoken-assumptions-behind-difference-difference-estimator-practice}} Experimental approaches satisfy this requirement in expectation, but a given randomization should still be checked for pre-trends -as an extension of balance checking. +as an extension of balance checking.\sidenote{ + \url{https://blogs.worldbank.org/impactevaluations/revisiting-difference-differences-parallel-trends-assumption-part-i-pre-trend}} There are two main types of data structures for differences-in-differences: \textbf{repeated cross-sections} and \textbf{panel data}. 
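
To make this concrete, the following is a minimal difference-in-differences sketch in Stata for the two data structures just described. It is a simplified illustration rather than code from the book: the dataset, variable, and cluster names are hypothetical placeholders, and the actual specification and adjustments should follow your design and pre-analysis plan.

\begin{verbatim}
* Difference-in-differences sketch -- all names below are hypothetical
use "constructed_analysis_data.dta", clear

* (1) Repeated cross-sections: interact assignment with the post period
regress outcome i.treatment##i.post_period, vce(cluster village_id)

* (2) Panel data: household fixed effects absorb the treatment main effect
xtset household_id survey_round
xtreg outcome i.treatment#i.post_period i.post_period, ///
    fe vce(cluster village_id)
\end{verbatim}

The same structure extends to a simple pre-trend check: interacting the treatment indicator with each pre-treatment round, rather than with a single post-period dummy, should produce interaction terms that are statistically indistinguishable from zero.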
From 56941c406c46660c75cfd8a4ca7d3bc3423dae52 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 28 Jan 2020 12:19:21 -0500 Subject: [PATCH 481/854] single unit --- chapters/research-design.tex | 1 + 1 file changed, 1 insertion(+) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 666c9aa42..9255331fd 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -571,6 +571,7 @@ \subsection{Synthetic controls} do not exist in reality and there are very few (often only one) treatment units.\cite{abadie2015comparative} \index{synthetic controls} For example, state- or national-level policy changes +that can only be analyzed as a single unit are typically very difficult to find valid comparators for, since the set of potential comparators is usually small and diverse and therefore there are no close matches to the treated unit. From d1d5ba16f9b66defb203b4b8e58b0b7583a30a69 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 28 Jan 2020 12:30:55 -0500 Subject: [PATCH 482/854] counterfactual --- chapters/research-design.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 9255331fd..329eef0f6 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -87,7 +87,7 @@ \subsection{Estimating treatment effects using control groups} \index{average treatment effect} This is the parameter that most research designs attempt to estimate. Their goal is to establish a \textbf{counterfactual}\sidenote{ - \textbf{Counterfactual:} A statistical description of what would have happened to specific individuals in an alternative scenario.} + \textbf{Counterfactual:} A statistical description of what would have happened to specific individuals in an alternative scenario, for example, a different treatment assignment outcome.} for the treatment group with which outcomes can be directly compared. \index{counterfactual} There are several resources that provide more or less mathematically intensive From 300e3a6b90d0002627e16d8d94d2bec54c0c6d2a Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 28 Jan 2020 12:43:33 -0500 Subject: [PATCH 483/854] Accept suggestion MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- chapters/sampling-randomization-power.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index e0794b10d..aa932613c 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -20,7 +20,7 @@ Power calculations and randomization inference methods give us the tools to critically and quantitatively assess different -sampling and randomization designs in light of our theories of impact +sampling and randomization designs in light of our theories of change and to make optimal choices when planning studies. All random processes introduce statistical noise or uncertainty into the final estimates of effect sizes. 
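
The power calculations referred to here can be approached by simulation, as later sections of this chapter discuss. The sketch below repeatedly draws a simple experiment and records how often an assumed effect is detected; the sample size, effect size, seed, and program name are hypothetical assumptions made only for this example.

\begin{verbatim}
* Power by simulation -- sample size, effect size, and seed are assumed
capture program drop power_sim
program define power_sim, rclass
    clear
    set obs 1000                                // assumed sample size
    gen treatment = (runiform() < 0.5)          // random assignment
    gen outcome   = 0.2*treatment + rnormal()   // assumed effect of 0.2 SD
    regress outcome treatment
    return scalar reject = ///
        (abs(_b[treatment] / _se[treatment]) > invnormal(0.975))
end

set seed 650289
simulate reject = r(reject), reps(500) nodots: power_sim
summarize reject    // the mean of -reject- is the simulated power
\end{verbatim}

Looping the same simulation over a grid of sample sizes, effect sizes, or design choices is what turns a power calculation from a single reported number into a practical design tool.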
From c1833fcd35537cd78a297cbb412c3981d9cf8918 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 28 Jan 2020 12:45:10 -0500 Subject: [PATCH 484/854] truly random --- chapters/sampling-randomization-power.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index e0794b10d..4474cda35 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -52,8 +52,8 @@ \section{Random processes in Stata} Most experimental designs rely directly on random processes, particularly sampling and randomization, to be executed in code. The fundamental econometrics behind impact evaluation -depends on establishing that the observations in the sample -and any experimental treatment assignment processes are truly random. +depends on establishing that the sampling +and treatment assignment processes are truly random. Therefore, understanding and correctly programming for sampling and randomization is essential to ensuring that planned experiments are correctly implemented in the field, so that the results From f4a26e93cd799657fead5d8a8442889fe9ca7627 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 28 Jan 2020 12:46:39 -0500 Subject: [PATCH 485/854] tiny human brain --- chapters/sampling-randomization-power.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 2e963ef6f..5861510cc 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -64,7 +64,7 @@ \section{Random processes in Stata} which is a part of all tasks that include a random component.\sidenote{ \url{https://blog.stata.com/2016/03/10/how-to-generate-random-numbers-in-stata/}} -Randomization is challenging. It is deeply unintuitive for the human brain. +Randomization is challenging and its mechanics are unintuitive for the human brain. ``True'' randomization is also nearly impossible to achieve for computers, which are inherently deterministic.\sidenote{ \url{https://www.random.org/randomness/}} From 8ce50a2f974ced9310b22a85f5890742cd8d6a7e Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 28 Jan 2020 12:59:02 -0500 Subject: [PATCH 486/854] live reveal --- chapters/sampling-randomization-power.tex | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 5861510cc..b3885421d 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -249,11 +249,11 @@ \subsection{Randomization} that is described in the experimental design, and fill in any gaps in the process before implmenting it in Stata. -Some types of experimental designs necessitate that randomization be done live in the field. +Some types of experimental designs necessitate that randomization results be revealed during data collection. It is possible to do this using survey software or live events. These methods typically do not leave a record of the randomization, -so particularly when the experiment is electronic, -it is best to execute the randomization in advance and preload the results if possible. +so particularly when the experiment is done as part of data collection, +it is best to execute the randomization in advance and preload the results. 

 Even when randomization absolutely cannot be done in advance,
 it is still useful to build a corresponding model of the randomization process in Stata
 so that you can conduct statistical analysis later

From 1f8b8c4631e1f49179b2c9377eb39fa70a5e5535 Mon Sep 17 00:00:00 2001
From: Benjamin Daniels
Date: Tue, 28 Jan 2020 13:03:30 -0500
Subject: [PATCH 487/854] classroom

---
 chapters/sampling-randomization-power.tex | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex
index b3885421d..4ac97a0b5 100644
--- a/chapters/sampling-randomization-power.tex
+++ b/chapters/sampling-randomization-power.tex
@@ -290,8 +290,8 @@ \subsection{Clustering}
 \url{https://dimewiki.worldbank.org/wiki/Multi-stage_(Cluster)_Sampling}}
 and the groups in which units are assigned to treatment are called clusters.
 The same principle extends to sampling:
-it may be infeasible to decide whether to test individual children
-within a single classroom, for example.
+it may be necessary to observe all the children
+in a given teacher's classroom, for example.
 
 Clustering is procedurally straightforward in Stata,
 although it typically needs to be performed manually.

From cf86c3e29859cabd64cb30c9b8a67705cb82daa4 Mon Sep 17 00:00:00 2001
From: Benjamin Daniels
Date: Tue, 28 Jan 2020 13:12:04 -0500
Subject: [PATCH 488/854] clustering

---
 chapters/sampling-randomization-power.tex | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex
index 4ac97a0b5..f29466bce 100644
--- a/chapters/sampling-randomization-power.tex
+++ b/chapters/sampling-randomization-power.tex
@@ -293,11 +293,10 @@ \subsection{Clustering}
 it may be necessary to observe all the children
 in a given teacher's classroom, for example.
 
-Clustering is procedurally straightforward in Stata,
-although it typically needs to be performed manually.
-To cluster a sampling or randomization,
-create or use a data set where each cluster unit is an observation,
-randomize on that data set, and then merge back the results.
+Clustered sampling and randomization are straightforward in Stata.
+To sample or randomize at the cluster level,
+create or use a data set where each cluster unit is a row,
+sample or randomize on that data set, and then merge back the results.
 When sampling or randomization is conducted using clusters,
 the clustering variable should be clearly identified
 since it will need to be used in subsequent statistical analysis.

From 0792acb25f8371a00d25b220ae1b714e01e6a517 Mon Sep 17 00:00:00 2001
From: Benjamin Daniels
Date: Tue, 28 Jan 2020 13:18:30 -0500
Subject: [PATCH 489/854] focus

---
 chapters/sampling-randomization-power.tex | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex
index f29466bce..5bbea06c2 100644
--- a/chapters/sampling-randomization-power.tex
+++ b/chapters/sampling-randomization-power.tex
@@ -448,8 +448,9 @@ \subsection{Power calculations}
 It also helps you to untangle design issues before they occur.
 Therefore, simulation-based power analysis is often more
 of a design aid than an output for reporting requirements.
-At the end of the day, you will probably have reduced
-the complexity of your experiment significantly. 
+At the end of the day, power calculations will typically suggest
+more efficient and focused study designs
+with better power to answer your key questions.
 For reporting purposes, such as grant writing and registered reports,
 simulation ensures you will have understood the key questions well enough
 to report standard measures of power once your design is decided.

From 5e594f55fd4ac1a107946629c3b2f1c4e1be2aba Mon Sep 17 00:00:00 2001
From: Benjamin Daniels
Date: Tue, 28 Jan 2020 13:20:41 -0500
Subject: [PATCH 490/854] treatment

---
 chapters/sampling-randomization-power.tex | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex
index 5bbea06c2..ec2d0c4c5 100644
--- a/chapters/sampling-randomization-power.tex
+++ b/chapters/sampling-randomization-power.tex
@@ -458,7 +458,7 @@ \subsection{Randomization inference}
 \subsection{Randomization inference}
 
 Randomization inference is used to analyze the likelihood
-that the randomization process, by chance,
+that the random treatment assignment process, by chance,
 would have created a false treatment effect as large as the one you observed.
 Randomization inference is a generalization of placebo tests,
 because it considers what the estimated results would have been

From 43c92a172aa85d880957fe677c8abf99cb45b2f9 Mon Sep 17 00:00:00 2001
From: Benjamin Daniels
Date: Tue, 28 Jan 2020 13:23:08 -0500
Subject: [PATCH 491/854] re-randomization

---
 chapters/sampling-randomization-power.tex | 1 -
 1 file changed, 1 deletion(-)

diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex
index ec2d0c4c5..f8c1640c3 100644
--- a/chapters/sampling-randomization-power.tex
+++ b/chapters/sampling-randomization-power.tex
@@ -276,7 +276,6 @@ \section{Clustering and stratification}
 They allow us to control the randomization process with high precision,
 which is often necessary for appropriate inference,
 particularly when samples or subgroups are small.\cite{athey2017econometrics}
-(By contrast, re-randomizing or resampling are never appropriate for this.\cite{bruhn2009pursuit})
 These techniques can be used in any random process;
 their implementation is nearly identical in both sampling and randomization.

From 51947a084096d37e89455c4da40721a678ca1919 Mon Sep 17 00:00:00 2001
From: Luiza
Date: Tue, 28 Jan 2020 13:25:54 -0500
Subject: [PATCH 492/854] [ch 4] Staying out of higher-level, ongoing
 statistics discussions

---
 chapters/sampling-randomization-power.tex | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex
index f8c1640c3..54e66365a 100644
--- a/chapters/sampling-randomization-power.tex
+++ b/chapters/sampling-randomization-power.tex
@@ -292,10 +292,11 @@ \subsection{Clustering}
 it may be necessary to observe all the children
 in a given teacher's classroom, for example.
 
-Clustered sampling and randomization are straightforward in Stata.
-To sample or randomize at the cluster level,
-create or use a data set where each cluster unit is a row,
-sample or randomize on that data set, and then merge back the results.
+Clustering is procedurally straightforward in Stata,
+although it typically needs to be performed manually.
+To cluster a sampling or randomization,
+create or use a data set where each cluster unit is an observation,
+randomize on that data set, and then merge back the results. 
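
The cluster-level procedure described just above (build a data set with one row per cluster, randomize on it, and then merge the assignment back) can be sketched in a few lines of Stata. The file names, ID variables, and seed below are hypothetical placeholders, not part of the book's code.

\begin{verbatim}
* Cluster-level randomization sketch -- all names and the seed are hypothetical
version 13.1                       // fix the random-number algorithm
set seed 287103                    // document one seed per random process
use "household_listing.dta", clear

preserve
    keep village_id                // the cluster unit
    duplicates drop                // one row per cluster
    isid village_id, sort          // verify uniqueness and fix the sort order
    gen random_draw = runiform()
    sort random_draw
    gen treatment = (_n <= _N/2)   // assign half of the clusters
    tempfile cluster_assignments
    save `cluster_assignments'
restore

merge m:1 village_id using `cluster_assignments', nogen
\end{verbatim}

The one-row-per-cluster file created in the middle of this sketch also doubles as a permanent record of the randomization if it is saved to disk rather than to a temporary file.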
When sampling or randomization is conducted using clusters, the clustering variable should be clearly identified since it will need to be used in subsequent statistical analysis. @@ -447,9 +448,8 @@ \subsection{Power calculations} It also helps you to untangle design issues before they occur. Therefore, simulation-based power analysis is often more of a design aid than an output for reporting requirements. -At the end of the day, power calculations will typically suggest -more efficient and focused study designs -with better power to answer your key questions. +At the end of the day, you will probably have reduced +the complexity of your experiment significantly. For reporting purposes, such as grant writing and registered reports, simulation ensures you will have understood the key questions well enough to report standard measures of power once your design is decided. @@ -457,7 +457,7 @@ \subsection{Power calculations} \subsection{Randomization inference} Randomization inference is used to analyze the likelihood -that the random treatment assignment process, by chance, +that the randomization process, by chance, would have created a false treatment effect as large as the one you observed. Randomization inference is a generalization of placebo tests, because it considers what the estimated results would have been From de44d9158b691768e48b27b6069327e863940564 Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 28 Jan 2020 13:28:57 -0500 Subject: [PATCH 493/854] typo --- chapters/sampling-randomization-power.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 54e66365a..47590d7a3 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -107,7 +107,7 @@ \subsection{Reproducibility in random Stata processes} and it will be impossible to recover the original result. In Stata, the \texttt{version} command ensures that the software algorithm is fixed.\sidenote{ At the time of writing we recommend using \texttt{version 13.1} for backward compatibility; -the algorithm was changed after Stata 14 but its improvements do not matter in practice.} +the algorithm was changed after Stata 14 but its improvements do not matter in practice. You will \textit{never} be able to reproduce a randomization in a different software, such as moving from Stata to R or vice versa.} The \texttt{ieboilstart} command in \texttt{ietoolkit} provides functionality to support this requirement.\sidenote{ From 6bb743e249266ee50dae9c9b19ec8b9bae0bd4fd Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 28 Jan 2020 13:37:36 -0500 Subject: [PATCH 494/854] [ch5] naming fields during questionnaire design --- chapters/data-collection.tex | 1 + 1 file changed, 1 insertion(+) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index cf6e6848b..ffdb1cde3 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -69,6 +69,7 @@ \subsection{Questionnaire design for quantitative analysis} From a data perspective, questions with pre-coded response options are always preferable to open-ended questions (the content-based pilot is an excellent time to ask open-ended questions, and refine responses for the final version of the questionnaire). Coding responses helps to ensure that the data will be useful for quantitative analysis. Two examples help illustrate the point. 
First, instead of asking ``How do you feel about the proposed policy change?'', use techniques like \textbf{Likert scales}\sidenote{\textbf{Likert scale:} an ordered selection of choices indicating the respondent's level of agreement or disagreement with a proposed statement.}. Second, if collecting data on medication use or supplies, you could collect: the brand name of the product; the generic name of the product; the coded compound of the product; or the broad category to which each product belongs (antibiotic, etc.). All four may be useful for different reasons, but the latter two are likely to be the most useful for data analysis. The coded compound requires providing a translation dictionary to field staff, but enables automated rapid recoding for analysis with no loss of information. The generic class requires agreement on the broad categories of interest, but allows for much more comprehensible top-line statistics and data quality checks. Rigorous field testing is required to ensure that answer categories are comprehensive; however, it is best practice to include an \textit{other, specify} option. Keep track of those reponses in the first few weeks of fieldwork; adding an answer category for a response frequently showing up as \textit{other} can save time, as it avoids post-coding. +It is useful to name the fields in your paper questionnaire in a way that will also work in the data analysis software. There is debate over how individual questions should be identified: formats like \texttt{hq\_1} are hard to remember and unpleasant to reorder, but formats like \texttt{hq\_asked\_about\_loans} quickly become cumbersome. We recommend using descriptive names with clear prefixes so that variables within a module stay together when sorted From 8f91e80157b8a668e3c91035c4882809af6c0c7b Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 28 Jan 2020 13:38:27 -0500 Subject: [PATCH 495/854] [ch5] code font in variable names --- chapters/data-collection.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index ffdb1cde3..2a0987b98 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -80,7 +80,7 @@ \subsection{Questionnaire design for quantitative analysis} question numbering, as it discourages re-ordering, which is a common recommended change after the pilot. In the case of follow-up surveys, numbering can quickly become convoluted, too often resulting in variables names like -'ag\_15a', 'ag\_15\_new', 'ag\_15\_fup2', etc. +\texttt{ag\_15a}, \texttt{ag\_15\_new}, \texttt{ag\_15\_fup2}, etc. Questionnaires must include ways to document the reasons for \textbf{attrition}, treatment \textbf{contamination}, and \textbf{loss to follow-up}. \index{attrition}\index{contamination} From 8e17ec4014da178f5fc7803b70e352e14d1d9283 Mon Sep 17 00:00:00 2001 From: Maria Date: Tue, 28 Jan 2020 13:39:27 -0500 Subject: [PATCH 496/854] Update chapters/data-collection.tex Co-Authored-By: Luiza Andrade --- chapters/data-collection.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 2a0987b98..4e4f515fd 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -164,7 +164,7 @@ \subsection{High frequency checks} It is important to check every day that the units interviewed match the survey sample. 
Many survey software include case management features, through which sampled units are directly assigned to individual enumerators. Even with careful management, it is often the case that raw data includes duplicate entries, which may occur due to field errors or duplicated submissions to the server. -\sidenote{\url{https://dimewiki.worldbank.org/wiki/Duplicates_and_Survey_Logs}} +Even with careful management, it is often the case that raw data includes duplicate entries, which may occur due to field errors or duplicated submissions to the server.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Duplicates_and_Survey_Logs}} \texttt{ieduplicates}\sidenote{\url{https://dimewiki.worldbank.org/wiki/ieduplicates}} provides a workflow for collaborating on the resolution of duplicate entries between you and the field team. Next, observed units in the data must be validated against the expected sample: From c0b31efeaaa173a0a5354e311506dcc5da265564 Mon Sep 17 00:00:00 2001 From: Maria Date: Tue, 28 Jan 2020 13:39:48 -0500 Subject: [PATCH 497/854] Update chapters/data-collection.tex Co-Authored-By: Luiza Andrade --- chapters/data-collection.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 4e4f515fd..f60aadb91 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -168,7 +168,7 @@ \subsection{High frequency checks} \texttt{ieduplicates}\sidenote{\url{https://dimewiki.worldbank.org/wiki/ieduplicates}} provides a workflow for collaborating on the resolution of duplicate entries between you and the field team. Next, observed units in the data must be validated against the expected sample: -this is as straightforward as \texttt{merging} the sample list with the survey data and checking for mismatches. +this is as straightforward as merging the sample list with the survey data and checking for mismatches. Reporting errors and duplicate observations in real-time allows the field team to make corrections efficiently. Tracking survey progress is important for monitoring attrition, so that it is clear early on if a change in protocols or additional tracking will be needed. It is also important to check interview completion rate and sample compliance by surveyor and survey team, to identify any under-performing individuals or teams. From 21e66a5ae92cc17ab83d582c19b1b1754ae7b4f9 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 28 Jan 2020 13:40:06 -0500 Subject: [PATCH 498/854] cite CONSORT --- chapters/data-collection.tex | 95 ++++++++++++++++++------------------ 1 file changed, 47 insertions(+), 48 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 4e4f515fd..801ce5bd0 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -39,7 +39,7 @@ \subsection{Content-focused Pilot} The objective is to improve the structure and length of the questionnaire, refine the phrasing and translation of specific questions, and confirm coded response options are exhaustive.\sidenote{\url{https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Content)&printable=yes}} In addition, it is an opportunity to test and refine all survey protocols, such as how units will be sampled or pre-selected units identified. The pilot must be done out-of-sample, but in a context as similar as possible to the study sample. \subsection{Data-focused pilot} -A second survey pilot should be done after the questionnaire is programmed. 
+A second survey pilot should be done after the questionnaire is programmed. The objective of the data-focused pilot\sidenote{\url{https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Data)&printable=yes}} is to validate programming and export a sample dataset. Significant desk-testing of the instrument is required to debug the programming as fully as possible before going to the field. It is important to plan for multiple days of piloting, so that any further debugging or other revisions to the electronic survey instrument can be made at the end of each day and tested the following, until no further field errors arise. @@ -47,8 +47,8 @@ \subsection{Data-focused pilot} \section{Designing electronic questionnaires} -The workflow for designing a questionnaire will feel much like writing an essay, or writing pseudocode: -begin from broad concepts and slowly flesh out the specifics. +The workflow for designing a questionnaire will feel much like writing an essay, or writing pseudocode: +begin from broad concepts and slowly flesh out the specifics. It is essential to start with a clear understanding of the \textbf{theory of change}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Theory_of_Change}} and \textbf{experimental design} for your project. The first step of questionnaire design is to list key outcomes of interest, as well as the main factors to control for (covariates) and variables needed for experimental design. @@ -67,25 +67,24 @@ \subsection{Questionnaire design for quantitative analysis} This book covers surveys designed to yield datasets useful for quantitative analysis. This is a subset of surveys, and there are specific design considerations that will help to ensure the raw data outputs are ready for analysis. From a data perspective, questions with pre-coded response options are always preferable to open-ended questions (the content-based pilot is an excellent time to ask open-ended questions, and refine responses for the final version of the questionnaire). Coding responses helps to ensure that the data will be useful for quantitative analysis. Two examples help illustrate the point. First, instead of asking ``How do you feel about the proposed policy change?'', use techniques like -\textbf{Likert scales}\sidenote{\textbf{Likert scale:} an ordered selection of choices indicating the respondent's level of agreement or disagreement with a proposed statement.}. Second, if collecting data on medication use or supplies, you could collect: the brand name of the product; the generic name of the product; the coded compound of the product; or the broad category to which each product belongs (antibiotic, etc.). All four may be useful for different reasons, but the latter two are likely to be the most useful for data analysis. The coded compound requires providing a translation dictionary to field staff, but enables automated rapid recoding for analysis with no loss of information. The generic class requires agreement on the broad categories of interest, but allows for much more comprehensible top-line statistics and data quality checks. Rigorous field testing is required to ensure that answer categories are comprehensive; however, it is best practice to include an \textit{other, specify} option. Keep track of those reponses in the first few weeks of fieldwork; adding an answer category for a response frequently showing up as \textit{other} can save time, as it avoids post-coding. 
+\textbf{Likert scales}\sidenote{\textbf{Likert scale:} an ordered selection of choices indicating the respondent's level of agreement or disagreement with a proposed statement.}. Second, if collecting data on medication use or supplies, you could collect: the brand name of the product; the generic name of the product; the coded compound of the product; or the broad category to which each product belongs (antibiotic, etc.). All four may be useful for different reasons, but the latter two are likely to be the most useful for data analysis. The coded compound requires providing a translation dictionary to field staff, but enables automated rapid recoding for analysis with no loss of information. The generic class requires agreement on the broad categories of interest, but allows for much more comprehensible top-line statistics and data quality checks. Rigorous field testing is required to ensure that answer categories are comprehensive; however, it is best practice to include an \textit{other, specify} option. Keep track of those reponses in the first few weeks of fieldwork; adding an answer category for a response frequently showing up as \textit{other} can save time, as it avoids post-coding. It is useful to name the fields in your paper questionnaire in a way that will also work in the data analysis software. There is debate over how individual questions should be identified: formats like \texttt{hq\_1} are hard to remember and unpleasant to reorder, but formats like \texttt{hq\_asked\_about\_loans} quickly become cumbersome. -We recommend using descriptive names with clear prefixes so that variables -within a module stay together when sorted +We recommend using descriptive names with clear prefixes so that variables +within a module stay together when sorted alphabetically.\sidenote{\url{https://medium.com/@janschenk/variable-names-in-survey-research-a18429d2d4d8}} - Variable names should never include spaces or mixed cases (all lower case is -best). Take care with the length: very long names will be cut off in certain -software, which could result in a loss of uniqueness. We discourage explicit -question numbering, as it discourages re-ordering, which is a common -recommended change after the pilot. In the case of follow-up surveys, numbering -can quickly become convoluted, too often resulting in variables names like + Variable names should never include spaces or mixed cases (all lower case is +best). Take care with the length: very long names will be cut off in certain +software, which could result in a loss of uniqueness. We discourage explicit +question numbering, as it discourages re-ordering, which is a common +recommended change after the pilot. In the case of follow-up surveys, numbering +can quickly become convoluted, too often resulting in variables names like \texttt{ag\_15a}, \texttt{ag\_15\_new}, \texttt{ag\_15\_fup2}, etc. Questionnaires must include ways to document the reasons for \textbf{attrition}, treatment \textbf{contamination}, and \textbf{loss to follow-up}. \index{attrition}\index{contamination} -These are essential data components for completing CONSORT records, a standardized system for reporting enrollment, intervention allocation, follow-up, and data analysis through the phases of a randomized trial of two groups. -\sidenote[][-3.5cm]{Begg, C., Cho, M., Eastwood, S., Horton, R., Moher, D., Olkin, I., Pitkin, R., Rennie, D., Schulz, K. F., Simel, D., et al. (1996). Improving the quality of reporting of randomized controlled trials: The CONSORT statement. 
\textit{JAMA}, 276(8):637--639} +These are essential data components for completing CONSORT records, a standardized system for reporting enrollment, intervention allocation, follow-up, and data analysis through the phases of a randomized trial of two groups.\cite{begg1996improving} Once the content of the questionnaire is finalized and translated, it is time to proceed with programming the electronic survey instrument. @@ -131,17 +130,17 @@ \subsection{Electronic survey features} \subsection{Compatibility with analysis software} All survey software include debugging and test options to correct syntax errors and make sure that the survey instruments will successfully compile. This is not sufficient, however, to ensure that the resulting dataset will load without errors in your data analysis software of choice. -We developed the \texttt{ietestform} -command,\sidenote{\url{https://dimewiki.worldbank.org/wiki/ietestform}} part of +We developed the \texttt{ietestform} +command,\sidenote{\url{https://dimewiki.worldbank.org/wiki/ietestform}} part of the Stata package \texttt{iefieldkit}, to implement a form-checking routine for \textbf{SurveyCTO}, a proprietary implementation of \textbf{Open Data Kit (ODK)}. -Intended for use during questionnaire programming and before fieldwork, -\texttt{ietestform} tests for best practices in coding, naming and labeling, +Intended for use during questionnaire programming and before fieldwork, +\texttt{ietestform} tests for best practices in coding, naming and labeling, and choice lists. Although \texttt{ietestform} is software-specific, many of the tests it runs are general and important to consider regardless of software choice. -To give a few examples, \texttt{ietestform} tests that no variable names exceed -32 characters, the limit in Stata (variable names that exceed that limit will -be truncated, and as a result may no longer be unique). It checks whether +To give a few examples, \texttt{ietestform} tests that no variable names exceed +32 characters, the limit in Stata (variable names that exceed that limit will +be truncated, and as a result may no longer be unique). It checks whether ranges are included for numeric variables. \texttt{ietestform} also removes all leading and trailing blanks from response lists, which could be handled inconsistently across software. @@ -173,7 +172,7 @@ \subsection{High frequency checks} Tracking survey progress is important for monitoring attrition, so that it is clear early on if a change in protocols or additional tracking will be needed. It is also important to check interview completion rate and sample compliance by surveyor and survey team, to identify any under-performing individuals or teams. -When all data collection is complete, the survey team should prepare a final field report, +When all data collection is complete, the survey team should prepare a final field report, which should report reasons for any deviations between the original sample and the dataset collected. Identification and reporting of \textbf{missing data} and \textbf{attrition} is critical to the interpretation of survey data. It is important to structure this reporting in a way that not only groups broad rationales into specific categories @@ -182,51 +181,51 @@ \subsection{High frequency checks} This information should be stored as a dataset in its own right -- a \textbf{tracking dataset} -- that records all events in which survey substitutions and loss to follow-up occurred in the field and how they were implemented and resolved. 
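
To illustrate the routine checks described in this section, here is a minimal daily validation sketch in Stata. The file and variable names are hypothetical, and \texttt{ieduplicates}, introduced above, provides a fuller collaborative workflow for resolving the duplicate submissions that this flags.

\begin{verbatim}
* Daily sample-validation sketch -- file and variable names are hypothetical
use "raw_survey_data.dta", clear

* Flag duplicate submissions for the field team to resolve
duplicates tag respondent_id, gen(n_duplicates)
list respondent_id submissiondate if n_duplicates > 0

* Once duplicates are resolved, validate submissions against the sample
merge 1:1 respondent_id using "master_sample_list.dta"
count if _merge == 1    // interviewed but not in the sample list
count if _merge == 2    // sampled but not yet interviewed
tab _merge              // completion status for the progress report
\end{verbatim}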

-High frequency checks should also include survey-specific data checks. As electronic survey 
-software incorporates many data control features, discussed above, these checks 
-should focus on issues survey software cannot check automatically. As most of 
-these checks are survey specific, it is difficult to provide general guidance. 
-An in-depth knowledge of the questionnaire, and a careful examination of the 
-pre-analysis plan, is the best preparation. Examples include consistency 
-across multiple responses, complex calculation (such as crop yield, which first requires unit conversions), 
-suspicious patterns in survey timing, or atypical response patters from specific enumerators. 
-\sidenote{\url{https://dimewiki.worldbank.org/wiki/Monitoring_Data_Quality}} 
-survey software typically provides rich metadata, which can be useful in 
-assessing interview quality. For example, automatically collected time stamps 
-show how long enumerators spent per question, and trace histories show how many 
+High frequency checks should also include survey-specific data checks. As electronic survey
+software incorporates many data control features, discussed above, these checks
+should focus on issues survey software cannot check automatically. As most of
+these checks are survey specific, it is difficult to provide general guidance.
+An in-depth knowledge of the questionnaire, and a careful examination of the
+pre-analysis plan, is the best preparation. Examples include consistency
+across multiple responses, complex calculation (such as crop yield, which first requires unit conversions),
+suspicious patterns in survey timing, or atypical response patterns from specific enumerators.
+\sidenote{\url{https://dimewiki.worldbank.org/wiki/Monitoring_Data_Quality}}
+Survey software typically provides rich metadata, which can be useful in
+assessing interview quality. For example, automatically collected time stamps
+show how long enumerators spent per question, and trace histories show how many
 times answers were changed before the survey was submitted.
 
 High-frequency checks will only improve data quality if the issues they catch are communicated to the field.
-There are lots of ways to do this; what's most important is to find a way to create actionable information for your team, given field constraints. 
+There are lots of ways to do this; what's most important is to find a way to create actionable information for your team, given field constraints.
 \texttt{ipacheck} generates an Excel sheet with results for each run; these can be sent directly to the field teams.
 Many teams choose other formats to display results, notably online dashboards created by custom scripts.
 It is also possible to automate communication of errors to the field team by adding scripts to link the HFCs with a messaging program such as WhatsApp.
-Any of these solutions are possible: what works best for your team will depend on such variables as cellular networks in fieldwork areas, whether field supervisors have access to laptops, internet speed, and coding skills of the team preparing the HFC workflows. 
+Any of these solutions are possible: what works best for your team will depend on such variables as cellular networks in fieldwork areas, whether field supervisors have access to laptops, internet speed, and coding skills of the team preparing the HFC workflows.
 
 \subsection{Data considerations for field monitoring}
 
 Careful monitoring of field work is essential for high quality data. 
-\textbf{Back-checks}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Back_Checks}} and +\textbf{Back-checks}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Back_Checks}} and other survey audits help ensure that enumerators are following established protocols, and are not falsifying data. -For back-checks, a random subset of the field sample is chosen and a subset of information from the full survey is +For back-checks, a random subset of the field sample is chosen and a subset of information from the full survey is verified through a brief interview with the original respondent. -Design of the back-check questionnaire follows the same survey design -principles discussed above: you should use the pre-analysis plan +Design of the back-check questionnaire follows the same survey design +principles discussed above: you should use the pre-analysis plan or list of key outcomes to establish which subset of variables to prioritize. -Real-time access to the survey data increases the potential utility of -back-checks dramatically, and both simplifies and improves the rigor of related +Real-time access to the survey data increases the potential utility of +back-checks dramatically, and both simplifies and improves the rigor of related workflows. -You can use the raw data to draw the back-check sample; assuring it is +You can use the raw data to draw the back-check sample; assuring it is appropriately apportioned across interviews and survey teams. -As soon as back-checks are complete, the back-check data can be tested against +As soon as back-checks are complete, the back-check data can be tested against the original data to identify areas of concern in real-time. \texttt{bcstats} is a useful tool for analyzing back-check data in Stata module. \sidenote{\url{https://ideas.repec.org/c/boc/bocode/s458173.html}} -Electronic surveys also provide a unique opportunity to do audits through audio recordings of the interview, +Electronic surveys also provide a unique opportunity to do audits through audio recordings of the interview, typically short recordings triggered at random throughout the questionnaire. -\textbf{Audio audits} are a useful means to assess whether the enumerator is conducting the interview +\textbf{Audio audits} are a useful means to assess whether the enumerator is conducting the interview as expected (and not sitting under a tree making up data). Do note, however, that audio audits must be included in the Informed Consent. @@ -239,8 +238,8 @@ \section{Collecting Data Securely} \subsection{Secure data in the field} All mainstream data collection software automatically \textbf{encrypt} -\sidenote{\textbf{Encryption:} the process of making information unreadable to -anyone without access to a specific deciphering +\sidenote{\textbf{Encryption:} the process of making information unreadable to +anyone without access to a specific deciphering key. \url{https://dimewiki.worldbank.org/wiki/Encryption}} all data submitted from the field while in transit (i.e., upload or download). Your data will be encrypted from the time it leaves the device (in tablet-assisted data collation) or your browser (in web data collection), until it reaches the server. Therefore, as long as you are using an established survey software, this step is largely taken care of. However, the research team must ensure that all computers, tablets, and accounts used have a logon password and are never left unlocked. 
From 4bd1952b5067edb34d56b502fd8e11a4b6c708c2 Mon Sep 17 00:00:00 2001 From: Maria Date: Tue, 28 Jan 2020 13:40:10 -0500 Subject: [PATCH 499/854] Update chapters/data-collection.tex Co-Authored-By: Luiza Andrade --- chapters/data-collection.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index f60aadb91..10b88c5f6 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -190,7 +190,7 @@ \subsection{High frequency checks} pre-analysis plan, is the best preparation. Examples include consistency across multiple responses, complex calculation (such as crop yield, which first requires unit conversions), suspicious patterns in survey timing, or atypical response patters from specific enumerators. -\sidenote{\url{https://dimewiki.worldbank.org/wiki/Monitoring_Data_Quality}} +timing, or atypical response patterns from specific enumerators.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Monitoring_Data_Quality}} survey software typically provides rich metadata, which can be useful in assessing interview quality. For example, automatically collected time stamps show how long enumerators spent per question, and trace histories show how many From 200337a6217fa81a87ab66e626f5a336fb35eccd Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 28 Jan 2020 13:44:10 -0500 Subject: [PATCH 500/854] [ch5] password protection is not encryption! --- chapters/data-collection.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 2a0987b98..fe2a588de 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -268,7 +268,7 @@ \subsection{Secure data storage} This handling satisfies the rule of three: there are two on-site copies of the data and one off-site copy, so the data can never be lost in case of hardware failure. In addition, you should ensure that all teams take basic precautions to ensure the security of data, as most problems are due to human error. Ideally, the machine hard drives themselves should also be encrypted, as well as any external hard drives or flash drives used. -All files sent to the field containing PII data, such as sampling lists, should at least be password protected. This can be done using a zip-file creator. +All files sent to the field containing PII data, such as sampling lists, must be encrypted. You must never share passwords by email; rather, use a secure password manager. This significantly mitigates the risk in case there is a security breach such as loss, theft, hacking, or a virus, with little impact on day-to-day utilization. From afc7c0d49da7e9a6c641602904986a0383dd45b2 Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 28 Jan 2020 13:52:47 -0500 Subject: [PATCH 501/854] [ch6] removing quotes --- chapters/data-analysis.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index a0ca307d9..780012d71 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -343,7 +343,7 @@ \subsection{Construction tasks and how to approach them} Make sure to standardize units and recode categorical variables so their values are consistent. It's possible that your questionnaire asked respondents to report some answers as percentages and others as proportions, -or that in one variable 0 means "no" and 1 means "yes", while in another one the same answers were coded are 1 and 2. 
+or that in one variable 0 means ``no'' and 1 means ``yes'', while in another one the same answers were coded are 1 and 2. We recommend coding yes/no questions as either 1/0 or TRUE/FALSE, so they can be used numerically as frequencies in means and as dummies in regressions. Check that non-binary categorical variables have the same value-assignment, i.e., that labels and levels have the same correspondence across variables that use the same options. From 13c1f85ea3ebe1ef8c5bfe7d3a69a256e1a68745 Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 28 Jan 2020 14:01:42 -0500 Subject: [PATCH 502/854] [ch6] no need for multiple files open --- chapters/data-analysis.tex | 1 - 1 file changed, 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 780012d71..aa3350043 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -442,7 +442,6 @@ \subsection{Organizing analysis code} This is a good way to make sure specifications are consistent throughout the analysis. Using pre-specified globals or objects also makes your code more dynamic, so it is easy to update specifications and results without changing every script. It is completely acceptable to have folders for each task, and compartmentalize each analysis as much as needed. -It is better to have more code files open than to keep scrolling inside a given file. \subsection{Exporting outputs} From 3a9f1d85dd1021980d8da5579b17679b338ee64d Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 28 Jan 2020 14:03:04 -0500 Subject: [PATCH 503/854] [ch6] remove lack of quality --- chapters/data-analysis.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index aa3350043..d45c579f2 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -489,7 +489,7 @@ \section{Exporting analysis outputs} The \textbf{Stata Visual Library}\sidenote{ \url{https://worldbank.github.io/Stata-IE-Visual-Library/}} has examples of graphs created in Stata and curated by us.\sidenote{A similar resource for R is \textit{The R Graph Gallery}. \\\url{https://www.r-graph-gallery.com/}} \textbf{Data visualization}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_visualization}} \index{data visualization} -is increasingly popular, but a great deal lacks in quality.\cite{healy2018data,wilke2019fundamentals} +is increasingly popular, and is becoming a field in its own right.\cite{healy2018data,wilke2019fundamentals} We attribute some of this to the difficulty of writing code to create them. Making a visually compelling graph would already be hard enough if you didn't have to go through many rounds of googling to understand a command. The trickiest part of using plot commands is to get the data in the right format. From 77f20005f6d6325e45a5072a25ff0042769ecac8 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 28 Jan 2020 14:14:20 -0500 Subject: [PATCH 504/854] publish both --- chapters/publication.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 54171fac3..d3e955e4c 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -327,7 +327,7 @@ \subsection{Publishing data for replication} the data collection instrument or survey questionnaire so that readers can understand which data components are collected directly in the field and which are derived. 
-You should provide a clean version of the data +You should publish both a clean version of the data which corresponds exactly to the original database or questionnaire as well as the constructed or derived dataset used for analysis. Wherever possible, you should also release the code From c68ad64725db6dbd98fcb375581a5e021bb988fd Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 28 Jan 2020 14:16:43 -0500 Subject: [PATCH 505/854] word --- chapters/publication.tex | 5 ----- 1 file changed, 5 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index d3e955e4c..1e5b235f9 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -72,11 +72,6 @@ \subsection{Dynamic documents} and ensuring they are not will become more difficult as the document grows. As time goes on, it therefore becomes more and more likely that a mistake will be made or something will be missed. -Furthermore, it is very hard to simultaneously edit or track changes -in a Microsoft Word document. -It is usually the case that a file needs to be passed back and forth -and the order of contributions strictly controlled -so that time-consuming resolutions of differences can be avoided. Therefore this is a broadly unsuitable way to prepare technical documents. There are a number of tools that can be used for dynamic documents. From 8ad87c3c166793097a37cc5e143a802f4399efba Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Thu, 30 Jan 2020 14:04:51 -0500 Subject: [PATCH 506/854] be clear --- chapters/publication.tex | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 1e5b235f9..05a47e090 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -260,7 +260,10 @@ \section{Preparing a complete replication package} all necessary de-identified data for the analysis, and all code necessary for the analysis. The code should exactly reproduce the raw outputs you have used for the paper, -and should include no documentation or PII data you would not share publicly. +and should not include documentation or data you would not share publicly. +This usually means removing project-related documentation such as contracts +and details of data collection and other field work, +and double-checking all datasets for potentially identifying information. \subsection{Publishing data for replication} From 9664a7a5798f88fe8b60e47264bfa3f98bf7683d Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Thu, 30 Jan 2020 14:05:53 -0500 Subject: [PATCH 507/854] no groups --- chapters/publication.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 05a47e090..b9136e164 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -75,7 +75,7 @@ \subsection{Dynamic documents} Therefore this is a broadly unsuitable way to prepare technical documents. There are a number of tools that can be used for dynamic documents. -In the first group are code-based tools such as R's RMarkdown\sidenote{\url{https://rmarkdown.rstudio.com/}} +Some are code-based tools such as R's RMarkdown\sidenote{\url{https://rmarkdown.rstudio.com/}} and Stata's \texttt{dyndoc}\sidenote{\url{https://www.stata.com/manuals/rptdyndoc.pdf}}. These tools ``knit'' or ``weave'' text and code together, and are programmed to insert code outputs in pre-specified locations. 
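
For readers who have not used these tools, here is a sketch of what a minimal dynamic source file for Stata's \texttt{dyndoc} might look like; the file and variable names are hypothetical, and you should check \texttt{help dyndoc} for the exact tag syntax and options available in your version of Stata.

\begin{verbatim}
Text written here is ordinary Markdown and becomes part of the report.

<<dd_do>>
use "constructed_data.dta", clear
summarize household_income
<</dd_do>>
\end{verbatim}

Compiling such a file with the \texttt{dyndoc} command regenerates the report, with the Stata commands and their output inserted in place, every time the underlying data or code change.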
@@ -85,7 +85,7 @@ \subsection{Dynamic documents} because they tend to offer restricted editability outside the base software and often have limited abilities to incorporate precision formatting. -The second group of dynamic document tools do not require +Another type of dynamic document tools do not require direct operation of underlying code or software, but simply require that the writer have access to the updated outputs. One very simple one is Dropbox Paper, a free online writing tool From b41d703121bc2603a1753a30b53bb90b7c846d5a Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Thu, 30 Jan 2020 14:06:49 -0500 Subject: [PATCH 508/854] Rewrite --- chapters/publication.tex | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index b9136e164..9f8cf4c81 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -85,14 +85,14 @@ \subsection{Dynamic documents} because they tend to offer restricted editability outside the base software and often have limited abilities to incorporate precision formatting. -Another type of dynamic document tools do not require -direct operation of underlying code or software, but simply require -that the writer have access to the updated outputs. -One very simple one is Dropbox Paper, a free online writing tool -that allows linkages to files in Dropbox, -which are then automatically updated anytime the file is replaced. -Dropbox Paper has very few formatting options, -but it is appropriate for working with collaborators who are not using statistical software. +There are other dynamic document tools +which do not require direct operation of the underlying code or software, +simply access to the updated outputs. +These can be useful for working on informal outputs, such as blogposts, +with collaborators who do not code. +An example of this is Dropbox Paper, +a free online writing tool that allows linkages to files in Dropbox, +which are automatically updated anytime the file is replaced. However, the most widely utilized software for dynamically managing both text and results is \LaTeX\ (pronounced ``lah-tek'').\sidenote{ From 9d5e10be1c12f6a7a20ded8927711b418f3305f2 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Thu, 30 Jan 2020 16:17:26 -0500 Subject: [PATCH 509/854] Edition notes for 1.1 --- chapters/notes.tex | 32 ++++++++++++++++++-------------- 1 file changed, 18 insertions(+), 14 deletions(-) diff --git a/chapters/notes.tex b/chapters/notes.tex index fd5acbfe7..2a7f29008 100644 --- a/chapters/notes.tex +++ b/chapters/notes.tex @@ -1,25 +1,29 @@ -This is a pre-publication edition of +This is a draft peer review edition of \textit{Data for Development Impact: -The DIME Analytics Resource Guide}, mainly intended for initial feedback. -The DIME Analytics team is releasing this edition -to coincide with the annual \textit{Manage Successful Impact Evaluations} training, -so that a wide group of people who make up -part of the core audience for this book -can use the book to supplement that training -and provide feedback and improvements on it. +The DIME Analytics Resource Guide}. +This version of the book has been substantially revised +since the first release in June 2019 +with feedback from readers and other experts. +It now contains most of the major content +that we hope to include in the finished version, +and we are in the process of making final additions +and polishing the materials to formally publish it. 
-Whether you are a training participant, -a DIME team member, or you work for the World Bank -or another organization or university, -we ask that you read the contents of this book carefully and critically. -It is available as a PDF for your convenience at: +This book is intended to remain a living product +that is written and maintained in the open. +The raw code and edit history are online at: +\url{https://github.com/worldbank/d4di}. +You can get a PDF copy at: \url{https://worldbank.github.com/d4di}. -This website also includes the most updated instructions +The website also includes the most updated instructions for providing feedback, as well as a log of errata and updates that have been made to the content. \subsection{Feedback} +Whether you are a DIME team member or you work for the World Bank +or another organization or university, +we ask that you read the contents of this book carefully and critically. We encourage feedback and corrections so that we can improve the contents of the book in future editions. Please visit From a26709c6fbd10c77a5026966a489dc1bccb84c1c Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 30 Jan 2020 16:58:10 -0500 Subject: [PATCH 510/854] [bib] all item must have year for me --- bibliography.bib | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/bibliography.bib b/bibliography.bib index 3928513d6..f53422fdb 100644 --- a/bibliography.bib +++ b/bibliography.bib @@ -9,12 +9,13 @@ @Article{tidy-data year = {2014}, bdsk-url-1 = {http://www.jstatsoft.org/v59/i10/} } - -@MISC {88491, + +@MISC{88491, TITLE = {What is meant by the standard error of a maximum likelihood estimate?}, AUTHOR = {{Alecos Papadopoulos (\url{https://stats.stackexchange.com/users/28746/alecos-papadopoulos})}}, HOWPUBLISHED = {Cross Validated}, NOTE = {\url{https://stats.stackexchange.com/q/88491} (version: 2014-03-04)}, + year={2014}, EPRINT = {https://stats.stackexchange.com/q/88491}, URL = {https://stats.stackexchange.com/q/88491} } From 445e690a6d97875f75753627cf86fd44032ad44d Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 4 Feb 2020 14:59:11 -0500 Subject: [PATCH 511/854] #53 : run code from master --- chapters/planning-data-work.tex | 24 ++++++++++++++++++------ 1 file changed, 18 insertions(+), 6 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 7fa335392..80d67d2a2 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -544,12 +544,20 @@ \subsection{Documenting and organizing code} The master script is also where all the settings are established, such as versions, folder paths, functions, and constants used throughout the project. -\texttt{iefolder} creates these as master do-files.\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Master\_Do-files}} -Master scripts are a key element of code organization and collaboration, -and we will discuss some important features soon. -The master script should mimic the structure of the \texttt{DataWork} folder. -This is done through the creation of globals (in Stata) or string scalars (in R). +Try to create the habit of running your code from the master script. +Creating ``section switches'' using macros or objects to run only the codes related to a certain task +should always be preferred to manually open different scripts to run them in a certain order +(see Part 1 of \texttt{stata-master-dofile.do} for an example of how to do this). 
+Furthermore, running all scripts related to a particular task through the master whenever one of them is edited +helps you identify unintended consequences of the changes you made. +Say, for example, that you changed the name of a variable created in one script. +This may break another script that refers to this variable. +But unless you run both of them when the change is made, it may take time for that to happen, +and when it does, it may take time for you to understand what's causing an error. +The same applies to changes in data sets and results. + +To link code, data and outputs, the master script reflects the structure of the \texttt{DataWork} folder in code +through globals (in Stata) or string scalars (in R). These coding shortcuts can refer to subfolders, so that those folders can be referenced without repeatedly writing out their absolute file paths. Because the \texttt{DataWork} folder is shared by the whole team, @@ -560,6 +568,10 @@ \subsection{Documenting and organizing code} the only change necessary to run the entire code from a new computer is to change the path to the project folder to reflect the filesystem and username. The code in \texttt{stata-master-dofile.do} shows how folder structure is reflected in a master do-file. +Because writing and maintaining a master script can be challenging as a project grows, +an important feature of the \texttt{iefolder} is to write master do-files +and add to them whenever new subfolders are created in the \texttt{DataWork} folder.\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Master\_Do-files}} In order to maintain these practices and ensure they are functioning well, you should agree with your team on a plan to review code as it is written. From 49850b8303c0c9ff561e44ded51f5bde39572813 Mon Sep 17 00:00:00 2001 From: Luiza Andrade Date: Tue, 4 Feb 2020 17:10:24 -0500 Subject: [PATCH 512/854] Update stata-guide.tex --- appendix/stata-guide.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index 48645affa..123373857 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -14,7 +14,7 @@ must be proficient programmers, and that includes more than being able to compute the correct numbers. This appendix first has a short section with instructions on how to access and use the code shared in -this book. The second section contains a the current DIME Analytics style guide for Stata code. +this book. The second section contains the current DIME Analytics style guide for Stata code. Widely accepted and used style guides are common in most programming languages, and we think that using such a style guide greatly improves the quality of research projects coded in Stata. We hope that this guide can help to increase the emphasis in the Stata community on using, From 035a14f8b226e20b7486e9238709eebba1dd5f72 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 5 Feb 2020 11:22:48 -0500 Subject: [PATCH 513/854] Draft "outro" --- chapters/conclusion.tex | 40 ++++++++++++++++++++++++++++++++++++++++ manuscript.tex | 8 ++++++++ 2 files changed, 48 insertions(+) create mode 100644 chapters/conclusion.tex diff --git a/chapters/conclusion.tex b/chapters/conclusion.tex new file mode 100644 index 000000000..5f95249ad --- /dev/null +++ b/chapters/conclusion.tex @@ -0,0 +1,40 @@ +We hope you have enjoyed \textit{Data for Development Impact: The DIME Analytics Resource Guide}. 
+It lays out a complete vision of the tasks of a modern researcher, +from planning a project's data governance to publishing code and data +to accompany a research product. +We have tried to set the text up as a resource guide +so that you will always be able to return to it +as your work requires you to become progressively more familiar +with each of the topics included in the guide. + +We motivated the guide with a discussion of research as a public service: +one that requires you to be accountable to both research participants +and research consumers. +We then discussed the current research environment, +which requires you to cooperate with a diverse group of collaborators +using modern approaches to computing technology. +We outlined common research methods in impact evaluation +that motivate how field and data work is structured. +We discussed how to ensure that evaluation work is well-designed +and able to accomplish its goals. +We discussed the collection of primary data +and methods of analysis using statistical software, +as well as tools and practices for making this work publicly accessible. +This mindset and workflow, from top to bottom, +should outline the tasks and responsibilities +that make up a researcher's role as a truth-seeker and truth-teller. + +But as you probably noticed, the text itself only provides what we think is +just enough detail to get you started: +an understanding of the purpose and function of each of the core research steps. +The references and resources get into the complete details +of how you will realistically implement these tasks. +From the DIME Wiki pages that detail the specific code practices +and field procedures that our team uses, +to the theoretical papers that will help you figure out +how to handle the unique cases you will undoubtedly encounter, +we hope you will keep the book on your desk +(or the PDF on your desktop) +and come back to it anytime you need more information. +We wish you all the best in your work +and will love to hear any input you have on ours! diff --git a/manuscript.tex b/manuscript.tex index a56a64209..496ca31b3 100644 --- a/manuscript.tex +++ b/manuscript.tex @@ -91,6 +91,14 @@ \chapter{Chapter 7: Publishing collaborative research} \input{chapters/publication.tex} +%---------------------------------------------------------------------------------------- +% Conclusion +%---------------------------------------------------------------------------------------- + +\chapter{Bringing it all together} + +\input{chapters/conclusion.tex} + %---------------------------------------------------------------------------------------- % APPENDIX : Stata Style Guide %---------------------------------------------------------------------------------------- From b2b9d96edf52abcfd5e92d21fbb59b5f6182f8fb Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 5 Feb 2020 16:45:03 -0500 Subject: [PATCH 514/854] Rewrite Chp 3 introduction (#333) --- chapters/research-design.tex | 42 +++++++++++++++++++----------------- 1 file changed, 22 insertions(+), 20 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 329eef0f6..17f66876b 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -7,28 +7,30 @@ and there are lots of good resources out there that focus on designing interventions and evaluations as well as on econometric approaches. 
-Therefore, without going into technical detail, -this section will present a brief overview -of the most common methods that are used in development research, -particularly those that are widespread in program evaluation. -These ``causal inference'' methods will turn up in nearly every project, -so you will need to have a broad knowledge of how the methods in your project -are used in order to manage data and code appropriately. -The intent of this chapter is for you to obtain an understanding of -the way in which each method constructs treatment and control groups, -the data structures needed to estimate the corresponding effects, -and some available code tools designed for each method (the list, of course, is not exhaustive). - -Thinking through your design before starting data work is important for several reasons. -If you do not know how to calculate the correct estimator for your study, -you will not be able to assess the statistical power of your research design. -You will also be unable to make decisions in the field +But you do need to understand the intuitive approach of the main methods +in order to be able to collect, store, and analyze data effectively. +Laying out the research design before starting data work +ensure you will know how to assess the statistical power of your research design +and calculate the correct estimate of your results. +While you are in the field, understanding the research design +will enable you to make decisions in the field when you inevitably have to allocate scarce resources -between tasks like maximizing sample size -and ensuring follow-up with specific individuals. -You will save a lot of time by understanding the way +between costly tasks like maximizing sample size +or ensuring follow-up with specific respondents. + +Therefore, this section includes a brief overview +of some of the most common research designs that are used in development research, +especially those that are widespread in program evaluation. +These \textbf{causal inference} methods appear in nearly every project, +so the intent of this chapter is for you to obtain an understanding of +the way in which each method constructs treatment and control groups, +what data structures are needed to estimate program effects for reach design, +and what available code tools are designed for each method. +The list, of course, is not exhaustive. +If you are familiar with these methods, +you will save a lot of time by understanding the way your data needs to be organized -in order to be able to calculate meaningful results. +in order to be able to produce meaningful analytics throughout your projects. Just as importantly, familiarity with each of these approaches will allow you to keep your eyes open for research opportunities: many of the most interesting projects occur because people in the field From acfb5e84ca648a7f9ffe693279d08ed8b5df3304 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 5 Feb 2020 16:49:28 -0500 Subject: [PATCH 515/854] Rewrite Chp 7 introduction (#354) --- chapters/publication.tex | 30 ++++++++++++++++++------------ 1 file changed, 18 insertions(+), 12 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 9f8cf4c81..8418ea60d 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -1,7 +1,24 @@ %------------------------------------------------ \begin{fullwidth} -Publishing academic research today extends well beyond writing up and submitting a Word document alone. 
+For most research projects, completing a manuscript is not the end of the task. +Academic journals increasingly require submission of a replication package, +which contains the code and materials needed to create the results. +These represent an intellectual contribution in their own right, +because they enable others to learn from your process +and better understand the results you have obtained. +Holding code and data to the same standards a written work +is a new practice for many researchers. +In this chapter, we provide guidelines that will help you +prepare a functioning and informative replication package. +Ideally, if you have organized your analytical work +according to the general principles outlined throughout this book, +then preparing to release materials will not require +substantial reorganization of the work you have already done. +Hence, this step represents the conclusion of the system +of transparent, reproducible, and credible research we introduced +from the very first chapter of this book. + Typically, various contributors collaborate on both code and writing, manuscripts go through many iterations and revisions, and the final package for publication includes not just a manuscript @@ -14,17 +31,6 @@ In this section we suggest several methods -- collectively refered to as ``dynamic documents'' -- for managing the process of collaboration on any technical product. - -For most research projects, completing a manuscript is not the end of the task. -Academic journals increasingly require submission of a replication package, -which contains the code and materials needed to create the results. -These represent an intellectual contribution in their own right, -because they enable others to learn from your process -and better understand the results you have obtained. -Holding code and data to the same standards a written work -is a new practice for many researchers. -In this chapter, we provide guidelines that will help you -prepare a functioning and informative replication package. In all cases, we note that technology is rapidly evolving and that the specific tools noted here may not remain cutting-edge, but the core principles involved in publication and transparency will endure. From f09035dd2cb3366d2167dd51e7aee6a13e376b90 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 5 Feb 2020 16:54:22 -0500 Subject: [PATCH 516/854] Why coding is important (#178) --- chapters/planning-data-work.tex | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 80d67d2a2..7c44cecfc 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -303,7 +303,14 @@ \subsection{Choosing software} % ---------------------------------------------------------------------------------------------- \section{Organizing code and data} -Organizing files and folders is not a trivial task. +We assume you are going to do nearly all of your analytical work through code. +Good code, like a good recipe, allows other people to read and replicate it, +and this functionality is now considered an essential component of a research output. +You may do some exploratory tasks in an ``interactive'' way, +but anything that is included in a research output +must be coded up in an organized fashion so that you can release +the exact code recipe that goes along with your final results. +But organizing files and folders is not a trivial task. 
What is intuitive to one person rarely comes naturally to another, and searching for files and folders is everybody's least favorite task. As often as not, you come up with the wrong one, @@ -552,7 +559,7 @@ \subsection{Documenting and organizing code} helps you identify unintended consequences of the changes you made. Say, for example, that you changed the name of a variable created in one script. This may break another script that refers to this variable. -But unless you run both of them when the change is made, it may take time for that to happen, +But unless you run both of them when the change is made, it may take time for that to happen, and when it does, it may take time for you to understand what's causing an error. The same applies to changes in data sets and results. From 8a2cb7b68c36d38b8a4277438d1014824c84358c Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 5 Feb 2020 16:59:52 -0500 Subject: [PATCH 517/854] Data ownership and licensing: WB references (#156) --- chapters/publication.tex | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 8418ea60d..be7f1746f 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -290,7 +290,8 @@ \subsection{Publishing data for replication} or for any other reason you are not able to publish it, in many cases you will have the right to release at least some subset of your constructed data set, -even if it is just the derived indicators you constructed. +even if it is just the derived indicators you constructed and their documentation.\sidenote{ + \url{https://guide-for-data-archivists.readthedocs.io/en/latest/}} If you have questions about your rights over original or derived materials, check with the legal team at your organization or at the data provider's. You should only directly publish data which is fully de-identified @@ -303,6 +304,10 @@ \subsection{Publishing data for replication} and communicate them to any future users of the data. You must provide a license with any data release.\sidenote{ \url{https://iatistandard.org/en/guidance/preparing-organisation/organisation-data-publication/how-to-license-your-data/}} +Some common license types are documented at the World Bank Data Catalog\sidenote{ + \url{https://datacatalog.worldbank.org/public-licenses}} +and the World Bank Open Data Policy has futher examples of licenses that are used there.\sidenote{ + \url{https://microdata.worldbank.org/index.php/terms-of-use}} This document need not be extremely detailed, but it should clearly communicate to the reader what they are allowed to do with your data and how credit should be given and to whom in further work that uses it. 
From 5a1844006140352f949ddcddbc0b9d12afb6eb6e Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Mon, 27 Jan 2020 17:39:47 -0500 Subject: [PATCH 518/854] [code] mark code not to use in book --- code/minimum-detectable-effect.do | 42 +------------------------------ code/minimum-sample-size.do | 40 +---------------------------- code/randomization-program-1.do | 18 +------------ code/randomization-program-2.do | 15 +---------- code/sample-noise.do | 27 +------------------- 5 files changed, 5 insertions(+), 137 deletions(-) diff --git a/code/minimum-detectable-effect.do b/code/minimum-detectable-effect.do index 49db2e3b0..f75dc754e 100755 --- a/code/minimum-detectable-effect.do +++ b/code/minimum-detectable-effect.do @@ -1,41 +1 @@ -* Simulate power for different treatment effect sizes - clear - set matsize 5000 - cap mat drop results - set seed 852526 // Timestamp: 2019-02-26 23:18:42 UTC - -* Loop over treatment effect sizes (te) -* of 0 to 0.5 standard deviations -* in increments of 0.05 SDs - qui forval te = 0(0.05)0.5 { - forval i = 1/100 { // Loop 100 times - clear // New simulation - set obs 1000 // Set sample size to 1000 - - * Randomly assign treatment - * Here you could call a randomization program instead: - gen t = (rnormal() > 0) - - * Simulate assumed effect sizes - gen e = rnormal() // Include a normal error term - gen y = 1 + `te'*t + e // Set functional form for DGP - - * Does regression detect an effect in this assignment? - reg y t - - * Store the result - mat a = r(table) // Reg results - mat a = a[....,1] // t parameters - mat results = nullmat(results) \ a' , [`te'] // First run and accumulate - } // End iteration loop - } // End incrementing effect size - -* Load stored results into data - clear - svmat results , n(col) - -* Analyze all the regressions we ran against power 80% - gen sig = (pvalue <= 0.05) // Flag significant runs - -* Proportion of significant results in each effect size group (80% power) - graph bar sig , over(c10) yline(0.8) +* Do not use diff --git a/code/minimum-sample-size.do b/code/minimum-sample-size.do index d7688619a..f75dc754e 100755 --- a/code/minimum-sample-size.do +++ b/code/minimum-sample-size.do @@ -1,39 +1 @@ -* Power for varying sample size & a fixed treatment effect - clear - set matsize 5000 - cap mat drop results - set seed 510402 // Timestamp: 2019-02-26 23:19:00 UTC - -* Loop over sample sizes (ss) 100 to 1000, increments of 100 - qui forval ss = 100(100)1000 { - forval i = 1/100 { // 100 iterations per each - clear - set obs `ss' // Simulation with new sample size - - * Randomly assign treatment - * Here you could call a randomization program instead - gen t = (rnormal() > 0) - - * Simulate assumed effect size: here 0.2SD - gen e = rnormal() // Normal error term - gen y = 1 + 0.2*t + e // Functional form for DGP - - * Does regression detect an effect in this assignment? 
- reg y t - - * Store the result - mat a = r(table) // Reg results - mat a = a[....,1] // t parameters - mat results = nullmat(results) \ a' , [`ss'] // First run and accumulate - } // End iteration loop - } // End incrementing sample size - -* Load stored results into data - clear - svmat results , n(col) - -* Analyze all the regressions we ran against power 80% - gen sig = (pvalue <= 0.05) // Flag significant runs - -* Proportion of significant results in each effect size group (80% power) - graph bar sig , over(c10) yline(0.8) +* Do not use diff --git a/code/randomization-program-1.do b/code/randomization-program-1.do index f4f39dc6a..f75dc754e 100644 --- a/code/randomization-program-1.do +++ b/code/randomization-program-1.do @@ -1,17 +1 @@ -* Define a randomization program -cap prog drop my_randomization - prog def my_randomization - - * Syntax with open options for [ritest] - syntax , [*] - cap drop treatment - - * Group 2/5 in treatment and 3/5 in control - xtile group = runiform() , n(5) - recode group (1/2=0 "Control") (3/5=1 "Treatment") , gen(treatment) - drop group - - * Cleanup - lab var treatment "Treatment Arm" - -end +* Do not use diff --git a/code/randomization-program-2.do b/code/randomization-program-2.do index 0de9324fd..f75dc754e 100644 --- a/code/randomization-program-2.do +++ b/code/randomization-program-2.do @@ -1,14 +1 @@ -* Reproducible setup: data, isid, version, seed - sysuse auto.dta , clear - isid make, sort - version 13.1 - set seed 107738 // Timestamp: 2019-02-25 23:34:33 UTC - -* Call the program - my_randomization - tab treatment - -* Show randomization variation with [ritest] - ritest treatment _b[treatment] /// - , samplingprogram(my_randomization) kdensityplot /// - : reg price treatment +* Do not use diff --git a/code/sample-noise.do b/code/sample-noise.do index fe6334170..f75dc754e 100644 --- a/code/sample-noise.do +++ b/code/sample-noise.do @@ -1,26 +1 @@ -* Reproducible setup: data, isid, version, seed - sysuse auto.dta , clear - isid make, sort - version 13.1 - set seed 556292 // Timestamp: 2019-02-25 23:30:39 UTC - -* Get true population parameter for price mean - sum price - local theMean = `r(mean)' - -* Sample 20 units 1000 times and store the mean of [price] - cap mat drop results // Make matrix free - qui forvalues i = 1/1000 { - preserve - sample 20 , count // Remove count for 20% - sum price // Calculate sample mean - * Allow first run and append each estimate - mat results = nullmat(results) \ [`r(mean)'] - restore - } - -* Load the results into memory and graph the distribution - clear - mat colnames results = "price_mean" - svmat results , n(col) - kdensity price_mean , norm xline(`theMean') +* Do not use From b5c8555510743a47635a0925d4a2b7beeaa5430c Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 6 Feb 2020 13:58:06 -0500 Subject: [PATCH 519/854] [ch 2] move master scrip to inside code --- chapters/planning-data-work.tex | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 80d67d2a2..c503eb8fa 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -544,6 +544,8 @@ \subsection{Documenting and organizing code} The master script is also where all the settings are established, such as versions, folder paths, functions, and constants used throughout the project. +\codeexample{stata-master-dofile.do}{./code/stata-master-dofile.do} + Try to create the habit of running your code from the master script. 
Creating ``section switches'' using macros or objects to run only the codes related to a certain task should always be preferred to manually open different scripts to run them in a certain order @@ -674,8 +676,3 @@ \subsection{Output management} Take into account ease of use for different team members, but keep in mind that learning how to use a new tool may require some time investment upfront that will be paid off as your project advances. - - -% ---------------------------------------------------------------------------------------------- - -\codeexample{stata-master-dofile.do}{./code/stata-master-dofile.do} From e7193932ad4f49e28557209c27f3705e765e9e09 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 6 Feb 2020 14:56:18 -0500 Subject: [PATCH 520/854] [ch4] update basic randomization file --- chapters/sampling-randomization-power.tex | 4 ++-- code/replicability.do | 23 +++++++++++++---------- 2 files changed, 15 insertions(+), 12 deletions(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 47590d7a3..41642addf 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -144,6 +144,8 @@ \subsection{Reproducibility in random Stata processes} so carefully confirm exactly how your code runs before finalizing it.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Randomization_in_Stata}} +\codeexample{replicability.do}{./code/replicability.do} + To confirm that a randomization has worked well before finalizing its results, save the outputs of the process in a temporary location, re-run the code, and use \texttt{cf} or \texttt{datasignature} to ensure @@ -505,8 +507,6 @@ \subsection{Randomization inference} % code -\codeexample{replicability.do}{./code/replicability.do} - \codeexample{randomization-cf.do}{./code/randomization-cf.do} \codeexample{simple-sample.do}{./code/simple-sample.do} diff --git a/code/replicability.do b/code/replicability.do index 246149c48..67be9a004 100644 --- a/code/replicability.do +++ b/code/replicability.do @@ -1,20 +1,23 @@ -* Set the version +* VERSIONING - Set the version ieboilstart , v(13.1) `r(version)' -* Load the auto dataset and sort uniquely +* Load the auto dataset sysuse auto.dta , clear + +* SORTING - sort on the uniquely identifying variable "make" isid make, sort -* Set the seed using random.org (range: 100000 - 999999) - set seed 287608 // Timestamp: 2019-02-17 23:06:36 UTC +* SEEDING - Seed picked using http://bit.ly/stata-random + set seed 287608 -* Demonstrate stability under the three rules - gen check1 = rnormal() - gen check2 = rnormal() +* Demonstrate stability after VERSIONING, SORTING and SEEDING + gen check1 = rnormal() //Create random number + gen check2 = rnormal() //Create a second random number without resetting seed - set seed 287608 - gen check3 = rnormal() + set seed 287608 //Reset the seed + gen check3 = rnormal() //Create a third random number after resetting seed -//Visualize randomization results +* Visualize randomization results. 
See how check1 and check3 are identical, +* but check2 is random relative check1 and check3 graph matrix check1 check2 check3 , half From 0a3fa2bc712571f5a12b12e27b8d67d14820134d Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 6 Feb 2020 15:14:24 -0500 Subject: [PATCH 521/854] [ch4] stable sort when num obs is changing --- chapters/sampling-randomization-power.tex | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 41642addf..46a025dc6 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -122,8 +122,11 @@ \subsection{Reproducibility in random Stata processes} Because numbers are assigned to each observation in row-by-row starting from the top row, changing their order will change the result of the process. -A corollary is that the underlying data must be unchanged between runs: -you must make a fixed final copy of the data when you run a randomization for fieldwork. +Since the exact order must be unchanged, the underlying data itself must be unchanged as well between runs. +This means that if you expect the number of observations to change (for example increase during +ongoing data collection) your randomization will not be stable unless you split your data up into +smaller fixed data set where the number of observations does not change. You can combine all +those smaller data sets after your randomization. In Stata, the only way to guarantee a unique sorting order is to use \texttt{isid [id\_variable], sort}. (The \texttt{sort, stable} command is insufficient.) You can additionally use the \texttt{datasignature} command to make sure the From 049e67bae7e705eb686dd7058c4c657ec55bd8f4 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 6 Feb 2020 15:25:23 -0500 Subject: [PATCH 522/854] [ch4] consistently use "list of random numbers" in text, not algorithm --- chapters/sampling-randomization-power.tex | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 46a025dc6..24565a2be 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -89,9 +89,9 @@ \subsection{Reproducibility in random Stata processes} can be re-obtained at a future time. All random methods must be reproducible.\cite{orozco2018make} Stata, like most statistical software, uses a \textbf{pseudo-random number generator}. -Basically, it has a really long ordered list of numbers with the property that -knowing the previous one gives you precisely zero information about the next one. -Stata uses one of these numbers every time it has a task that is non-deterministic. +Basically, it has a pre-calculated really long ordered list of numbers with the property that +knowing the previous one gives you precisely zero information about the next one, i.e. a list of random numbers. +Stata uses one number from this list every time it has a task that is non-deterministic. In ordinary use, it will cycle through these numbers starting from a fixed point every time you restart Stata, and by the time you get to any given script, the current state and the subsequent states will be as good as random.\sidenote{ @@ -103,11 +103,11 @@ \subsection{Reproducibility in random Stata processes} \textbf{versioning}, \textbf{sorting}, and \textbf{seeding}. 
\textbf{Versioning} means using the same version of the software each time you run the random process. -If anything is different, the underlying randomization algorithms may have changed, +If anything is different, the underlying list of random numbers may have changed, and it will be impossible to recover the original result. -In Stata, the \texttt{version} command ensures that the software algorithm is fixed.\sidenote{ +In Stata, the \texttt{version} command ensures that the list of random numbers is fixed.\sidenote{ At the time of writing we recommend using \texttt{version 13.1} for backward compatibility; -the algorithm was changed after Stata 14 but its improvements do not matter in practice. +the algorithm used to create this list of random numbers was changed after Stata 14 but its improvements do not matter in practice. You will \textit{never} be able to reproduce a randomization in a different software, such as moving from Stata to R or vice versa.} The \texttt{ieboilstart} command in \texttt{ietoolkit} provides functionality to support this requirement.\sidenote{ @@ -132,7 +132,7 @@ \subsection{Reproducibility in random Stata processes} You can additionally use the \texttt{datasignature} command to make sure the data is unchanged. -\textbf{Seeding} means manually setting the start-point of the randomization algorithm. +\textbf{Seeding} means manually setting the start-point in the list of random numbers. You can draw a uniformly distributed six-digit seed randomly by visiting \url{http://bit.ly/stata-random}. (This link is a just shortcut to request such a random seed on \url{https://www.random.org}.) There are many more seeds possible but this is a large enough set for most purposes. From da140e26fdf203e40899b479bee6f7fc5d78eb56 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 6 Feb 2020 15:25:49 -0500 Subject: [PATCH 523/854] [ch4] put most of this in side note as it was so specific --- chapters/sampling-randomization-power.tex | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 24565a2be..ee79a2a25 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -133,16 +133,16 @@ \subsection{Reproducibility in random Stata processes} data is unchanged. \textbf{Seeding} means manually setting the start-point in the list of random numbers. -You can draw a uniformly distributed six-digit seed randomly by visiting \url{http://bit.ly/stata-random}. -(This link is a just shortcut to request such a random seed on \url{https://www.random.org}.) -There are many more seeds possible but this is a large enough set for most purposes. -In Stata, \texttt{set seed [seed]} will set the generator to that state. -You should use exactly one unique, different, and randomly created seed per randomization process. +The seed is a number that should be at least six digits long and you should use exactly +one unique, different, and randomly created seed per randomization process.\sidenote{You +can draw a uniformly distributed six-digit seed randomly by visiting \url{http://bit.ly/stata-random}. +(This link is a just shortcut to request such a random seed on \url{https://www.random.org}.) +There are many more seeds possible but this is a large enough set for most purposes.} +In Stata, \texttt{set seed [seed]} will set the generator to that start-point. 
To be clear: you should not set a single seed once in the master do-file, but instead you should set a new seed in code right before each random process. The most important thing is that each of these seeds is truly random, so do not use shortcuts such as the current date or a seed you have used before. -You will see in the code below that we include the source and timestamp for verification. Other commands may induce randomness in the data or alter the seed without you realizing it, so carefully confirm exactly how your code runs before finalizing it.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Randomization_in_Stata}} From db921fb04c6f7eb6e47018721e9eff5d622b5599 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 6 Feb 2020 15:27:41 -0500 Subject: [PATCH 524/854] [ch4] explain auto.dta --- code/replicability.do | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/code/replicability.do b/code/replicability.do index 67be9a004..bc52ab1bc 100644 --- a/code/replicability.do +++ b/code/replicability.do @@ -2,7 +2,7 @@ ieboilstart , v(13.1) `r(version)' -* Load the auto dataset +* Load the auto dataset (auto.dta is a test data set included in all Stata installations) sysuse auto.dta , clear * SORTING - sort on the uniquely identifying variable "make" From e3cebd279e2adb27d2f16f75d5b50041e7a57792 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 6 Feb 2020 16:44:59 -0500 Subject: [PATCH 525/854] [ch2] remove comment specific to encrypted drive --- code/stata-master-dofile.do | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/code/stata-master-dofile.do b/code/stata-master-dofile.do index 3d7da5cd7..ee7f6530a 100644 --- a/code/stata-master-dofile.do +++ b/code/stata-master-dofile.do @@ -28,7 +28,7 @@ if "`c(username)'" == "ResearchAssistant" { global github "C:/Users/RA/Documents/GitHub/d4di/DataWork" global dropbox "C:/Users/RA/Dropbox/d4di/DataWork" - global encrypted "A:/DataWork/EncryptedData" // Always mount to A disk! + global encrypted "M:/DataWork/EncryptedData" } From d090e98cf42636f9640a9605ca8d4e3da181f912 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 6 Feb 2020 17:05:50 -0500 Subject: [PATCH 526/854] [ch4] document how code was selected Bringing back one thing mention in sentence removed in da140e26fdf203e40899b479bee6f7fc5d78eb56 --- chapters/sampling-randomization-power.tex | 1 + 1 file changed, 1 insertion(+) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index ee79a2a25..2cab9d302 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -143,6 +143,7 @@ \subsection{Reproducibility in random Stata processes} but instead you should set a new seed in code right before each random process. The most important thing is that each of these seeds is truly random, so do not use shortcuts such as the current date or a seed you have used before. +You should also describe in your code how the seed was selected. 
Other commands may induce randomness in the data or alter the seed without you realizing it, so carefully confirm exactly how your code runs before finalizing it.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Randomization_in_Stata}} From 5c0b81b310f9c015f189500a41d56a9423eb01d8 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 6 Feb 2020 17:58:26 -0500 Subject: [PATCH 527/854] [ch 4] simple sampling --- chapters/sampling-randomization-power.tex | 4 ++- code/simple-sample.do | 42 +++++++++++------------ 2 files changed, 24 insertions(+), 22 deletions(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 2cab9d302..bf0f45a4e 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -226,6 +226,8 @@ \subsection{Sampling} Ex post changes to the study scope using a sample drawn for a different purpose usually involve tedious calculations of probabilities and should be avoided. +\codeexample{simple-sample.do}{./code/simple-sample.do} + \subsection{Randomization} \textbf{Randomization}, in this context, is the process of assigning units into treatment arms. @@ -513,7 +515,7 @@ \subsection{Randomization inference} \codeexample{randomization-cf.do}{./code/randomization-cf.do} -\codeexample{simple-sample.do}{./code/simple-sample.do} + \codeexample{sample-noise.do}{./code/sample-noise.do} diff --git a/code/simple-sample.do b/code/simple-sample.do index 212fc1ca8..067a86c6d 100644 --- a/code/simple-sample.do +++ b/code/simple-sample.do @@ -1,26 +1,26 @@ -/* - Simple reproducible sampling -*/ +* Set up reproducbilitiy - VERSIONING, SORTTING and SEEDING + ieboilstart , v(13.1) // Version + `r(version)' // Version + sysuse bpwide.dta, clear // Load data + isid patient, sort // Sort + set seed 215597 // Seed - drawn using http://bit.ly/stata-random -* Set up reproducbilitiy - ieboilstart , v(12) // Version - `r(version)' // Version - sysuse auto.dta, clear // Load data - isid make, sort // Sort - set seed 215597 // Timestamp: 2019-04-26 17:51:02 UTC +* Generate a random number and use it to sort the observation. Then +* the order the observations are sorted in is random. + gen sample_rand = rnormal() //Generate a random number + sort sample_rand //Sort based on the random number -* Take a sample of 20% - preserve - sample 20 - tempfile sample - save `sample' , replace - restore +* Use the sort order to sample 20% (0.20) of the observations. _N in +* Stata is the number of observations in the active data set , and _n +* is the row number for each observation. The bpwide.dta has 120 +* observations, 120*0.20 = 24, so (_n <= _N * 0.20) is 1 for observations +* with a row number equal to or less than 24, and 0 for all other +* observations. Since the sort order is randomized this mean that we +* have randomly assigned 20% of the sample. 
+ gen sample = (_n <= _N * 0.20) -* Merge and complete - merge 1:1 make using `sample' - recode _merge (3 = 1 "Sampled") (* = 0 "Not Sampled") , gen(sample) - label var sample "Sampled" - drop _merge +* Restore the original sort order + isid patient, sort -* Check +* Check your result tab sample From 35a6d3201b9231ee55ab95a14b10212910369460 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 6 Feb 2020 18:11:00 -0500 Subject: [PATCH 528/854] [ch 4] randomization example --- chapters/sampling-randomization-power.tex | 6 ++++-- code/simple-multi-arm-randomization.do | 25 +++++++++++++++++++++++ 2 files changed, 29 insertions(+), 2 deletions(-) create mode 100644 code/simple-multi-arm-randomization.do diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index bf0f45a4e..a4e1a64d9 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -231,7 +231,7 @@ \subsection{Sampling} \subsection{Randomization} \textbf{Randomization}, in this context, is the process of assigning units into treatment arms. -Most of the code processses used for sampling are the same as those used for randomization, +Most of the code processes used for randomization are the same as those used for sampling, since randomization is also a process of splitting a sample into groups. Where sampling determines whether a particular individual will be observed at all in the course of data collection, @@ -255,7 +255,7 @@ \subsection{Randomization} Complexity can therefore grow very quickly in randomization and it is doubly important to fully understand the conceptual process that is described in the experimental design, -and fill in any gaps in the process before implmenting it in Stata. +and fill in any gaps in the process before implementing it in Stata. Some types of experimental designs necessitate that randomization results be revealed during data collection. It is possible to do this using survey software or live events. @@ -269,6 +269,8 @@ \subsection{Randomization} Understanding that process will also improve the ability of the team to ensure that the field randomization process is appropriately designed and executed. +\codeexample{simple-multi-arm-randomization.do}{./code/simple-multi-arm-randomization.do} + %----------------------------------------------------------------------------------------------- \section{Clustering and stratification} diff --git a/code/simple-multi-arm-randomization.do b/code/simple-multi-arm-randomization.do new file mode 100644 index 000000000..65c77c8b7 --- /dev/null +++ b/code/simple-multi-arm-randomization.do @@ -0,0 +1,25 @@ +* Set up reproducbilitiy - VERSIONING, SORTTING and SEEDING + ieboilstart , v(13.1) // Version + `r(version)' // Version + sysuse bpwide.dta, clear // Load data + isid patient, sort // Sort + set seed 654697 // Seed - drawn using http://bit.ly/stata-random + +* Generate a random number and use it to sort the observation. Then +* the order the observations are sorted in is random. + gen treatment_rand = rnormal() //Generate a random number + sort treatment_rand //Sort based on the random number + +* See simple-sample.do example for explination of "(_n <= _N * X)". The code +* below randomly selects one third into group 0, one third into group 1 and +* one third into group 2. 
Typically 0 represents the control group and 1 and +* 2 represents two treatment arms + generate treatment = 0 //Set all observations to 0 + replace treatment = 1 if (_n <= _N * (2/3)) //Set only the first two thirds to 1 + replace treatment = 2 if (_n <= _N * (1/3)) //Set only the first third to 2 + +* Restore the original sort order + isid patient, sort + +* Check your result + tab treatment From dfdd4f480daa937aa28c92dd7fc0b5caf3da93ae Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 6 Feb 2020 18:39:30 -0500 Subject: [PATCH 529/854] [ch 4] randtreat uneven groups --- chapters/sampling-randomization-power.tex | 5 +- code/randtreat-strata.do | 58 +++++++++++++---------- 2 files changed, 35 insertions(+), 28 deletions(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index a4e1a64d9..2777b3a66 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -345,6 +345,8 @@ \subsection{Stratification} and the process of assigning the leftover ``misfit'' observations imposes an additional layer of randomization above the specified division. +\codeexample{randtreat-strata.do}{./code/randtreat-strata.do} + Whenever stratification is used for randomization, the analysis of differences within the strata (especially treatment effects) requires a control in the form of an indicator variable for all strata (fixed effects). @@ -361,7 +363,6 @@ \subsection{Stratification} The exact formula depends on the analysis being performed, but is usually related to the inverse of the likelihood of inclusion. - %----------------------------------------------------------------------------------------------- \section{Power calculation and randomization inference} @@ -525,7 +526,7 @@ \subsection{Randomization inference} \codeexample{randomization-program-2.do}{./code/randomization-program-2.do} -\codeexample{randtreat-strata.do}{./code/randtreat-strata.do} + \codeexample{randtreat-clusters.do}{./code/randtreat-clusters.do} diff --git a/code/randtreat-strata.do b/code/randtreat-strata.do index f61a7a419..40d154d02 100644 --- a/code/randtreat-strata.do +++ b/code/randtreat-strata.do @@ -1,36 +1,42 @@ -* Use [randtreat] in randomization program ---------------- -cap prog drop my_randomization - prog def my_randomization +* If user written command randtreat is not installed, install it here + cap which randtreat + if _rc ssc install randtreat - * Syntax with open options for [ritest] - syntax , [*] - cap drop treatment - cap drop strata +* Set up reproducbilitiy - VERSIONING, SORTTING and SEEDING + ieboilstart , v(13.1) // Version + `r(version)' // Version + sysuse bpwide.dta, clear // Load data + isid patient, sort // Sort + set seed 796683 // Seed - drawn using http://bit.ly/stata-random - * Create strata indicator - egen strata = group(sex agegrp) , label - label var strata "Strata Group" +* Create strata indicator. The indicator is a categorical varaible with +* one value for each unique combination of gender and age group. + egen sex_agegroup = group(sex agegrp) , label + label var sex_agegroup "Strata Gender and Age Group" - * Group 1/5 in control and each treatment +* Use the user written command randtreat to randomize when the groups +* cannot be evenly distributed into treatment arms. There are 20 +* observations in each strata, and there is no way to evenly distribute +* 20 observations in 6 groups. If we assign 3 observation to each +* treatment arm we have 2 observations in each strata left. 
The remaining +* observations are called "misfits". In randtreat we can use the "global" +* misfit strategy, meaning that the misfits will be randomized into +* treatment groups so that the sizes of the treatment groups are as +* balanced as possible globally (read helpfile for more information). +* This way we have 6 treatment groups with exactly 20 observations +* in each, and it is randomized which strata that has an extra +* observation in each treatment arm. randtreat, /// - generate(treatment) /// New variable name - multiple(6) /// 6 arms - strata(strata) /// 6 strata - misfits(global) /// Randomized altogether + generate(treatment) /// New variable name + multiple(6) /// 6 treatment arms + strata(sex_agegroup) /// Variable to use as strata + misfits(global) /// Misfit strategy if uneven groups - * Cleanup + * Label the treatment variable lab var treatment "Treatment Arm" lab def treatment 0 "Control" 1 "Treatment 1" 2 "Treatment 2" /// 3 "Treatment 3" 4 "Treatment 4" 5 "Treatment 5" , replace lab val treatment treatment -end // ---------------------------------------------------- -* Reproducible setup: data, isid, version, seed - sysuse bpwide.dta , clear - isid patient, sort - version 13.1 - set seed 796683 // Timestamp: 2019-02-26 22:14:17 UTC - -* Randomize - my_randomization - tab treatment strata +* Show result of randomization + tab treatment sex_agegroup From 80af81da690dec114c13bdbb28c0ebf728c66aed Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 6 Feb 2020 18:40:19 -0500 Subject: [PATCH 530/854] [ch 4] two more code do not use --- code/randomization-cf.do | 30 +------------------------- code/randtreat-clusters.do | 43 +------------------------------------- 2 files changed, 2 insertions(+), 71 deletions(-) diff --git a/code/randomization-cf.do b/code/randomization-cf.do index 520b75008..f75dc754e 100644 --- a/code/randomization-cf.do +++ b/code/randomization-cf.do @@ -1,29 +1 @@ -* Make one randomization - sysuse bpwide.dta , clear - isid patient, sort - version 13.1 - set seed 796683 // Timestamp: 2019-02-26 22:14:17 UTC - - sample 100 - -* Save for comparison - tempfile sample - save `sample' , replace - -* Identical randomization - sysuse bpwide.dta , clear - isid patient, sort - version 13.1 - set seed 796683 // Timestamp: 2019-02-26 22:14:17 UTC - - sample 100 - cf _all using `sample' - -* Do something wrong - sysuse bpwide.dta , clear - sort bp* - version 13.1 - set seed 796683 // Timestamp: 2019-02-26 22:14:17 UTC - - sample 100 - cf _all using `sample' +* Do not use diff --git a/code/randtreat-clusters.do b/code/randtreat-clusters.do index 07688d9c6..f75dc754e 100644 --- a/code/randtreat-clusters.do +++ b/code/randtreat-clusters.do @@ -1,42 +1 @@ -* Use [randtreat] in randomization program ---------------- -cap prog drop my_randomization - prog def my_randomization - - * Syntax with open options for [ritest] - syntax , [*] - cap drop treatment - cap drop cluster - - * Create cluster indicator - egen cluster = group(sex agegrp) , label - label var cluster "Cluster Group" - - * Save data set with all observations - tempfile ctreat - save `ctreat' , replace - - * Keep only one from each cluster for randomization - bysort cluster : keep if _n == 1 - - * Group 1/2 in control and treatment in new variable treatment - randtreat, generate(treatment) multiple(2) - - * Keep only treatment assignment and merge back to all observations - keep cluster treatment - merge 1:m cluster using `ctreat' , nogen - - * Cleanup - lab var treatment "Treatment Arm" - lab def treatment 0 
"Control" 1 "Treatment" , replace - lab val treatment treatment -end // ---------------------------------------------------- - -* Reproducible setup: data, isid, version, seed - sysuse bpwide.dta , clear - isid patient, sort - version 13.1 - set seed 796683 // Timestamp: 2019-02-26 22:14:17 UTC - -* Randomize - my_randomization - tab cluster treatment +* Do not use From e2cc7015db51fd841fa41a13afd77cc155d6db9e Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 6 Feb 2020 18:42:52 -0500 Subject: [PATCH 531/854] [ch4] remove unused code example files --- chapters/sampling-randomization-power.tex | 20 -------------------- code/minimum-detectable-effect.do | 1 - code/minimum-sample-size.do | 1 - code/randomization-cf.do | 1 - code/randomization-program-1.do | 1 - code/randomization-program-2.do | 1 - code/randtreat-clusters.do | 1 - code/sample-noise.do | 1 - 8 files changed, 27 deletions(-) delete mode 100755 code/minimum-detectable-effect.do delete mode 100755 code/minimum-sample-size.do delete mode 100644 code/randomization-cf.do delete mode 100644 code/randomization-program-1.do delete mode 100644 code/randomization-program-2.do delete mode 100644 code/randtreat-clusters.do delete mode 100644 code/sample-noise.do diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 2777b3a66..c8b5975c7 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -513,23 +513,3 @@ \subsection{Randomization inference} or if results seem to depend dramatically on the placement of a small number of individuals, randomization inference will flag those issues before the experiment is fielded and allow adjustments to the design to be made. - -% code - -\codeexample{randomization-cf.do}{./code/randomization-cf.do} - - - -\codeexample{sample-noise.do}{./code/sample-noise.do} - -\codeexample{randomization-program-1.do}{./code/randomization-program-1.do} - -\codeexample{randomization-program-2.do}{./code/randomization-program-2.do} - - - -\codeexample{randtreat-clusters.do}{./code/randtreat-clusters.do} - -\codeexample{minimum-detectable-effect.do}{./code/minimum-detectable-effect.do} - -\codeexample{minimum-sample-size.do}{./code/minimum-sample-size.do} diff --git a/code/minimum-detectable-effect.do b/code/minimum-detectable-effect.do deleted file mode 100755 index f75dc754e..000000000 --- a/code/minimum-detectable-effect.do +++ /dev/null @@ -1 +0,0 @@ -* Do not use diff --git a/code/minimum-sample-size.do b/code/minimum-sample-size.do deleted file mode 100755 index f75dc754e..000000000 --- a/code/minimum-sample-size.do +++ /dev/null @@ -1 +0,0 @@ -* Do not use diff --git a/code/randomization-cf.do b/code/randomization-cf.do deleted file mode 100644 index f75dc754e..000000000 --- a/code/randomization-cf.do +++ /dev/null @@ -1 +0,0 @@ -* Do not use diff --git a/code/randomization-program-1.do b/code/randomization-program-1.do deleted file mode 100644 index f75dc754e..000000000 --- a/code/randomization-program-1.do +++ /dev/null @@ -1 +0,0 @@ -* Do not use diff --git a/code/randomization-program-2.do b/code/randomization-program-2.do deleted file mode 100644 index f75dc754e..000000000 --- a/code/randomization-program-2.do +++ /dev/null @@ -1 +0,0 @@ -* Do not use diff --git a/code/randtreat-clusters.do b/code/randtreat-clusters.do deleted file mode 100644 index f75dc754e..000000000 --- a/code/randtreat-clusters.do +++ /dev/null @@ -1 +0,0 @@ -* Do not use diff --git a/code/sample-noise.do b/code/sample-noise.do deleted 
file mode 100644
index f75dc754e..000000000
--- a/code/sample-noise.do
+++ /dev/null
@@ -1 +0,0 @@
-* Do not use

From 48dc02bf54e766d47d496fa622e1446b20f43f2f Mon Sep 17 00:00:00 2001
From: kbjarkefur
Date: Thu, 6 Feb 2020 18:53:38 -0500
Subject: [PATCH 532/854] [ch4] fix spacing

---
 code/randtreat-strata.do | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/code/randtreat-strata.do b/code/randtreat-strata.do
index 40d154d02..1b5eb1a67 100644
--- a/code/randtreat-strata.do
+++ b/code/randtreat-strata.do
@@ -1,6 +1,6 @@
 * If user written command randtreat is not installed, install it here
-   cap which randtreat
-   if _rc ssc install randtreat
+    cap which randtreat
+    if _rc ssc install randtreat

 * Set up reproducibility - VERSIONING, SORTING and SEEDING
    ieboilstart , v(13.1)    // Version

From 4656c118e396cd12d2428fd690f1972c8e384e21 Mon Sep 17 00:00:00 2001
From: kbjarkefur
Date: Thu, 6 Feb 2020 18:55:46 -0500
Subject: [PATCH 533/854] [ch4] fix spacing

---
 code/randtreat-strata.do | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/code/randtreat-strata.do b/code/randtreat-strata.do
index 1b5eb1a67..bc51014d6 100644
--- a/code/randtreat-strata.do
+++ b/code/randtreat-strata.do
@@ -39,4 +39,4 @@
    lab val treatment treatment

 * Show result of randomization
-   tab treatment sex_agegroup
+    tab treatment sex_agegroup

From 11b65884acfe2e5b76f1674ada115be54a98e9ee Mon Sep 17 00:00:00 2001
From: kbjarkefur
Date: Fri, 7 Feb 2020 08:42:10 -0500
Subject: [PATCH 534/854] [ch5] 3-2-1 rule consistency - fixes issue #332

---
 chapters/data-collection.tex | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex
index 3c61a1c32..ebef8cbd5 100644
--- a/chapters/data-collection.tex
+++ b/chapters/data-collection.tex
@@ -264,7 +264,10 @@ \subsection{Secure data storage}
 \end{enumerate}

-This handling satisfies the rule of three: there are two on-site copies of the data and one off-site copy, so the data can never be lost in case of hardware failure.
+This handling satisfies the \textbf{3-2-1 rule}: there are
+two on-site copies of the data and one off-site copy, so the data can never
+be lost in case of hardware
From 96d08b71b3d52214c669c762986b56c9799be711 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 7 Feb 2020 11:57:07 -0500 Subject: [PATCH 535/854] Setup and survey design --- bibliography.bib | 9 +- chapters/data-collection.tex | 233 ++++++++++++++++++++--------------- 2 files changed, 145 insertions(+), 97 deletions(-) diff --git a/bibliography.bib b/bibliography.bib index f53422fdb..47cefaf8c 100644 --- a/bibliography.bib +++ b/bibliography.bib @@ -8,7 +8,14 @@ @Article{tidy-data volume = {59}, year = {2014}, bdsk-url-1 = {http://www.jstatsoft.org/v59/i10/} - } +} + +@book{glewwe2000designing, + title={Designing household survey questionnaires for developing countries: lessons from 15 years of the living standards measurement study}, + author={Glewwe, Paul and Grosh, Margaret E}, + year={2000}, + publisher={World Bank} +} @MISC{88491, TITLE = {What is meant by the standard error of a maximum likelihood estimate?}, diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index ebef8cbd5..b5affbe7f 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -13,84 +13,105 @@ \end{fullwidth} %------------------------------------------------ -\section{Survey development workflow} -A well-designed questionnaire results from careful planning, consideration of analysis and indicators, close review of existing questionnaires, survey pilots, and research team and stakeholder review. There are many excellent resources on questionnaire design, such as from the World Bank's Living Standards Measurement Survey. -\sidenote{Grosh, Margaret; Glewwe, Paul. 2000. Designing Household Survey Questionnaires for Developing Countries : Lessons from 15 Years of the Living Standards Measurement Study, Volume 3. Washington, DC: World Bank. © World Bank.\url{https://openknowledge.worldbank.org/handle/10986/15195 License: CC BY 3.0 IGO.}} -The focus of this chapter is the particular design challenges for electronic surveys (often referred to as Computer Assisted Personal Interviews (CAPI)). -\sidenote{\url{https://dimewiki.worldbank.org/wiki/Computer-Assisted\_Personal\_Interviews\_(CAPI)}} +\section{Collecting primary data with development partners} -Although most surveys are now collected electronically, by tablet, mobile phone or web browser, -\textbf{questionnaire design}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Questionnaire_Design}} -\index{questionnaire design} -(content development) and questionnaire programming (functionality development) should be seen as two strictly separate tasks. -The research team should agree on all questionnaire content and design a paper version before programming an electronic version. -This facilitates a focus on content during the design process and ensures teams have a readable, printable paper version of their questionnaire. -Most importantly, it means the research, not the technology, drives the questionnaire design. - -An easy-to-read paper version of the questionnaire is particularly critical for training enumerators, so they can get an overview of the survey content and structure before diving into the programming. -It is much easier for enumerators to understand all possible response pathways from a paper version than from swiping question by question. -Finalizing the questionnaire before programming also avoids version control concerns that arise from concurrent work on paper and electronic survey instruments. -Finally, a paper questionnaire is an important documentation for data publication. 
+\subsection{Who owns data?} +\subsection{Data licensing agreements} -\subsection{Content-focused Pilot} -A \textbf{survey pilot}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Survey_Pilot}} is essential to finalize questionnaire design. -A content-focused pilot\sidenote{\url{https://dimewiki.worldbank.org/wiki/Piloting_Survey_Content}} is best done on pen-and-paper, before the questionnaire is programmed. -The objective is to improve the structure and length of the questionnaire, refine the phrasing and translation of specific questions, and confirm coded response options are exhaustive.\sidenote{\url{https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Content)&printable=yes}} In addition, it is an opportunity to test and refine all survey protocols, such as how units will be sampled or pre-selected units identified. The pilot must be done out-of-sample, but in a context as similar as possible to the study sample. +\subsection{Receiving data from development partners} -\subsection{Data-focused pilot} -A second survey pilot should be done after the questionnaire is programmed. -The objective of the data-focused pilot\sidenote{\url{https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Data)&printable=yes}} is to validate programming and export a sample dataset. -Significant desk-testing of the instrument is required to debug the programming as fully as possible before going to the field. -It is important to plan for multiple days of piloting, so that any further debugging or other revisions to the electronic survey instrument can be made at the end of each day and tested the following, until no further field errors arise. -The data-focused pilot should be done in advance of enumerator training +%------------------------------------------------ +\section{Collecting primary data using electronic surveys} + +\subsection{Developing a survey instrument} + +A well-designed questionnaire results from careful planning, +consideration of analysis and indicators, close review of existing questionnaires, +survey pilots, and research team and stakeholder review. +There are many excellent resources on questionnaire design, +such as from the World Bank's Living Standards Measurement Survey.\cite{glewwe2000designing} +The focus of this section is the design of electronic field surveys, +often referred to as Computer Assisted Personal Interviews (CAPI).\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Computer-Assisted\_Personal\_Interviews\_(CAPI)}} +Although most surveys are now collected electronically, by tablet, mobile phone or web browser, +\textbf{questionnaire design}\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Questionnaire_Design}} + \index{questionnaire design} +(content development) and questionnaire programming (functionality development) +should be seen as two strictly separate tasks. +Therefore, the research team should agree on all questionnaire content +and design a paper version of the survey before beginning to program the electronic version. +This facilitates a focus on content during the design process +and ensures teams have a readable, printable version of their questionnaire. +Most importantly, it means the research, not the technology, drives the questionnaire design. +We recomment this approach because an easy-to-read paper questionnaire +is especially useful for training data collection staff, +by focusing on the survey content and structure before diving into the technical component. 
+It is much easier for enumerators to understand the range of possible participant responses +and how to hand them correctly on a paper survey than on a tablet, +and it is much easier for them to translate that logic to digital functionality later. +Finalizing this version of the questionnaire before beginning any programming +also avoids version control concerns that arise from concurrent work +on paper and electronic survey instruments. +Finally, a readable paper questionnaire is a necessary component of data documentation, +since it is difficult to work backwards from the survey program to the intended concepts. -\section{Designing electronic questionnaires} The workflow for designing a questionnaire will feel much like writing an essay, or writing pseudocode: begin from broad concepts and slowly flesh out the specifics. It is essential to start with a clear understanding of the -\textbf{theory of change}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Theory_of_Change}} and \textbf{experimental design} for your project. -The first step of questionnaire design is to list key outcomes of interest, as well as the main factors to control for (covariates) and variables needed for experimental design. -The ideal starting point for this is a \textbf{pre-analysis plan}.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Pre-Analysis_Plan}} - -Use the list of key outcomes to create an outline of questionnaire \textit{modules} (do not number the modules yet; instead use a short prefix so they can be easily reordered). For each module, determine if the module is applicable to the full sample, the appropriate respondent, and whether or how often, the module should be repeated. A few examples: a module on maternal health only applies to household with a woman who has children, a household income module should be answered by the person responsible for household finances, and a module on agricultural production might be repeated for each crop the household cultivated. - -Each module should then be expanded into specific indicators to observe in the field. -\sidenote{\url{https://dimewiki.worldbank.org/wiki/Literature_Review_for_Questionnaire}} -At this point, it is useful to do a \textbf{content-focused pilot} -\sidenote{\url{https://dimewiki.worldbank.org/wiki/Piloting_Survey_Content}} of the questionnaire. -Doing this pilot with a pen-and-paper questionnaire encourages more significant revisions, as there is no need to factor in costs of re-programming, and as a result improves the overall quality of the survey instrument. - - -\subsection{Questionnaire design for quantitative analysis} -This book covers surveys designed to yield datasets useful for quantitative analysis. This is a subset of surveys, and there are specific design considerations that will help to ensure the raw data outputs are ready for analysis. - -From a data perspective, questions with pre-coded response options are always preferable to open-ended questions (the content-based pilot is an excellent time to ask open-ended questions, and refine responses for the final version of the questionnaire). Coding responses helps to ensure that the data will be useful for quantitative analysis. Two examples help illustrate the point. First, instead of asking ``How do you feel about the proposed policy change?'', use techniques like -\textbf{Likert scales}\sidenote{\textbf{Likert scale:} an ordered selection of choices indicating the respondent's level of agreement or disagreement with a proposed statement.}. 
Second, if collecting data on medication use or supplies, you could collect: the brand name of the product; the generic name of the product; the coded compound of the product; or the broad category to which each product belongs (antibiotic, etc.). All four may be useful for different reasons, but the latter two are likely to be the most useful for data analysis. The coded compound requires providing a translation dictionary to field staff, but enables automated rapid recoding for analysis with no loss of information. The generic class requires agreement on the broad categories of interest, but allows for much more comprehensible top-line statistics and data quality checks. Rigorous field testing is required to ensure that answer categories are comprehensive; however, it is best practice to include an \textit{other, specify} option. Keep track of those reponses in the first few weeks of fieldwork; adding an answer category for a response frequently showing up as \textit{other} can save time, as it avoids post-coding. - -It is useful to name the fields in your paper questionnaire in a way that will also work in the data analysis software. -There is debate over how individual questions should be identified: formats like \texttt{hq\_1} are hard to remember and unpleasant to reorder, but formats like \texttt{hq\_asked\_about\_loans} quickly become cumbersome. -We recommend using descriptive names with clear prefixes so that variables -within a module stay together when sorted -alphabetically.\sidenote{\url{https://medium.com/@janschenk/variable-names-in-survey-research-a18429d2d4d8}} - Variable names should never include spaces or mixed cases (all lower case is -best). Take care with the length: very long names will be cut off in certain -software, which could result in a loss of uniqueness. We discourage explicit -question numbering, as it discourages re-ordering, which is a common -recommended change after the pilot. In the case of follow-up surveys, numbering -can quickly become convoluted, too often resulting in variables names like -\texttt{ag\_15a}, \texttt{ag\_15\_new}, \texttt{ag\_15\_fup2}, etc. - -Questionnaires must include ways to document the reasons for \textbf{attrition}, treatment \textbf{contamination}, and \textbf{loss to follow-up}. +\textbf{theory of change}\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Theory_of_Change}} +and \textbf{experimental design} for your project. +The first step of questionnaire design is to list key outcomes of interest, +as well as the main covariates to control for and any variables needed for experimental design. +The ideal starting point for this is a \textbf{pre-analysis plan}.\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Pre-Analysis_Plan}} + +Use the list of key outcomes to create an outline of questionnaire \textit{modules}. +Do not number the modules yet; instead use a short prefix so they can be easily reordered. +For each module, determine if the module is applicable to the full sample, +the appropriate respondent, and whether or how often, the module should be repeated. +A few examples: a module on maternal health only applies to household with a woman who has children, +a household income module should be answered by the person responsible for household finances, +and a module on agricultural production might be repeated for each crop the household cultivated. 
+Each module should then be expanded into specific indicators to observe in the field.\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Literature_Review_for_Questionnaire}} +At this point, it is useful to do a \textbf{content-focused pilot}\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Piloting_Survey_Content}} +of the questionnaire. +Doing this pilot with a pen-and-paper questionnaire encourages more significant revisions, +as there is no need to factor in costs of re-programming, +and as a result improves the overall quality of the survey instrument. +Questionnaires must also include ways to document the reasons for \textbf{attrition}, +treatment \textbf{contamination}, and \textbf{loss to follow-up}. \index{attrition}\index{contamination} -These are essential data components for completing CONSORT records, a standardized system for reporting enrollment, intervention allocation, follow-up, and data analysis through the phases of a randomized trial of two groups.\cite{begg1996improving} +These are essential data components for completing CONSORT records, +a standardized system for reporting enrollment, intervention allocation, follow-up, +and data analysis through the phases of a randomized trial.\cite{begg1996improving} + +Once the content of the survey is drawn up, +the team should conduct a small \textbf{survey pilot}\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Survey_Pilot}} +using the paper forms to finalize questionnaire design and detect any content issues. +A content-focused pilot\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Piloting_Survey_Content}} +is best done on pen and paper, before the questionnaire is programmed, +because changes at this point may be deep and structural, which are hard to adjust in code. +The objective is to improve the structure and length of the questionnaire, +refine the phrasing and translation of specific questions, +and confirm coded response options are exhaustive.\sidenote{ + \url{https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Content)&printable=yes}} +In addition, it is an opportunity to test and refine all survey protocols, +such as how units will be sampled or pre-selected units identified. +The pilot must be done out-of-sample, +but in a context as similar as possible to the study sample. +Once the team is satisfied with the content and structure of the survey, +it is time to move on to implementing it electronically. + +\subsection{Designing electronic questionnaires} -Once the content of the questionnaire is finalized and translated, it is time to proceed with programming the electronic survey instrument. - - -%------------------------------------------------ -\section{Programming electronic questionnaires} Electronic data collection has great potential to simplify survey implementation and improve data quality. Electronic questionnaires are typically created in a spreadsheet (e.g. Excel or Google Sheets) or software-specific form builder, accessible even to novice users. \sidenote{\url{https://dimewiki.worldbank.org/wiki/Questionnaire_Programming}} @@ -101,15 +122,12 @@ \section{Programming electronic questionnaires} However, these are not fully automatic: you still need to actively design and manage the survey. Here, we discuss specific practices that you need to follow to take advantage of electronic survey features and ensure that the exported data is compatible with the software that will be used for analysis. 
- As discussed above, the starting point for questionnare programming is a complete paper version of the questionnaire, piloted for content, and translated where needed. Doing so reduces version control issues that arise from making significant changes to concurrent paper and electronic survey instruments. When programming, do not start with the first question and proceed straight through to the last question. Instead, code from high level to small detail, following the same questionnaire outline established at design phase. The outline provides the basis for pseudocode, allowing you to start with high level structure and work down to the level of individual questions. This will save time and reduce errors. - -\subsection{Electronic survey features} Electronic surveys are more than simply a paper questionnaire displayed on a mobile device or web browser. All common survey software allow you to automate survey logic and add in hard and soft constraints on survey responses. These features make enumerators' work easier, and they create the opportunity to identify and resolve data issues in real-time, simplifying data cleaning and improving response quality. @@ -127,7 +145,6 @@ \subsection{Electronic survey features} \item{\textbf{Calculations}}: make the electronic survey instrument do all math, rather than relying on the enumerator or asking them to carry a calculator. \end{itemize} -\subsection{Compatibility with analysis software} All survey software include debugging and test options to correct syntax errors and make sure that the survey instruments will successfully compile. This is not sufficient, however, to ensure that the resulting dataset will load without errors in your data analysis software of choice. We developed the \texttt{ietestform} @@ -144,9 +161,42 @@ \subsection{Compatibility with analysis software} ranges are included for numeric variables. \texttt{ietestform} also removes all leading and trailing blanks from response lists, which could be handled inconsistently across software. +From a data perspective, questions with pre-coded response options are always preferable to open-ended questions (the content-based pilot is an excellent time to ask open-ended questions, and refine responses for the final version of the questionnaire). Coding responses helps to ensure that the data will be useful for quantitative analysis. Two examples help illustrate the point. First, instead of asking ``How do you feel about the proposed policy change?'', use techniques like +\textbf{Likert scales}\sidenote{\textbf{Likert scale:} an ordered selection of choices indicating the respondent's level of agreement or disagreement with a proposed statement.}. Second, if collecting data on medication use or supplies, you could collect: the brand name of the product; the generic name of the product; the coded compound of the product; or the broad category to which each product belongs (antibiotic, etc.). All four may be useful for different reasons, but the latter two are likely to be the most useful for data analysis. The coded compound requires providing a translation dictionary to field staff, but enables automated rapid recoding for analysis with no loss of information. The generic class requires agreement on the broad categories of interest, but allows for much more comprehensible top-line statistics and data quality checks. Rigorous field testing is required to ensure that answer categories are comprehensive; however, it is best practice to include an \textit{other, specify} option. 
Keep track of those reponses in the first few weeks of fieldwork; adding an answer category for a response frequently showing up as \textit{other} can save time, as it avoids post-coding. + +It is useful to name the fields in your questionnaire in a way that will also work in the data analysis software. +There is debate over how individual questions should be identified: formats like \texttt{hq\_1} are hard to remember and unpleasant to reorder, but formats like \texttt{hq\_asked\_about\_loans} quickly become cumbersome. +We recommend using descriptive names with clear prefixes so that variables +within a module stay together when sorted +alphabetically.\sidenote{\url{https://medium.com/@janschenk/variable-names-in-survey-research-a18429d2d4d8}} + Variable names should never include spaces or mixed cases (all lower case is +best). Take care with the length: very long names will be cut off in certain +software, which could result in a loss of uniqueness. We discourage explicit +question numbering, as it discourages re-ordering, which is a common +recommended change after the pilot. In the case of follow-up surveys, numbering +can quickly become convoluted, too often resulting in variables names like +\texttt{ag\_15a}, \texttt{ag\_15\_new}, \texttt{ag\_15\_fup2}, etc. + +A second survey pilot should be done after the questionnaire is programmed. +The objective of the data-focused pilot\sidenote{\url{https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Data)&printable=yes}} is to validate programming and export a sample dataset. +Significant desk-testing of the instrument is required to debug the programming as fully as possible before going to the field. +It is important to plan for multiple days of piloting, so that any further debugging or other revisions to the electronic survey instrument can be made at the end of each day and tested the following, until no further field errors arise. +The data-focused pilot should be done in advance of enumerator training + +\subsection{Programming electronic questionnaires} %------------------------------------------------ -\section{Data quality assurance} +\section{Data quality assurance and data security} + +\subsection{Implementing high frequency quality checks} + +\subsection{Conducting back-checks and data validation} + +\subsection{Receiving, storing, and sharing data securely} + +%------------------------------------------------ + + A huge advantage of electronic surveys, compared to traditional paper surveys, is the ability to access and analyze the data while the survey is ongoing. Data issues can be identified and resolved in real-time. Designing systematic data checks, and running them routinely throughout data collection, simplifies monitoring and improves data quality. As part of survey preparation, the research team should develop a \textbf{data quality assurance plan} \sidenote{\url{https://dimewiki.worldbank.org/wiki/Data_Quality_Assurance_Plan}}. @@ -155,7 +205,7 @@ \section{Data quality assurance} Data quality assurance requires a combination of real-time data checks and survey audits. Careful field supervision is also essential for a successful survey; however, we focus on the first two in this chapter, as they are the most directly data related. 
-\subsection{High frequency checks} + High-frequency checks (HFCs) should carefully inspect key treatment and outcome variables so that the data quality of core experimental variables is uniformly high, and that additional field effort is centered where it is most important. Data quality checks should be run on the data every time it is downloaded (ideally on a daily basis), to flag irregularities in survey progress, sample completeness or response quality. \texttt{ipacheck}\sidenote{\url{https://github.com/PovertyAction/high-frequency-checks}} is a very useful command that automates some of these tasks. @@ -181,18 +231,18 @@ \subsection{High frequency checks} This information should be stored as a dataset in its own right -- a \textbf{tracking dataset} -- that records all events in which survey substitutions and loss to follow-up occurred in the field and how they were implemented and resolved. -High frequency checks should also include survey-specific data checks. As electronic survey -software incorporates many data control features, discussed above, these checks -should focus on issues survey software cannot check automatically. As most of -these checks are survey specific, it is difficult to provide general guidance. -An in-depth knowledge of the questionnaire, and a careful examination of the -pre-analysis plan, is the best preparation. Examples include consistency -across multiple responses, complex calculation (such as crop yield, which first requires unit conversions), -suspicious patterns in survey timing, or atypical response patters from specific enumerators. +High frequency checks should also include survey-specific data checks. As electronic survey +software incorporates many data control features, discussed above, these checks +should focus on issues survey software cannot check automatically. As most of +these checks are survey specific, it is difficult to provide general guidance. +An in-depth knowledge of the questionnaire, and a careful examination of the +pre-analysis plan, is the best preparation. Examples include consistency +across multiple responses, complex calculation (such as crop yield, which first requires unit conversions), +suspicious patterns in survey timing, or atypical response patters from specific enumerators. timing, or atypical response patterns from specific enumerators.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Monitoring_Data_Quality}} -survey software typically provides rich metadata, which can be useful in -assessing interview quality. For example, automatically collected time stamps -show how long enumerators spent per question, and trace histories show how many +survey software typically provides rich metadata, which can be useful in +assessing interview quality. For example, automatically collected time stamps +show how long enumerators spent per question, and trace histories show how many times answers were changed before the survey was submitted. High-frequency checks will only improve data quality if the issues they catch are communicated to the field. @@ -202,8 +252,6 @@ \subsection{High frequency checks} It is also possible to automate communication of errors to the field team by adding scripts to link the HFCs with a messaging program such as whatsapp. Any of these solutions are possible: what works best for your team will depend on such variables as cellular networks in fieldwork areas, whether field supervisors have access to laptops, internet speed, and coding skills of the team preparing the HFC workflows. 
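To make the above concrete, a minimal sketch of such checks in Stata is shown below. This sketch is illustrative only: every file, ID, and variable name in it is an assumption, and a real routine should follow the project's data quality assurance plan or use a dedicated package such as \texttt{ipacheck}.

* Illustrative daily checks on incoming submissions (all names are assumptions)
    use "raw/submissions.dta", clear

* Flag duplicate values of the unique respondent ID
    duplicates tag respondent_id, gen(dup_id)
    list respondent_id enumerator if dup_id > 0

* Flag out-of-range and logically inconsistent responses
    count if age > 120 & !missing(age)
    count if num_children > hh_size & !missing(num_children, hh_size)

* Flag unusually short interviews by enumerator
    bysort enumerator : egen med_duration = median(duration)
    list submission_id enumerator if duration < 0.5 * med_duration

Checks like these can be wrapped in a do-file that is run every time new data is downloaded, so that flagged cases are exported and shared with the field team the same day.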
- -\subsection{Data considerations for field monitoring} Careful monitoring of field work is essential for high quality data. \textbf{Back-checks}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Back_Checks}} and other survey audits help ensure that enumerators are following established protocols, and are not falsifying data. @@ -229,21 +277,17 @@ \subsection{Data considerations for field monitoring} as expected (and not sitting under a tree making up data). Do note, however, that audio audits must be included in the Informed Consent. -%------------------------------------------------ -\section{Collecting Data Securely} Primary data collection almost always includes \textbf{personally-identifiable information (PII)} \sidenote{\url{https://dimewiki.worldbank.org/wiki/Personally_Identifiable_Information_(PII)}}. PII must be handled with great care at all points in the data collection and management process, to comply with ethical requirements and avoid breaches of confidentiality. Access to PII must be restricted to team members granted that permission by the applicable Institutional Review Board or a data licensing agreement with a partner agency. Research teams must maintain strict protocols for data security at each stage of the process: data collection, storage, and sharing. -\subsection{Secure data in the field} All mainstream data collection software automatically \textbf{encrypt} \sidenote{\textbf{Encryption:} the process of making information unreadable to anyone without access to a specific deciphering key. \url{https://dimewiki.worldbank.org/wiki/Encryption}} all data submitted from the field while in transit (i.e., upload or download). Your data will be encrypted from the time it leaves the device (in tablet-assisted data collation) or your browser (in web data collection), until it reaches the server. Therefore, as long as you are using an established survey software, this step is largely taken care of. However, the research team must ensure that all computers, tablets, and accounts used have a logon password and are never left unlocked. -\subsection{Secure data storage} \textbf{Encryption at rest} is the only way to ensure that PII data remains private when it is stored on someone else's server on the internet. You must keep your data encrypted on the server whenever PII data is collected. Encryption makes data files completely unusable without access to a security key specific to that data -- a higher level of security than password-protection. Encryption at rest requires active participation from the user, and you should be fully aware that if your private key is lost, there is absolutely no way to recover your data. @@ -264,9 +308,9 @@ \subsection{Secure data storage} \end{enumerate} -This handling satisfies the \textbf{3-2-1 rule}: there are -two on-site copies of the data and one off-site copy, so the data can never -be lost in case of hardware +This handling satisfies the \textbf{3-2-1 rule}: there are +two on-site copies of the data and one off-site copy, so the data can never +be lost in case of hardware failure.\sidenote{\url{https://www.backblaze.com/blog/the-3-2-1-backup-strategy/}} In addition, you should ensure that all teams take basic precautions to ensure the security of data, as most problems are due to human error. Ideally, the machine hard drives themselves should also be encrypted, as well as any external hard drives or flash drives used. 
@@ -274,7 +318,6 @@ \subsection{Secure data storage} You must never share passwords by email; rather, use a secure password manager. This significantly mitigates the risk in case there is a security breach such as loss, theft, hacking, or a virus, with little impact on day-to-day utilization. -\subsection{Secure data sharing} To simplify workflow, it is best to remove PII variables from your data at the earliest possible opportunity, and save a de-identified copy of the data. Once the data is de-identified, it no longer needs to be encrypted - therefore you can interact with it directly, without having to provide the keyfile. @@ -302,8 +345,6 @@ \subsection{Secure data sharing} In cases where PII data is required for analysis, we recommend embargoing the sensitive variables when publishing the data. Access to the embargoed data could be granted for the purposes of study replication, if approved by an IRB. - - With the raw data securely stored and backed up, and a de-identified dataset to work with, you are ready to move to data cleaning, and analysis. %------------------------------------------------ From a7270c9920a314f29fcacbbd65a815254291cc85 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 7 Feb 2020 12:13:10 -0500 Subject: [PATCH 536/854] Electronic design and programming --- chapters/data-collection.tex | 159 ++++++++++++++++++++++++----------- 1 file changed, 110 insertions(+), 49 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index b5affbe7f..05db218f4 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -110,30 +110,100 @@ \subsection{Developing a survey instrument} Once the team is satisfied with the content and structure of the survey, it is time to move on to implementing it electronically. -\subsection{Designing electronic questionnaires} +\subsection{Designing surveys for electronic deployment} Electronic data collection has great potential to simplify survey implementation and improve data quality. -Electronic questionnaires are typically created in a spreadsheet (e.g. Excel or Google Sheets) or software-specific form builder, accessible even to novice users. -\sidenote{\url{https://dimewiki.worldbank.org/wiki/Questionnaire_Programming}} -We will not address software-specific form design in this book; rather, we focus on coding conventions that are important to follow for electronic surveys regardless of software choice. -\sidenote{\url{https://dimewiki.worldbank.org/wiki/SurveyCTO_Coding_Practices}} +Electronic questionnaires are typically created in a spreadsheet (e.g. Excel or Google Sheets) +or software-specific form builder, which are accessible even to novice users.\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Questionnaire_Programming}} +We will not address software-specific form design in this book; +rather, we focus on coding conventions that are important to follow +for electronic surveys regardless of software choice.\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/SurveyCTO_Coding_Practices}} +Survey software tools provide a wide range of features +designed to make implementing even highly complex surveys easy, scalable, and secure. +However, these are not fully automatic: you need to actively design and manage the survey. +Here, we discuss specific practices that you need to follow +to take advantage of electronic survey features +and ensure that the exported data is compatible with your analysis software. 
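As one illustration (not a prescribed template -- the file and variable names here are assumptions), a quick compatibility check of an exported dataset in Stata might look like this:

* Illustrative check that an exported survey dataset is analysis-ready
    use "raw/endline_export.dta", clear
    isid submission_id        // exactly one row per submission (assumed ID variable)
    describe, short           // confirm that variable names and types imported as expected
    ds, has(type string)      // list fields that exported as free text rather than coded values
    labelbook                 // review the value labels attached to coded responses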
+
+From a data perspective, questions with pre-coded response options
+are always preferable to open-ended questions.
+The content-focused pilot is an excellent time to ask open-ended questions
+and refine fixed responses for the final version of the questionnaire --
+do not count on coding up lots of free text after a full survey.
+Coding responses helps to ensure that the data will be useful for quantitative analysis.
+Two examples help illustrate the point.
+First, instead of asking ``How do you feel about the proposed policy change?'',
+use techniques like \textbf{Likert scales}\sidenote{
+  \textbf{Likert scale:} an ordered selection of choices indicating the respondent's level of agreement or disagreement with a proposed statement.}.
+Second, if collecting data on medication use or supplies, you could collect:
+the brand name of the product; the generic name of the product; the coded compound of the product;
+or the broad category to which each product belongs (antibiotic, etc.).
+All four may be useful for different reasons,
+but the latter two are likely to be the most useful for data analysis.
+The coded compound requires providing a translation dictionary to field staff,
+but enables automated rapid recoding for analysis with no loss of information.
+The generic class requires agreement on the broad categories of interest,
+but allows for much more comprehensible top-line statistics and data quality checks.
+Rigorous field testing is required to ensure that answer categories are comprehensive;
+however, it is best practice to include an \textit{other, specify} option.
+Keep track of those responses in the first few weeks of fieldwork.
+Adding an answer category for a response frequently showing up as \textit{other} can save time,
+as it avoids extensive post-coding.
+
+It is essential to name the fields in your questionnaire
+in a way that will also work in your data analysis software.
+Most survey programs will not enforce this by default,
+since limits vary by software,
+and surveys will subtly encourage you to use long sentences
+and detailed descriptions of choice options.
+This is what you want for the enumerator-respondent interaction,
+but you should already have analysis-compatible labels programmed in the background
+so the resulting data can be rapidly imported in analytical software.
+There is some debate over how exactly individual questions should be identified:
+formats like \texttt{hq\_1} are hard to remember and unpleasant to reorder,
+but formats like \texttt{hq\_asked\_about\_loans} quickly become cumbersome.
+We recommend using descriptive names with clear prefixes so that variables
+within a module stay together when sorted alphabetically.\sidenote{
+  \url{https://medium.com/@janschenk/variable-names-in-survey-research-a18429d2d4d8}}
+Variable names should never include spaces or mixed cases
+(we prefer all-lowercase naming).
+Take special care with the length: very long names will be cut off in some software,
+which could result in a loss of uniqueness and lots of manual work to restore compatibility.
+We further discourage explicit question numbering, as it discourages re-ordering,
+which is a common recommended change after the pilot.
+In the case of follow-up surveys, numbering can quickly become convoluted,
+too often resulting in uninformative variable names like
+\texttt{ag\_15a}, \texttt{ag\_15\_new}, \texttt{ag\_15\_fup2}, and so on.
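As a purely hypothetical sketch (every name below is invented), prefix-based descriptive naming can also be applied in Stata immediately after import if the form itself could not accommodate full names:

* Hypothetical renaming so that each module sorts together alphabetically
    rename (q101 q102 q103) (hh_size hh_head_age hh_head_female)
    rename (q201 q202)      (ag_crop_code ag_area_planted)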
-Survey software tools provide a wide range of features designed to make implementing even highly complex surveys easy, scalable, and secure. -However, these are not fully automatic: you still need to actively design and manage the survey. -Here, we discuss specific practices that you need to follow to take advantage of electronic survey features and ensure that the exported data is compatible with the software that will be used for analysis. +\subsection{Programming electronic questionnaires} -As discussed above, the starting point for questionnare programming is a complete paper version of the questionnaire, piloted for content, and translated where needed. -Doing so reduces version control issues that arise from making significant changes to concurrent paper and electronic survey instruments. -When programming, do not start with the first question and proceed straight through to the last question. -Instead, code from high level to small detail, following the same questionnaire outline established at design phase. -The outline provides the basis for pseudocode, allowing you to start with high level structure and work down to the level of individual questions. This will save time and reduce errors. +The starting point for questionnare programming is therefore a complete paper version of the questionnaire, +piloted for content and translated where needed. +Doing so reduces version control issues that arise from making significant changes +to concurrent paper and electronic survey instruments. +Changing structural components of the survey after programming has been started +often requires the coder to substantially re-work the entire code. +This is because the more efficient way to code surveys is non-linear. +When programming, we do not start with the first question and proceed through to the last question. +Instead, we code from high level to small detail, +following the same questionnaire outline established at design phase. +The outline provides the basis for pseudocode, +allowing you to start with high level structure and work down to the level of individual questions. +This will save time and reduce errors, +particularly where sections or field are interdependent or repeated in complex ways. Electronic surveys are more than simply a paper questionnaire displayed on a mobile device or web browser. -All common survey software allow you to automate survey logic and add in hard and soft constraints on survey responses. -These features make enumerators' work easier, and they create the opportunity to identify and resolve data issues in real-time, simplifying data cleaning and improving response quality. +All common survey software allow you to automate survey logic +and add in hard and soft constraints on survey responses. +These features make enumerators' work easier, +and they create the opportunity to identify and resolve data issues in real-time, +simplifying data cleaning and improving response quality. Well-programmed questionnaires should include most or all of the following features: \begin{itemize} + \item{\textbf{Localizations}}: the survey instrument should display full text questions and responses in the survey language, and it should also have English and code-compatible versions of all text and labels. \item{\textbf{Survey logic}}: build in all logic, so that only relevant questions appear, rather than relying on enumerators to follow complex survey logic. 
This covers simple skip codes, as well as more complex interdependencies (e.g., a child health module is only asked to households that report the presence of a child under 5). \item{\textbf{Range checks}}: add range checks for all numeric variables to catch data entry mistakes (e.g. age must be less than 120). \item{\textbf{Confirmation of key variables}}: require double entry of essential information (such as a contact phone number in a survey with planned phone follow-ups), with automatic validation that the two entries match. @@ -145,45 +215,36 @@ \subsection{Designing electronic questionnaires} \item{\textbf{Calculations}}: make the electronic survey instrument do all math, rather than relying on the enumerator or asking them to carry a calculator. \end{itemize} -All survey software include debugging and test options to correct syntax errors and make sure that the survey instruments will successfully compile. -This is not sufficient, however, to ensure that the resulting dataset will load without errors in your data analysis software of choice. -We developed the \texttt{ietestform} -command,\sidenote{\url{https://dimewiki.worldbank.org/wiki/ietestform}} part of -the Stata package -\texttt{iefieldkit}, to implement a form-checking routine for \textbf{SurveyCTO}, a proprietary implementation of \textbf{Open Data Kit (ODK)}. +All survey softwares include debugging and test options +to correct syntax errors and make sure that the survey instruments will successfully compile. +This is not sufficient, however, to ensure that the resulting dataset +will load without errors in your data analysis software of choice. +We developed the \texttt{ietestform} command,\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/ietestform}} +part of the Stata package \texttt{iefieldkit}, +to implement a form-checking routine for \textbf{SurveyCTO}, +a proprietary implementation of the \textbf{Open Data Kit (ODK)} software. Intended for use during questionnaire programming and before fieldwork, -\texttt{ietestform} tests for best practices in coding, naming and labeling, -and choice lists. -Although \texttt{ietestform} is software-specific, many of the tests it runs are general and important to consider regardless of software choice. +\texttt{ietestform} tests for best practices in coding, naming and labeling, and choice lists. +Although \texttt{ietestform} is software-specific, +many of the tests it runs are general and important to consider regardless of software choice. To give a few examples, \texttt{ietestform} tests that no variable names exceed 32 characters, the limit in Stata (variable names that exceed that limit will -be truncated, and as a result may no longer be unique). It checks whether -ranges are included for numeric variables. -\texttt{ietestform} also removes all leading and trailing blanks from response lists, which could be handled inconsistently across software. - -From a data perspective, questions with pre-coded response options are always preferable to open-ended questions (the content-based pilot is an excellent time to ask open-ended questions, and refine responses for the final version of the questionnaire). Coding responses helps to ensure that the data will be useful for quantitative analysis. Two examples help illustrate the point. 
First, instead of asking ``How do you feel about the proposed policy change?'', use techniques like -\textbf{Likert scales}\sidenote{\textbf{Likert scale:} an ordered selection of choices indicating the respondent's level of agreement or disagreement with a proposed statement.}. Second, if collecting data on medication use or supplies, you could collect: the brand name of the product; the generic name of the product; the coded compound of the product; or the broad category to which each product belongs (antibiotic, etc.). All four may be useful for different reasons, but the latter two are likely to be the most useful for data analysis. The coded compound requires providing a translation dictionary to field staff, but enables automated rapid recoding for analysis with no loss of information. The generic class requires agreement on the broad categories of interest, but allows for much more comprehensible top-line statistics and data quality checks. Rigorous field testing is required to ensure that answer categories are comprehensive; however, it is best practice to include an \textit{other, specify} option. Keep track of those reponses in the first few weeks of fieldwork; adding an answer category for a response frequently showing up as \textit{other} can save time, as it avoids post-coding. - -It is useful to name the fields in your questionnaire in a way that will also work in the data analysis software. -There is debate over how individual questions should be identified: formats like \texttt{hq\_1} are hard to remember and unpleasant to reorder, but formats like \texttt{hq\_asked\_about\_loans} quickly become cumbersome. -We recommend using descriptive names with clear prefixes so that variables -within a module stay together when sorted -alphabetically.\sidenote{\url{https://medium.com/@janschenk/variable-names-in-survey-research-a18429d2d4d8}} - Variable names should never include spaces or mixed cases (all lower case is -best). Take care with the length: very long names will be cut off in certain -software, which could result in a loss of uniqueness. We discourage explicit -question numbering, as it discourages re-ordering, which is a common -recommended change after the pilot. In the case of follow-up surveys, numbering -can quickly become convoluted, too often resulting in variables names like -\texttt{ag\_15a}, \texttt{ag\_15\_new}, \texttt{ag\_15\_fup2}, etc. +be truncated, and as a result may no longer be unique). +It checks whether ranges are included for numeric variables. +\texttt{ietestform} also removes all leading and trailing blanks from response lists, +which could be handled inconsistently across software. A second survey pilot should be done after the questionnaire is programmed. -The objective of the data-focused pilot\sidenote{\url{https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Data)&printable=yes}} is to validate programming and export a sample dataset. -Significant desk-testing of the instrument is required to debug the programming as fully as possible before going to the field. -It is important to plan for multiple days of piloting, so that any further debugging or other revisions to the electronic survey instrument can be made at the end of each day and tested the following, until no further field errors arise. 
-The data-focused pilot should be done in advance of enumerator training - -\subsection{Programming electronic questionnaires} +The objective of this \textbf{data-focused pilot}\sidenote{ + \url{https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Data)&printable=yes}} +is to validate the programming and export a sample dataset. +Significant desk-testing of the instrument is required to debug the programming +as fully as possible before going to the field. +It is important to plan for multiple days of piloting, +so that any further debugging or other revisions to the electronic survey instrument +can be made at the end of each day and tested the following, until no further field errors arise. +The data-focused pilot should be done in advance of enumerator training. %------------------------------------------------ \section{Data quality assurance and data security} From ad9c2b6d8c925b11b0edb7bf8dbd78be16795110 Mon Sep 17 00:00:00 2001 From: Maria Date: Fri, 7 Feb 2020 12:39:42 -0500 Subject: [PATCH 537/854] Intro re-write --- chapters/introduction.tex | 119 ++++++++++++++++++-------------------- 1 file changed, 55 insertions(+), 64 deletions(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index 913eae67f..54614bfff 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -1,81 +1,72 @@ \begin{fullwidth} Welcome to Data for Development Impact. -This book is intended to serve as a resource guide -for people who collect or use data for development research. -In particular, the book is intended to guide the reader -through the process of research using primary survey data, -from research design to fieldwork to data management to analysis. -This book will not teach you econometrics or epidemiology or agribusiness. -This book will not teach you how to design an impact evaluation. -This book will not teach you how to do data analysis, or how to code. -There are lots of really good resources out there for all of these things, -and they are much better than what we would be able to squeeze into this book. - -What this book will teach you is how to think about quantitative data, -keeping in mind that you are not going to be the only person -collecting it, using it, or looking back on it. -We hope to provide you two key tools by the time you finish this book. -First, we want you to form a mental model of data collection as a ``social process'', +This book is intended to teach you how to handle data effectively, efficiently, and ethically +at all stages of the research process: design, data acquisition, and analysis. +This book is not sector-specific. +It will not teach you econometrics, or how to design an impact evaluation. +It will teach you how to think about all aspects of your research from a data perspective: +how to structure every stage of your research to maximize data quality +and institute transparent and reproducible workflows. +The central premise of this book is that data work is a ``social process'', in which many people need to have the same idea about what is to be done, and when and where and by whom, -so that they can collaborate effectively on large, long-term projects. -Second, we want to provide a map of concrete resources for supporting these processes in practice. -As research teams and timespans have grown dramatically over the last decade, -it has become inefficient for everyone to have their own personal style -dictating how they use different functions, how they store data, and how they write code. 
+so that they can collaborate effectively on large, long-term research projects.
+
+An \textbf{empirical revolution}\sidenote{\url{https://www.bloomberg.com/opinion/articles/2018-08-02/how-economics-went-from-philosophy-to-science}}
+has changed the face of research economics rapidly over the last decade.
+Economics graduate students of the 2000s expected to work with primarily ``clean'' data from secondhand sources.
+Today, especially in the development subfield, working with raw data --
+whether collected through surveys or acquired through `big' data sources like sensors, satellites, or call data records --
+is a key skill for researchers and their staff.
+However, most graduates have little or no experience working with raw data when they are recruited as research assistants.
+Therefore, they tend to have a large ``skills gap'' on the practical tasks of development economics research.
+Yet there are few guides to the conventions, standards, and best practices that are fast becoming a necessity for impact evaluation projects.
+This book aims to fill that gap, providing a practical resource complete with code snippets and references to concrete resources that allow the reader to immediately put recommended processes into practice.
+
 \end{fullwidth}

 %------------------------------------------------
 \section{Doing credible research at scale}
+Development economics is increasingly dominated by empirical research.\cite{angrist2017economic}
+The scope and scale of empirical research projects have expanded rapidly in recent years:
+more people are working on the same data over longer timeframes.
+As the ambition of development researchers grows, so too has the complexity of the data
+on which they rely to make policy-relevant research conclusions from \textbf{field experiments}.\sidenote{
+\textbf{Field experiment:} experimental intervention in the real world, rather than in a laboratory.}
+Unfortunately, this seems to have happened (so far) without the creation of
+standards for practitioners to collaborate efficiently or structure data work for maximal reproducibility.
+This book contributes by providing practical guidance on how to handle data efficiently, transparently and collaboratively.
 The team responsible for this book is known as \textbf{DIME Analytics}.\sidenote{
 \url{http://www.worldbank.org/en/research/dime/data-and-analytics}}
-The DIME Analytics team works within the \textbf{Development Impact Evaluation (DIME)} group \sidenote{
+The DIME Analytics team works within the \textbf{Development Impact Evaluation (DIME)} Department \sidenote{
 \url{http://www.worldbank.org/en/research/dime}}
-at the World Bank's \textbf{Development Economics group (DEC)}.\sidenote{
-\url{http://www.worldbank.org/en/research/}}
+at the World Bank's \textbf{Development Economics (DEC) Vice Presidency}.\sidenote{
+\url{https://www.worldbank.org/en/about/unit/unit-dec}}
-After years of supporting hundreds of projects and staff in total,
-DIME Analytics has built up a set of ideas, practices, and software tools
-that support all of the research projects operated at DIME.
-
-In the time that we have been working in the development field,
-the proportion of projects that rely on \textbf{primary data} has soared.\cite{angrist2017economic}
-Today, the scope and scale of those projects continue to expand rapidly.
-More and more people are working on the same data over longer timeframes.
-This is because, while administrative datasets
-and \textbf{big data} have important uses,
-primary data\sidenote{\textbf{Primary data:} data collected from first-hand sources.}
-is critical for answering specific research questions.\cite{levitt2009field}
-As the ambition of development researchers grows, so too has the complexity of the data
-on which they rely to make policy-relevant research conclusions from \textbf{field experiments}.\sidenote{\textbf{Field experiment:} experimental intervention in the real world, rather than in a laboratory.}
-Unfortunately, this seems to have happened (so far) without the creation of
-guidelines for practitioners to handle data efficiently and transparently,
-which could provide relevant and objective quality markers for research consumers.
-
-One important lesson we have learned from doing field work over this time is that
-the most overlooked parts of primary data work are reproducibility and collaboration.
-You may be working with people
-who have very different skillsets and mindsets than you,
-from and in a variety of cultures and contexts, and you will have to adopt workflows
-that everyone can agree upon, and that save time and hassle on every project.
-This is not easy. But for some reason, the people who agreed to write this book enjoy doing it.
-(In part this is because it has saved ourselves a lot of time and effort.)
-As we have worked with more and more DIME recruits
-we have realized that we barely have the time to give everyone the attention they deserve.
-This book itself is therefore intended to be a vehicle to document our experiences and share it with with future DIME team members.
-
-The \textbf{DIME Wiki} is one of our flagship resources designed for teams engaged in impact evaluation projects.
-It is available as a free online collection of our resources and best practices.\sidenote{\url{http://dimewiki.worldbank.org/}}
-This book therefore complements the detailed-but-unstructured DIME Wiki
-with a guided tour of the major tasks that make up primary data collection.\sidenote{Like this: \url{https://dimewiki.worldbank.org/wiki/Primary_Data_Collection}}
+at the World Bank's \textbf{Development Economics (DEC) Vice Presidency}.\sidenote{
+\url{https://www.worldbank.org/en/about/unit/unit-dec}}
+DIME generates high-quality and operationally relevant data and research
+to transform development policy, help reduce extreme poverty, and secure shared prosperity.
+It develops customized data and evidence ecosystems to produce actionable information
+and recommend specific policy pathways to maximize impact.
+DIME conducts research in 60 countries with 200 agencies, leveraging a
+US\$180 million research budget to shape the design and implementation of
+US\$18 billion in development finance.
+DIME also provides advisory services to 30 multilateral and bilateral development agencies.
+Finally, DIME invests in public goods to improve the quality and reproducibility of development research around the world.
+
+DIME Analytics was created to take advantage of the concentration and scale of research at DIME to develop and test solutions,
+to ensure high quality of data collection and research across the DIME portfolio,
+and to make public training and tools available to the larger community of development researchers.
+Data for Development Impact compiles the ideas, best practices, and software tools Analytics
+has developed while supporting DIME's global impact evaluation portfolio.
+The \textbf{DIME Wiki} is one of our flagship products, a free online collection of our resources and best practices.\sidenote{ +\url{http://dimewiki.worldbank.org/}} +This book complements the DIME Wiki by providing a structure narrative of the data workflow for a typical research project. We will not give a lot of highly specific details in this text, -but we will point you to where they can be found, -and give you a sense of what you need to find next. -Each chapter will focus on one task, -and give a primarily narrative account of: +but we will point you to where they can be found.\sidenote{Like this: +\url{https://dimewiki.worldbank.org/wiki/Primary_Data_Collection}} +Each chapter focuses on one task, providing a primarily narrative account of: what you will be doing; where in the workflow this task falls; -when it should be done; who you will be working with; -why this task is important; and how to implement the most basic form of the task. +when it should be done; and how to implement it according to best practices. We will use broad terminology throughout this book to refer to different team members: From 1b16ba4e6ff6590bcd9c9890d0a52028eb73547e Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 7 Feb 2020 12:49:03 -0500 Subject: [PATCH 538/854] HFCs --- chapters/data-collection.tex | 129 ++++++++++++++++++++++++----------- 1 file changed, 89 insertions(+), 40 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 05db218f4..3d61d415c 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -249,8 +249,93 @@ \subsection{Programming electronic questionnaires} %------------------------------------------------ \section{Data quality assurance and data security} +Data quality assurance requires a combination of real-time data checks and back-checks or validation audits. +Careful field supervision is also essential for a successful survey; +however, we focus on the first two in this chapter, as they are the most directly data-related. + \subsection{Implementing high frequency quality checks} +A key advantage of continuous electronic data intake methods, +as compared to traditional paper surveys and one-time data dumps, +is the ability to access and analyze the data while the project is ongoing. +Data issues can be identified and resolved in real-time. +Designing systematic data checks and running them routinely throughout data intake +simplifies monitoring and improves data quality. +As part of data collection preparation, +the research team should develop a \textbf{data quality assurance plan}\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Data_Quality_Assurance_Plan}}. +While data collection is ongoing, +a research assistant or data analyst should work closely with the field team or partner +to ensure that the data collection is progressing correctly, +and set up and perform \textbf{high-frequency checks (HFCs)} with the incoming data.\sidenote{ + \url{https://github.com/PovertyAction/high-frequency-checks/wiki}} + +High-frequency checks (HFCs) should carefully inspect key treatment and outcome variables +so that the data quality of core experimental variables is uniformly high, +and that additional effort is centered where it is most important. +Data quality checks should be run on the data every time it is received (ideally on a daily basis) +to flag irregularities in survey progress, sample completeness or response quality. 
+\texttt{ipacheck}\sidenote{
+  \url{https://github.com/PovertyAction/high-frequency-checks}}
+is a very useful command that automates some of these tasks,
+regardless of the source of the data.
+
+It is important to check continuously that the observations in the data match the intended sample.
+Many survey software packages provide case management features
+through which sampled units are directly assigned to individual enumerators.
+For data received from partners this may be harder to validate,
+since they are the authoritative source of the data,
+so cross-referencing with other data sources may be necessary.
+Even with careful management, it is often the case that raw data includes duplicate or missing entries,
+which may occur due to data entry errors or failed submissions to data servers.\sidenote{
+  \url{https://dimewiki.worldbank.org/wiki/Duplicates_and_Survey_Logs}}
+\texttt{ieduplicates}\sidenote{
+  \url{https://dimewiki.worldbank.org/wiki/ieduplicates}}
+provides a workflow for collaborating on the resolution of duplicate entries between you and the provider.
+Then, observed units in the data must be validated against the expected sample:
+this is as straightforward as merging the sample list with the survey data and checking for mismatches.
+Reporting errors and duplicate observations in real-time allows the field team to make corrections efficiently.
+Tracking data collection progress is important for monitoring attrition,
+so that it is clear early on if a change in protocols or additional tracking will be needed.
+It is also important to check data collection completion rates
+and sample compliance by surveyor and survey team, if applicable,
+or compare data missingness across administrative regions,
+to identify any clusters that may be providing data of suspect quality.
+
+High frequency checks should also include content-specific data checks.
+Electronic survey and data entry software often incorporates many quality control features,
+so these checks should focus on issues survey software cannot check automatically.
+As most of these checks are project specific,
+it is difficult to provide general guidance.
+An in-depth knowledge of the questionnaire and a careful examination of the analysis plan
+is the best preparation.
+Examples include verifying consistency across multiple response fields,
+validation of complex calculations like crop yields or medicine stocks (which require unit conversions),
+suspicious patterns in survey timing,
+or atypical response patterns from specific data sources or enumerators.\sidenote{
+  \url{https://dimewiki.worldbank.org/wiki/Monitoring_Data_Quality}}
+Electronic data entry software typically provides rich metadata,
+which can be useful in assessing data quality.
+For example, automatically collected timestamps show when data was submitted
+and (for surveys) how long enumerators spent on each question,
+and trace histories show how many
+times answers were changed before or after the data was submitted.
+
+High-frequency checks will only improve data quality
+if the issues they catch are communicated to the data collection team.
+There are lots of ways to do this;
+what's most important is to find a way to create actionable information for your team.
+\texttt{ipacheck}, for example, generates a spreadsheet with flagged errors;
+these can be sent directly to the data collection teams.
+Many teams choose other formats to display results,
+such as online dashboards created by custom scripts.
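+As a concrete illustration of the sample checks described above,
+the minimal Stata sketch below flags duplicate IDs,
+compares the incoming data against the sample list, and tabulates completion by enumerator.
+The file and variable names (\texttt{raw\_survey.dta}, \texttt{sample\_list.dta},
+\texttt{hhid}, \texttt{enumerator}, \texttt{complete}) are hypothetical placeholders,
+and the sketch uses only built-in commands rather than the
+\texttt{ipacheck} or \texttt{ieduplicates} workflows referenced above.
+\begin{verbatim}
+* Daily sample-validation sketch (hypothetical names; adapt to your project)
+use "raw_survey.dta", clear
+
+* 1. Flag duplicate submissions of the same ID
+duplicates tag hhid, gen(dup_flag)
+list hhid if dup_flag > 0        // review and resolve with the data collection team
+
+* 2. Check observations against the intended sample
+merge m:1 hhid using "sample_list.dta"
+list hhid if _merge == 1         // in the data but not in the sample
+list hhid if _merge == 2         // sampled but not yet received
+
+* 3. Completion rates by enumerator
+tab enumerator complete, row missing
+\end{verbatim}
+A sketch like this can be run every time new data arrives and its output shared with the team.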
+It is also possible to automate communication of errors to the field team +by adding scripts to link the HFCs with a messaging program such as WhatsApp. +Any of these solutions are possible: +what works best for your team will depend on such variables as +cellular networks in fieldwork areas, whether field supervisors have access to laptops, +internet speed, and coding skills of the team preparing the HFC workflows. + \subsection{Conducting back-checks and data validation} \subsection{Receiving, storing, and sharing data securely} @@ -258,30 +343,13 @@ \subsection{Receiving, storing, and sharing data securely} %------------------------------------------------ -A huge advantage of electronic surveys, compared to traditional paper surveys, is the ability to access and analyze the data while the survey is ongoing. -Data issues can be identified and resolved in real-time. Designing systematic data checks, and running them routinely throughout data collection, simplifies monitoring and improves data quality. -As part of survey preparation, the research team should develop a \textbf{data quality assurance plan} \sidenote{\url{https://dimewiki.worldbank.org/wiki/Data_Quality_Assurance_Plan}}. -While data collection is ongoing, a research assistant or data analyst should work closely with the field team to ensure that the survey is progressing correctly, and perform \textbf{high-frequency checks (HFCs)} of the incoming data. -\sidenote{\url{https://github.com/PovertyAction/high-frequency-checks/wiki}} -Data quality assurance requires a combination of real-time data checks and survey audits. Careful field supervision is also essential for a successful survey; however, we focus on the first two in this chapter, as they are the most directly data related. -High-frequency checks (HFCs) should carefully inspect key treatment and outcome variables so that the data quality of core experimental variables is uniformly high, and that additional field effort is centered where it is most important. -Data quality checks should be run on the data every time it is downloaded (ideally on a daily basis), to flag irregularities in survey progress, sample completeness or response quality. \texttt{ipacheck}\sidenote{\url{https://github.com/PovertyAction/high-frequency-checks}} -is a very useful command that automates some of these tasks. -It is important to check every day that the units interviewed match the survey sample. -Many survey software include case management features, through which sampled units are directly assigned to individual enumerators. -Even with careful management, it is often the case that raw data includes duplicate entries, which may occur due to field errors or duplicated submissions to the server. -Even with careful management, it is often the case that raw data includes duplicate entries, which may occur due to field errors or duplicated submissions to the server.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Duplicates_and_Survey_Logs}} -\texttt{ieduplicates}\sidenote{\url{https://dimewiki.worldbank.org/wiki/ieduplicates}} -provides a workflow for collaborating on the resolution of duplicate entries between you and the field team. -Next, observed units in the data must be validated against the expected sample: -this is as straightforward as merging the sample list with the survey data and checking for mismatches. -Reporting errors and duplicate observations in real-time allows the field team to make corrections efficiently. 
-Tracking survey progress is important for monitoring attrition, so that it is clear early on if a change in protocols or additional tracking will be needed. -It is also important to check interview completion rate and sample compliance by surveyor and survey team, to identify any under-performing individuals or teams. + + + When all data collection is complete, the survey team should prepare a final field report, which should report reasons for any deviations between the original sample and the dataset collected. @@ -292,26 +360,7 @@ \subsection{Receiving, storing, and sharing data securely} This information should be stored as a dataset in its own right -- a \textbf{tracking dataset} -- that records all events in which survey substitutions and loss to follow-up occurred in the field and how they were implemented and resolved. -High frequency checks should also include survey-specific data checks. As electronic survey -software incorporates many data control features, discussed above, these checks -should focus on issues survey software cannot check automatically. As most of -these checks are survey specific, it is difficult to provide general guidance. -An in-depth knowledge of the questionnaire, and a careful examination of the -pre-analysis plan, is the best preparation. Examples include consistency -across multiple responses, complex calculation (such as crop yield, which first requires unit conversions), -suspicious patterns in survey timing, or atypical response patters from specific enumerators. -timing, or atypical response patterns from specific enumerators.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Monitoring_Data_Quality}} -survey software typically provides rich metadata, which can be useful in -assessing interview quality. For example, automatically collected time stamps -show how long enumerators spent per question, and trace histories show how many -times answers were changed before the survey was submitted. - -High-frequency checks will only improve data quality if the issues they catch are communicated to the field. -There are lots of ways to do this; what's most important is to find a way to create actionable information for your team, given field constraints. -`ipacheck` generates an excel sheet with results for each run; these can be sent directly to the field teams. -Many teams choose other formats to display results, notably online dashboards created by custom scripts. -It is also possible to automate communication of errors to the field team by adding scripts to link the HFCs with a messaging program such as whatsapp. -Any of these solutions are possible: what works best for your team will depend on such variables as cellular networks in fieldwork areas, whether field supervisors have access to laptops, internet speed, and coding skills of the team preparing the HFC workflows. + Careful monitoring of field work is essential for high quality data. 
\textbf{Back-checks}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Back_Checks}} and From c67ec6cca0b4689e8d16234c25f4dc29eff5a08b Mon Sep 17 00:00:00 2001 From: Maria Date: Fri, 7 Feb 2020 12:56:41 -0500 Subject: [PATCH 539/854] Intro re-write --- chapters/introduction.tex | 75 +++++++++++++++++++-------------------- 1 file changed, 36 insertions(+), 39 deletions(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index 54614bfff..d76358c07 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -75,31 +75,33 @@ \section{Doing credible research at scale} \textbf{field coordinators (FCs)} who are responsible for the operation of the project on the ground; and \textbf{research assistants (RAs)} who are responsible for -handling technical capacity and analytical tasks. +handling raw data processing and analytical tasks. \section{Writing reproducible code in a collaborative environment} Research reproduciblity and data quality follow naturally from -good code and standardized practices. -Process standardization means that there is -little ambiguity about how something ought to be done, -and therefore the tools that are used to do it are set in advance. -Good code is easier to read and replicate, making it easier to spot mistakes. -The resulting data contains substantially less noise -that is due to sampling, randomization, and cleaning errors. -And all data work can be easily reviewed before it's published and replicated afterwards. +good code and standardized processes. +Good code practices are a core part of the new data science of development research. +Code today is no longer a means to an end (such as a research paper), +rather it is part of the output itself: a means for communicating how something was done, +in a world where the credibility and transparency of data cleaning and analysis is increasingly important. -A good do-file consists of code that has two elements: + +"Good" code has two elements: - it is correct (doesn't produce any errors along the way) -- it is useful and comprehensible to someone who hasn't seen it before (such that the person who wrote this code isn't lost if they see it three weeks later) -Most research assistants that join our unit have only been trained in how to code correctly. -While correct results are extremely important, we usually tell our new research assistants that -\textit{when your code runs on your computer and you get the correct results then you are only half-done writing \underline{good} code.} - -Just as data collection and management processes have become more social and collaborative, -code processes have as well.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Stata_Coding_Practices}} This means other people need to be able to read your code. -Not only are things like documentation and commenting important, -but code should be readable in the sense that others can: +- it is useful and comprehensible to someone who hasn't seen it before (including the author three weeks later) +Many researchers have been trained to code correctly. +However, when your code runs on your computer and you get the correct results, you are only half-done writing \underline{good} code. +Good code is easy to read and replicate, making it easier to spot mistakes. +Good code reduces noise due to sampling, randomization, and cleaning errors. +Good code can easily be reviewed by others before it's published and replicated afterwards. 
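+To make the contrast concrete, the short sketch below shows the same operation written twice:
+both versions run, but only the second is comprehensible to a stranger
+(or to its author three weeks later).
+The dataset, variable, and file names are hypothetical placeholders,
+not drawn from the code appendix that follows.
+\begin{verbatim}
+* Correct but opaque: runs on whatever data is in memory,
+* and a reader cannot tell what it is for
+collapse (mean) inc, by(vid)
+
+* Correct and comprehensible: purpose, inputs, and outputs are explicit
+* Task  : compute village-level mean household income
+* Input : households.dta (one row per household; hypothetical)
+* Output: village_means.dta (one row per village)
+use "households.dta", clear
+collapse (mean) mean_income = hh_income, by(village_id)
+label variable mean_income "Mean household income (village)"
+save "village_means.dta", replace
+\end{verbatim}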
+ +Process standardization means that there is +little ambiguity about how something ought to be done, +and therefore the tools to do it can be set in advance. +Standard processes for code help other people to ready your code.\sidenote{ +\url{https://dimewiki.worldbank.org/wiki/Stata_Coding_Practices}} +Code should be well-documented, contain extensive comments, and be readable in the sense that others can: (1) quickly understand what a portion of code is supposed to be doing; (2) evaluate whether or not it does that thing correctly; and (3) modify it efficiently either to test alternative hypotheses @@ -107,8 +109,8 @@ \section{Writing reproducible code in a collaborative environment} To accomplish that, you should think of code in terms of three major elements: \textbf{structure}, \textbf{syntax}, and \textbf{style}. -We always tell people to ``code as if a stranger would read it'', -from tomorrow, that stranger will be you. +We always tell people to "code as if a stranger would read it" +(from tomorrow, that stranger will be you). The \textbf{structure} is the environment your code lives in: good structure means that it is easy to find individual pieces of code that correspond to tasks. Good structure also means that functional blocks are sufficiently independent from each other @@ -128,34 +130,29 @@ \section{Writing reproducible code in a collaborative environment} \codeexample{code.do}{./code/code.do} -We have tried really hard to make sure that all the Stata code runs, -and that each block is well-formatted and uses built-in functions. -We will also point to user-written functions when they provide important tools. -In particular, we have written two suites of Stata commands, -\texttt{ietoolkit}\sidenote{\url{https://dimewiki.worldbank.org/wiki/ietoolkit}} and \texttt{iefieldkit},\sidenote{\url{https://dimewiki.worldbank.org/wiki/iefieldkit}} -that standardize some of our core data collection workflows. -Providing some standardization to Stata code style is also a goal of this team, -since groups are collaborating on code in Stata more than ever before. -We will not explain Stata commands unless the behavior we are exploiting +For the code examples, we ensure that each block runs, is well-formatted, and uses built-in functions as much as possible. +We will point to user-written functions when they provide important tools. +In particular, we point to two suites of Stata commands developed by DIME Analytics, +\texttt{ietoolkit}\sidenote{\url{https://dimewiki.worldbank.org/wiki/ietoolkit}} and \texttt{iefieldkit},\sidenote{\url{https://dimewiki.worldbank.org/wiki/iefieldkit}}, +which standardize our core data collection workflows. +We do not explain Stata commands unless the behavior we are exploiting is outside the usual expectation of its functionality; we will comment the code generously (as you should), but you should reference Stata help-files \texttt{h [command]} whenever you do not understand the functionality that is being used. We hope that these snippets will provide a foundation for your code style. -Alongside the collaborative view of data that we outlined above, -good code practices are a core part of the new data science of development research. -Code today is no longer a means to an end (such as a paper), -but it is part of the output itself: it is a means for communicating how something was done, -in a world where the credibility and transparency of data cleaning and analysis is increasingly important. 
+Providing some standardization to Stata code style is also a goal of this team, +we provide our guidance on this in the DIME Analytics Stata Style Guide Appendix. + While adopting the workflows and mindsets described in this book requires an up-front cost, it should start to save yourself and others a lot of time and hassle very quickly. -In part this is because you will learn how to do the essential things directly; -in part this is because you will find tools for the more advanced things; -and in part this is because you will have the mindset to doing everything else in a high-quality way. +In part this is because you will learn how to implement essential practices directly; +in part because you will find tools for the more advanced practices; +and most importantly because you will acquire the mindset of doing research with a high-quality data focus. We hope you will find this book helpful for accomplishing all of the above, -and that you will find that mastery of data helps you make an impact! +and that mastery of data helps you make an impact! \textbf{-- The DIME Analytics Team} From 55eaea5030ccfcb3957a2d8ac776c94fa5ea6c69 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 7 Feb 2020 13:10:43 -0500 Subject: [PATCH 540/854] Back-checks and validations --- chapters/data-collection.tex | 81 ++++++++++++++++-------------------- 1 file changed, 36 insertions(+), 45 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 3d61d415c..1fdbeaf1b 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -338,55 +338,38 @@ \subsection{Implementing high frequency quality checks} \subsection{Conducting back-checks and data validation} -\subsection{Receiving, storing, and sharing data securely} - -%------------------------------------------------ - - - - - - - - - - -When all data collection is complete, the survey team should prepare a final field report, -which should report reasons for any deviations between the original sample and the dataset collected. -Identification and reporting of \textbf{missing data} and \textbf{attrition} is critical to the interpretation of survey data. -It is important to structure this reporting in a way that not only groups broad rationales into specific categories -but also collects all the detailed, open-ended responses to questions the field team can provide for any observations that they were unable to complete. -This reporting should be validated and saved alongside the final raw data, and treated the same way. -This information should be stored as a dataset in its own right -- a \textbf{tracking dataset} -- that records all events in which survey substitutions -and loss to follow-up occurred in the field and how they were implemented and resolved. - - - -Careful monitoring of field work is essential for high quality data. +Careful validation of data is essential for high-quality data. +Since we cannot control natural measurement error +that comes from variation in the realization of key outcomes, +primary data collection provides the opportunity to make sure +that there is no error arising from inaccuracies in the data itself. \textbf{Back-checks}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Back_Checks}} and -other survey audits help ensure that enumerators are following established protocols, and are not falsifying data. 
-For back-checks, a random subset of the field sample is chosen and a subset of information from the full survey is -verified through a brief interview with the original respondent. -Design of the back-check questionnaire follows the same survey design -principles discussed above: you should use the pre-analysis plan -or list of key outcomes to establish which subset of variables to prioritize. - -Real-time access to the survey data increases the potential utility of -back-checks dramatically, and both simplifies and improves the rigor of related -workflows. -You can use the raw data to draw the back-check sample; assuring it is -appropriately apportioned across interviews and survey teams. -As soon as back-checks are complete, the back-check data can be tested against +other validation audits help ensure that data collection is following established protocols, +and that data is not fasified, incomplete, or otherwise suspect. +For back-checks and validation audies, a random subset of the main data is selected, +and a subset of information from the full survey is +verified through a brief targeted survey with the original respondent +or a cross-referenced data set from another source. +Design of the back-checks or validations follows the same survey design +principles discussed above: you should use the analysis plan +or list of key outcomes to establish which subset of variables to prioritize, +and similarly focus on errors that would be major flags for poor quality data. + +Real-time access to the data massively increases the potential utility of validation, +and both simplifies and improves the rigor of the associated workflows. +You can use the raw primary data to draw the back-check or validation sample; +this ensures that the validation is correctly apportioned across observations. +As soon as checking is complete, the comparator data can be tested against the original data to identify areas of concern in real-time. -\texttt{bcstats} is a useful tool for analyzing back-check data in Stata module. -\sidenote{\url{https://ideas.repec.org/c/boc/bocode/s458173.html}} - -Electronic surveys also provide a unique opportunity to do audits through audio recordings of the interview, +The \texttt{bcstats} command is a useful tool for analyzing back-check data in Stata.\sidenote{ + \url{https://ideas.repec.org/c/boc/bocode/s458173.html}} +Some electronic surveys surveys also provide a unique opportunity +to do audits through audio recordings of the interview, typically short recordings triggered at random throughout the questionnaire. -\textbf{Audio audits} are a useful means to assess whether the enumerator is conducting the interview -as expected (and not sitting under a tree making up data). -Do note, however, that audio audits must be included in the Informed Consent. +\textbf{Audio audits} are a useful means to assess whether enumerators are conducting interviews as expected. +Do note, however, that audio audits must be included in the informed consent for the respondents. +\subsection{Receiving, storing, and sharing data securely} Primary data collection almost always includes \textbf{personally-identifiable information (PII)} \sidenote{\url{https://dimewiki.worldbank.org/wiki/Personally_Identifiable_Information_(PII)}}. @@ -455,6 +438,14 @@ \subsection{Receiving, storing, and sharing data securely} In cases where PII data is required for analysis, we recommend embargoing the sensitive variables when publishing the data. 
Access to the embargoed data could be granted for the purposes of study replication, if approved by an IRB. +When all data collection is complete, the survey team should prepare a final field report, +which should report reasons for any deviations between the original sample and the dataset collected. +Identification and reporting of \textbf{missing data} and \textbf{attrition} is critical to the interpretation of survey data. +It is important to structure this reporting in a way that not only groups broad rationales into specific categories +but also collects all the detailed, open-ended responses to questions the field team can provide for any observations that they were unable to complete. +This reporting should be validated and saved alongside the final raw data, and treated the same way. +This information should be stored as a dataset in its own right -- a \textbf{tracking dataset} -- that records all events in which survey substitutions +and loss to follow-up occurred in the field and how they were implemented and resolved. With the raw data securely stored and backed up, and a de-identified dataset to work with, you are ready to move to data cleaning, and analysis. %------------------------------------------------ From adca8e13cf219285d59f3be6cbb5a1ca663b436f Mon Sep 17 00:00:00 2001 From: Maria Date: Fri, 7 Feb 2020 15:28:30 -0500 Subject: [PATCH 541/854] Intro re-write First pass done. Restructured, added background on DIME, added preview of book content, and added material from our original proposal to the lede. --- chapters/introduction.tex | 98 +++++++++++++++++++++++---------------- 1 file changed, 59 insertions(+), 39 deletions(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index d76358c07..be7ffdbab 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -1,41 +1,43 @@ \begin{fullwidth} Welcome to Data for Development Impact. -This book is intended to teach you how to handle data effectively, efficiently, and ethically -at all stages of the research process: design, data acquisition, and analysis. -This book is not sector-specific. -It will not teach you econometrics, or how to design an impact evaluation. -It will teach you how to think about all aspects of your research from a data perspective: -how to structure every stage of your research to maximize data quality -and institute transparent and reproducible workflows. -The central premise of this book is that data work is a ``social process'', -in which many people need to have the same idea about what is to be done, and when and where and by whom, -so that they can collaborate effectively on large, long-term research projects. +This book is intended to teach all users of development data +how to handle data effectively, efficiently, and ethically. -An [empirical revolution]{\sidenote\url{https://www.bloomberg.com/opinion/articles/2018-08-02/how-economics-went-from-philosophy-to-science}} + +An [empirical revolution]{\cite{angrist2017economic}} has changed the face of research economics rapidly over the last decade. Economics graduate students of the 2000s expected to work with primarily "clean" data from secondhand sources. Today, especially in the development subfield, working with raw data- -whether collected through surveys or acquired through 'big' data sources like sensors, satellites, or call data records- +whether collected through surveys or acquired from 'big' data sources like sensors, satellites, or call data records- is a key skill for researchers and their staff. 
-However, most graduates have little or no experience working with raw data when they are recruited as research assistants. -Therefore they tend to have a large "skills gap" on the practical tasks of development economics research. -Yet there are few guides to the conventions, standards, and best practices that are fast becoming a necessity for impact evaluation projects. -This book aims to fill that gap, providing a practical resource complete with code snippets and references to concrete resources that allow the reader to immediately put recommended processes into practice. +At the same time, the scope and scale of empirical research projects is expanding: +more people are working on the same data over longer timeframes. +As the ambition of development researchers grows, so too has the complexity of the data +on which they rely to make policy-relevant research conclusions. +Yet there are few guides to the conventions, standards, and best practices +that are fast becoming a necessity for empirical research. +This book aims to fill that gap, providing guidance on how to handle data efficiently, transparently and collaboratively. + +This book is targeted to everyone who interacts with development data: +graduate students, research assistants, policymakers, and empirical researchers. +It covers data workflows at all stages of the research process: design, data acquisition, and analysis. +This book is not sector-specific; it will not teach you econometrics, or how to design an impact evaluation. +There are many excellent existing resources on those topics. +Instead, this book will teach you how to think about all aspects of your research from a data perspective, +how to structure research projects to maximize data quality, +and how to institute transparent and reproducible workflows. +The central premise of this book is that data work is a ``social process'', +in which many people need to have the same idea about what is to be done, and when and where and by whom, +so that they can collaborate effectively on large, long-term research projects. +It aims to be a highly practical resource: each chapter offers code snippets, links to checklists and other practical tools, +and references to primary resources that allow the reader to immediately put recommended processes into practice. + \end{fullwidth} %------------------------------------------------ \section{Doing credible research at scale} -Development economics is increasingly dominated by empirical research.\cite{angrist2017economic} -The scope and scale of empirical research projects has expanded rapidly in recent years: -more people are working on the same data over longer timeframes. -As the ambition of development researchers grows, so too has the complexity of the data -on which they rely to make policy-relevant research conclusions from \textbf{field experiments}.\sidenote{ -\textbf{Field experiment:} experimental intervention in the real world, rather than in a laboratory.} -Unfortunately, this seems to have happened (so far) without the creation of -standards for practitioners to collaborate efficiently or structure data work for maximal reproducibility. -This book contributes by providing practical guidance on how to handle data efficiently, transparently and collaboratively. 
The team responsible for this book is known as \textbf{DIME Analytics}.\sidenote{ \url{http://www.worldbank.org/en/research/dime/data-and-analytics}} @@ -68,24 +70,41 @@ \section{Doing credible research at scale} what you will be doing; where in the workflow this task falls; when it should be done; and how to implement it according to best practices. -We will use broad terminology throughout this book -to refer to different team members: + + +\section{Outline of this book} +The book progresses through the typical workflow of an empirical research project. +We start with ethical principles to guide empirical research, +focusing on research transparency and the right to privacy. +The second chapter discusses the importance of planning data work at the outset of the research project- +long before any data is acquired - and provide suggestions for collaborative workflows and tools. +Next, we turn to common research designs for \textbf{causal inference}{\sidenote{causal inference: identifying the change in outcome \textit{caused} by a particular intervention}}, and consider their implications for data structure. +The fourth chapter covers how to implement sampling and randomization to ensure research credibility, +and includes details on power calculation and randomization inference. +The fifth chapter provides guidance on high quality primary data collection, particularly for projects that use surveys. +The sixth chapter turns to data processing, +focusing on how to organize data work so that it is easy to code the desired analysis. +In the final chapter, we discuss publishing collaborative research- +both the research paper and the code and materials needed to recreate the results. + +We will use broad terminology throughout this book to refer to research team members: \textbf{principal investigators (PIs)} who are responsible for -the overall success of the project; +the overall design and stewardship of the study; \textbf{field coordinators (FCs)} who are responsible for the operation of the project on the ground; and \textbf{research assistants (RAs)} who are responsible for handling raw data processing and analytical tasks. -\section{Writing reproducible code in a collaborative environment} -Research reproduciblity and data quality follow naturally from -good code and standardized processes. -Good code practices are a core part of the new data science of development research. +\section{Writing reproducible code in a collaborative environment} +Throughout the book, we refer to the importance of good coding practices. +These are the foundation of reproducible and credible data work, +and a core part of the new data science of development research. Code today is no longer a means to an end (such as a research paper), rather it is part of the output itself: a means for communicating how something was done, in a world where the credibility and transparency of data cleaning and analysis is increasingly important. - +As this is fundamental to the remainder of the book's content, +we provide here a brief introduction to "good" code and standardized practices. "Good" code has two elements: - it is correct (doesn't produce any errors along the way) @@ -126,11 +145,13 @@ \section{Writing reproducible code in a collaborative environment} For some implementation portions where precise code is particularly important, we will provide minimal code examples either in the book or on the DIME Wiki. 
-In the book, they will be presented like the following: +All code guidance is software-agnostic, but code examples are provided in Stata +(we offer analogous examples in R as much as possible). +In the book, code examples will be presented like the following: \codeexample{code.do}{./code/code.do} -For the code examples, we ensure that each block runs, is well-formatted, and uses built-in functions as much as possible. +We ensure that each code block runs independently, is well-formatted, and uses built-in functions as much as possible. We will point to user-written functions when they provide important tools. In particular, we point to two suites of Stata commands developed by DIME Analytics, \texttt{ietoolkit}\sidenote{\url{https://dimewiki.worldbank.org/wiki/ietoolkit}} and \texttt{iefieldkit},\sidenote{\url{https://dimewiki.worldbank.org/wiki/iefieldkit}}, @@ -144,10 +165,9 @@ \section{Writing reproducible code in a collaborative environment} Providing some standardization to Stata code style is also a goal of this team, we provide our guidance on this in the DIME Analytics Stata Style Guide Appendix. - -While adopting the workflows and mindsets described in this book -requires an up-front cost, -it should start to save yourself and others a lot of time and hassle very quickly. +\section{Adopting reproducible workflows} +While adopting the workflows and mindsets described in this book requires an up-front cost, +it will save you (and your collaborators) a lot of time and hassle very quickly. In part this is because you will learn how to implement essential practices directly; in part because you will find tools for the more advanced practices; and most importantly because you will acquire the mindset of doing research with a high-quality data focus. From c82dfb60600f44bcfcc91128f35bac9a62f2e798 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 7 Feb 2020 15:36:33 -0500 Subject: [PATCH 542/854] Data and security --- chapters/data-collection.tex | 181 ++++++++++++++++++++++++----------- 1 file changed, 125 insertions(+), 56 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 1fdbeaf1b..e461b7fcf 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -371,81 +371,150 @@ \subsection{Conducting back-checks and data validation} \subsection{Receiving, storing, and sharing data securely} -Primary data collection almost always includes \textbf{personally-identifiable information (PII)} -\sidenote{\url{https://dimewiki.worldbank.org/wiki/Personally_Identifiable_Information_(PII)}}. -PII must be handled with great care at all points in the data collection and management process, to comply with ethical requirements and avoid breaches of confidentiality. Access to PII must be restricted to team members granted that permission by the applicable Institutional Review Board or a data licensing agreement with a partner agency. Research teams must maintain strict protocols for data security at each stage of the process: data collection, storage, and sharing. - -All mainstream data collection software automatically \textbf{encrypt} -\sidenote{\textbf{Encryption:} the process of making information unreadable to -anyone without access to a specific deciphering -key. \url{https://dimewiki.worldbank.org/wiki/Encryption}} -all data submitted from the field while in transit (i.e., upload or download). 
Your data will be encrypted from the time it leaves the device (in tablet-assisted data collation) or your browser (in web data collection), until it reaches the server. Therefore, as long as you are using an established survey software, this step is largely taken care of. However, the research team must ensure that all computers, tablets, and accounts used have a logon password and are never left unlocked. - -\textbf{Encryption at rest} is the only way to ensure that PII data remains private when it is stored on someone else's server on the internet. You must keep your data encrypted on the server whenever PII data is collected. -Encryption makes data files completely unusable without access to a security key specific to that data -- a higher level of security than password-protection. -Encryption at rest requires active participation from the user, and you should be fully aware that if your private key is lost, there is absolutely no way to recover your data. - -You should not assume that your data is encrypted by default: indeed, for most survey software platforms, encryption needs to be enabled by the user. -To enable it, you must confirm you know how to operate the encryption system and understand the consequences if basic protocols are not followed. -When you enable encryption, the service will allow you to download -- once -- the keyfile pair needed to decrypt the data. -You must download and store this in a secure location, such as a password manager. Make sure you store keyfiles with descriptive names to match the survey to which they correspond. -Any time anyone accesses the data- either when viewing it in the browser or downloading it to your computer- they will be asked to provide the keyfile. -Only project team members named in the IRB are allowed access to the private keyfile. - -To proceed with data analysis, you typically need a working copy of the data accessible from a personal computer. The following workflow allows you to receive data from the server and store it securely, without compromising data security. +Primary data collection, whether in surveys or from partners, +almost always includes \textbf{personally-identifiable information (PII)}\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Personally_Identifiable_Information_(PII)}} +from the people who are described in the dataset. +PII must be handled with great care at all points in the data collection and management process, +in order to comply with ethical and legal requirements +and to avoid breaches of confidentiality. +Access to PII must be restricted exclusively to team members +who are granted that permission by the applicable Institutional Review Board +or the data licensing agreement with the partner agency. +Research teams must maintain strict protocols for data security at each stage of the process, +including data collection, storage, and sharing. + +In field surveys, most common data collection software will automatically \textbf{encrypt}\sidenote{ + \textbf{Encryption:} the process of making information unreadable to + anyone without access to a specific deciphering + key. \url{https://dimewiki.worldbank.org/wiki/Encryption}} +all data submitted from the field while in transit (i.e., when uploading or downloading). +If this is implemented by the software, +the data will be encrypted from the time it leaves the device or browser until it reaches the server. +Therefore, as long as you are using an established survey software, this step is largely taken care of. 
+Of course, the research team must ensure that all computers, tablets, and accounts +that are used in data collection have a secure logon password and are never left unlocked. + +Even though your data is therefore usually safe while it is being transmitted, +it is not automatially secure when it is being stored. +\textbf{Encryption at rest} is the only way to ensure +that PII data remains private when it is stored on someone else's server on the internet. +You must keep your data encrypted on the data collection server whenever PII data is collected. +If you do not, the raw data will be accessible by individuals +who are not approved by your IRB agreement, +such as tech support personnel, server administrators, and other third-party staff. +Encryption at rest must be used to make data files completely unusable +without access to a security key specific to that data +-- a higher level of security than simple password-protection. +Encryption at rest requires active participation from the user, +and you should be fully aware that if your private encryption key is lost, +there is absolutely no way to recover your data. + +You should not assume that your data is encrypted by default: +because of the careful protocols necessary, for most data collection platforms, +encryption at rest needs to be explicitly enabled and operated by the user. +There is no automatic way to implement this protocol, +because the encryption key that is generated +can never pass through the hands of a third party, including the data storage application. +To enable encryption at rest, you must confirm +you know how to operate the encryption system +and understand the consequences if the correct protocols are not followed. +When you enable encryption at rest, the service typically will allow you to download -- once -- +the keyfile pair needed to decrypt the data. +You must download and store this keyfile in a secure location, such as a password manager. +Make sure you store keyfiles with descriptive names to match the survey to which they correspond. +The keyfiles must only be accessible to people who are IRB-approved to use PII data. +Any time anyone accesses the data -- either when viewing it in the browser or downloading it to your computer -- they will be asked to provide the keyfile. +If they cannot, the data is inaccessible. +This makes keyfile encryption the recommended storage for any data service +that is not enterprise-grade. +Enterprise-grade storage services typically implement a similar protocol +and are legally and technically configured so that your organization +is able to hold keys safely and allow data access based on verification of your identity. + +For most analytical needs, you will therefore need to create +a copy of the data which has all direct identifiers removed. +This working copy can be stored using unencrypted storage methods, +staff who are not IRB-approved can access and use the data, +and it can be shared with other people involved in the research without strict protocols. +The following workflow allows you to receive data and store it securely, +without compromising data security: \begin{enumerate} \item Download data \item Store a ``master'' copy of the data into an encrypted location that will remain accessible on disk and be regularly backed up - \item Create a ``gold master'' copy of the raw data in a secure location, such as a long-term cloud storage service or an encrypted physical hard drive stored in a separate location. 
If you remain lucky, you will never have to access this copy -- you just want to know it is out there, safe, if you need it. + \item Create a ``gold master'' copy of the raw data in a secure location, such as a long-term cloud storage service or an encrypted physical hard drive stored in a separate location. If you remain lucky, you will never have to access this copy -- you just want to know it is there, safe, if you need it. \end{enumerate} This handling satisfies the \textbf{3-2-1 rule}: there are two on-site copies of the data and one off-site copy, so the data can never -be lost in case of hardware -failure.\sidenote{\url{https://www.backblaze.com/blog/the-3-2-1-backup-strategy/}} +be lost in case of hardware failure.\sidenote{ + \url{https://www.backblaze.com/blog/the-3-2-1-backup-strategy/}} In addition, you should ensure that all teams take basic precautions to ensure the security of data, as most problems are due to human error. -Ideally, the machine hard drives themselves should also be encrypted, as well as any external hard drives or flash drives used. +Ideally, the machine hard drives themselves should also be encrypted, +as well as any external hard drives or flash drives used. All files sent to the field containing PII data, such as sampling lists, must be encrypted. You must never share passwords by email; rather, use a secure password manager. -This significantly mitigates the risk in case there is a security breach such as loss, theft, hacking, or a virus, with little impact on day-to-day utilization. - -To simplify workflow, it is best to remove PII variables from your data at the earliest possible opportunity, and save a de-identified copy of the data. -Once the data is de-identified, it no longer needs to be encrypted - therefore you can interact with it directly, without having to provide the keyfile. - -We recommend de-identification in two stages: an initial process to remove direct identifiers to create a working de-identified dataset, and a final process to remove all possible identifiers to create a publishable dataset. -The \textbf{initial de-identification} should happen directly after the encrypted data is downloaded to disk. At this time, for each variable that contains PII, ask: will this variable be needed for analysis? -If not, the variable should be dropped. Examples include respondent names, enumerator names, interview date, respondent phone number. -If the variable is needed for analysis, ask: can I encode or otherwise construct a variable to use for the analysis that masks the PII, and drop the original variable? +This significantly mitigates the risk in case there is a security breach +such as loss, theft, hacking, or a virus, with little impact on day-to-day utilization. + +To simplify workflow, it is best to remove PII variables from your data +at the earliest possible opportunity, and save a de-identified copy of the data. +Once the data is de-identified, it no longer needs to be encrypted +-- therefore you can interact with it directly without having to provide the keyfile. +We recommend de-identification in two stages: +an initial process to remove direct identifiers to create a working de-identified dataset, +and a final process to remove all possible identifiers to create a publishable dataset. +The \textbf{initial de-identification} should happen directly after the encrypted data is downloaded to disk. +At this time, for each variable that contains PII, ask: will this variable be needed for analysis? +If not, the variable should be dropped. 
+Examples include respondent names, enumerator names, interview dates, and respondent phone numbers. +If the variable is needed for analysis, ask: +can I encode or otherwise construct a variable to use for the analysis that masks the PII, +and drop the original variable? Examples include geocoordinates (after constructing measures of distance or area, drop the specific location), and names for social network analysis (can be encoded to unique numeric IDs). -If PII variables are directly required for the analysis itself, it will be necessary to keep at least a subset of the data encrypted through the data analysis process. - -Flagging all potentially identifying variables in the questionnaire design stage, as recommended above, simplifies the initial de-identification. -You already have the list of variables to assess, and ideally have already assessed those against the pre-analysis plan. -If so, all you need to do is write a script to drop the variables that are not required for analysis, encode or otherwise mask those that are required, and save a working version of the data. - -The \textbf{final de-identification} is a more involved process, with the objective of creating a dataset for publication that cannot be manipulated or linked to identify any individual research participant. -You must remove all direct and indirect identifiers, and assess the risk of statistical disclosure. -\sidenote{Disclosure risk: the likelihood that a released data record can be associated with an individual or organization}. +If PII variables are directly required for the analysis itself, +it will be necessary to keep at least a subset of the data encrypted through the data analysis process. + +Flagging all potentially identifying variables in the questionnaire design stage, +as recommended above, simplifies the initial de-identification. +You already have the list of variables to assess, +and ideally have already assessed those against the analysis plan. +If so, all you need to do is write a script to drop the variables that are not required for analysis, + encode or otherwise mask those that are required, and save a working version of the data. + +The \textbf{final de-identification} is a more involved process, +with the objective of creating a dataset for publication +that cannot be manipulated or linked to identify any individual research participant. +You must remove all direct and indirect identifiers, and assess the risk of statistical disclosure.\sidenote{ + \textbf{Disclosure risk:} the likelihood that a released data record can be associated with an individual or organization.} \index{statistical disclosure} -There will almost always be a trade-off between accuracy and privacy. For publicly disclosed data, you should favor privacy. -There are a number of useful tools for de-identification: PII scanners for Stata -\sidenote{\url{https://github.com/J-PAL/stata_PII_scan}} or R -\sidenote{\url{https://github.com/J-PAL/PII-Scan}}, -and tools for statistical disclosure control. -\sidenote{\url{https://sdcpractice.readthedocs.io/en/latest/}} -In cases where PII data is required for analysis, we recommend embargoing the sensitive variables when publishing the data. +There will almost always be a trade-off between accuracy and privacy. +For publicly disclosed data, you should favor privacy. 
+There are a number of useful tools for de-identification: PII scanners for Stata\sidenote{ + \url{https://github.com/J-PAL/stata_PII_scan}} +or R\sidenote{ + \url{https://github.com/J-PAL/PII-Scan}}, +and tools for statistical disclosure control.\sidenote{ + \url{https://sdcpractice.readthedocs.io/en/latest/}} +In cases where PII data is required for analysis, +we recommend embargoing the sensitive variables when publishing the data. Access to the embargoed data could be granted for the purposes of study replication, if approved by an IRB. When all data collection is complete, the survey team should prepare a final field report, which should report reasons for any deviations between the original sample and the dataset collected. -Identification and reporting of \textbf{missing data} and \textbf{attrition} is critical to the interpretation of survey data. -It is important to structure this reporting in a way that not only groups broad rationales into specific categories -but also collects all the detailed, open-ended responses to questions the field team can provide for any observations that they were unable to complete. +Identification and reporting of \textbf{missing data} and \textbf{attrition} +is critical to the interpretation of survey data. +It is important to structure this reporting in a way that not only +groups broad rationales into specific categories +but also collects all the detailed, open-ended responses +to questions the field team can provide for any observations that they were unable to complete. This reporting should be validated and saved alongside the final raw data, and treated the same way. -This information should be stored as a dataset in its own right -- a \textbf{tracking dataset} -- that records all events in which survey substitutions +This information should be stored as a dataset in its own right +-- a \textbf{tracking dataset} -- that records all events in which survey substitutions and loss to follow-up occurred in the field and how they were implemented and resolved. -With the raw data securely stored and backed up, and a de-identified dataset to work with, you are ready to move to data cleaning, and analysis. +With the raw data securely stored and backed up, +and a de-identified dataset to work with, you are ready to move to data cleaning and analysis. %------------------------------------------------ From 272a517d859b7a1010ba8712beefc0050940bcdb Mon Sep 17 00:00:00 2001 From: Maria Date: Fri, 7 Feb 2020 15:41:27 -0500 Subject: [PATCH 543/854] Intro re-write fixed formatting errors --- chapters/introduction.tex | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index be7ffdbab..544357d04 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -2,10 +2,8 @@ Welcome to Data for Development Impact. This book is intended to teach all users of development data how to handle data effectively, efficiently, and ethically. - - -An [empirical revolution]{\cite{angrist2017economic}} -has changed the face of research economics rapidly over the last decade. +An empirical revolution has changed the face of research economics rapidly over the last decade. +%had to remove cite {\cite{angrist2017economic}} because of full page width Economics graduate students of the 2000s expected to work with primarily "clean" data from secondhand sources. 
Today, especially in the development subfield, working with raw data- whether collected through surveys or acquired from 'big' data sources like sensors, satellites, or call data records- @@ -45,6 +43,7 @@ \section{Doing credible research at scale} \url{http://www.worldbank.org/en/research/dime}} at the World Bank's \textbf{Development Economics (DEC) Vice Presidency}.\sidenote{ \url{https://www.worldbank.org/en/about/unit/unit-dec}} + DIME generates high-quality and operationally relevant data and research to transform development policy, help reduce extreme poverty, and secure shared prosperity. It develops customized data and evidence ecosystems to produce actionable information @@ -62,6 +61,7 @@ \section{Doing credible research at scale} has developed while supporting DIME's global impact evaluation portfolio. The \textbf{DIME Wiki} is one of our flagship products, a free online collection of our resources and best practices.\sidenote{ \url{http://dimewiki.worldbank.org/}} + This book complements the DIME Wiki by providing a structure narrative of the data workflow for a typical research project. We will not give a lot of highly specific details in this text, but we will point you to where they can be found.\sidenote{Like this: From 4c0210ebf9b5ff4a088d888ef60e7fa26649130d Mon Sep 17 00:00:00 2001 From: Maria Date: Fri, 7 Feb 2020 15:42:29 -0500 Subject: [PATCH 544/854] Intro re-write solves issue #138 --- chapters/introduction.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index 544357d04..8c50bbfc5 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -91,7 +91,7 @@ \section{Outline of this book} \textbf{principal investigators (PIs)} who are responsible for the overall design and stewardship of the study; \textbf{field coordinators (FCs)} who are responsible for -the operation of the project on the ground; +the implementation of the study on the ground; and \textbf{research assistants (RAs)} who are responsible for handling raw data processing and analytical tasks. From 4d0b3dd78c70fc288dbc7ecafb5bb27c505dec81 Mon Sep 17 00:00:00 2001 From: Maria Date: Fri, 7 Feb 2020 15:49:42 -0500 Subject: [PATCH 545/854] Intro re-write Small edits --- chapters/introduction.tex | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index 8c50bbfc5..0b2b30213 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -62,7 +62,7 @@ \section{Doing credible research at scale} The \textbf{DIME Wiki} is one of our flagship products, a free online collection of our resources and best practices.\sidenote{ \url{http://dimewiki.worldbank.org/}} -This book complements the DIME Wiki by providing a structure narrative of the data workflow for a typical research project. +This book complements the DIME Wiki by providing a structured narrative of the data workflow for a typical research project. We will not give a lot of highly specific details in this text, but we will point you to where they can be found.\sidenote{Like this: \url{https://dimewiki.worldbank.org/wiki/Primary_Data_Collection}} @@ -151,13 +151,15 @@ \section{Writing reproducible code in a collaborative environment} \codeexample{code.do}{./code/code.do} -We ensure that each code block runs independently, is well-formatted, and uses built-in functions as much as possible. 
+We ensure that each code block runs independently, is well-formatted, +and uses built-in functions as much as possible. We will point to user-written functions when they provide important tools. In particular, we point to two suites of Stata commands developed by DIME Analytics, -\texttt{ietoolkit}\sidenote{\url{https://dimewiki.worldbank.org/wiki/ietoolkit}} and \texttt{iefieldkit},\sidenote{\url{https://dimewiki.worldbank.org/wiki/iefieldkit}}, +\texttt{ietoolkit}\sidenote{\url{https://dimewiki.worldbank.org/wiki/ietoolkit}} and +\texttt{iefieldkit},\sidenote{\url{https://dimewiki.worldbank.org/wiki/iefieldkit}}, which standardize our core data collection workflows. We do not explain Stata commands unless the behavior we are exploiting -is outside the usual expectation of its functionality; +is outside of the usual expectation of its functionality; we will comment the code generously (as you should), but you should reference Stata help-files \texttt{h [command]} whenever you do not understand the functionality that is being used. From 002c42e500037fcd5b75500670e1057943d49a7b Mon Sep 17 00:00:00 2001 From: Maria Date: Fri, 7 Feb 2020 15:55:02 -0500 Subject: [PATCH 546/854] Intro re-write fixed quotation marks --- chapters/introduction.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index 0b2b30213..ecd3a3c49 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -4,9 +4,9 @@ how to handle data effectively, efficiently, and ethically. An empirical revolution has changed the face of research economics rapidly over the last decade. %had to remove cite {\cite{angrist2017economic}} because of full page width -Economics graduate students of the 2000s expected to work with primarily "clean" data from secondhand sources. +Economics graduate students of the 2000s expected to work with primarily ``clean'' data from secondhand sources. Today, especially in the development subfield, working with raw data- -whether collected through surveys or acquired from 'big' data sources like sensors, satellites, or call data records- +whether collected through surveys or acquired from `big' data sources like sensors, satellites, or call data records- is a key skill for researchers and their staff. At the same time, the scope and scale of empirical research projects is expanding: more people are working on the same data over longer timeframes. From 9a2ab39414ac78ebd38b697b56796489d8d9534f Mon Sep 17 00:00:00 2001 From: Maria Date: Fri, 7 Feb 2020 16:06:02 -0500 Subject: [PATCH 547/854] Intro re-write solves issue #105 --- chapters/introduction.tex | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index 0b2b30213..27d4af3af 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -158,11 +158,10 @@ \section{Writing reproducible code in a collaborative environment} \texttt{ietoolkit}\sidenote{\url{https://dimewiki.worldbank.org/wiki/ietoolkit}} and \texttt{iefieldkit},\sidenote{\url{https://dimewiki.worldbank.org/wiki/iefieldkit}}, which standardize our core data collection workflows. -We do not explain Stata commands unless the behavior we are exploiting -is outside of the usual expectation of its functionality; -we will comment the code generously (as you should), +We will not explain Stata commands unless the command is rarely used or the feature we are using is outside common use case of that command. 
+We will comment the code generously (as you should), but you should reference Stata help-files \texttt{h [command]} -whenever you do not understand the functionality that is being used. +whenever you do not understand the command that is being used. We hope that these snippets will provide a foundation for your code style. Providing some standardization to Stata code style is also a goal of this team, we provide our guidance on this in the DIME Analytics Stata Style Guide Appendix. From 65b79dd72ecb49418e077bb916815fabb48a0e9a Mon Sep 17 00:00:00 2001 From: Maria Date: Fri, 7 Feb 2020 16:24:14 -0500 Subject: [PATCH 548/854] Conclusion: MRJ inputs Some additions and edits, mostly minor. --- chapters/conclusion.tex | 37 +++++++++++++++++++++---------------- 1 file changed, 21 insertions(+), 16 deletions(-) diff --git a/chapters/conclusion.tex b/chapters/conclusion.tex index 5f95249ad..281483d1a 100644 --- a/chapters/conclusion.tex +++ b/chapters/conclusion.tex @@ -1,5 +1,6 @@ We hope you have enjoyed \textit{Data for Development Impact: The DIME Analytics Resource Guide}. -It lays out a complete vision of the tasks of a modern researcher, +Our aim was to teach you to handle data more efficiently, effectively, and ethically. +We laid out a complete vision of the tasks of a modern researcher, from planning a project's data governance to publishing code and data to accompany a research product. We have tried to set the text up as a resource guide @@ -7,34 +8,38 @@ as your work requires you to become progressively more familiar with each of the topics included in the guide. -We motivated the guide with a discussion of research as a public service: +We started the book with a discussion of research as a public service: one that requires you to be accountable to both research participants and research consumers. We then discussed the current research environment, -which requires you to cooperate with a diverse group of collaborators +which necessitates cooperation with a diverse group of collaborators using modern approaches to computing technology. -We outlined common research methods in impact evaluation -that motivate how field and data work is structured. -We discussed how to ensure that evaluation work is well-designed -and able to accomplish its goals. +We outlined common research methods in impact evaluation, +with an eye toward structuring data work. +We discussed how to implement reproducible routines for sampling and randomization, +and to analyze statistical power and use randomization inference. We discussed the collection of primary data and methods of analysis using statistical software, as well as tools and practices for making this work publicly accessible. +Throughout, we emphasized that data work is a ``social process'', +involving multiple team members with different roles and technical abilities. This mindset and workflow, from top to bottom, -should outline the tasks and responsibilities -that make up a researcher's role as a truth-seeker and truth-teller. +outline the tasks and responsibilities +that are fundamental to doing credible research. -But as you probably noticed, the text itself only provides what we think is +However, as you probably noticed, the text itself provides just enough detail to get you started: an understanding of the purpose and function of each of the core research steps. -The references and resources get into the complete details -of how you will realistically implement these tasks. 
From the DIME Wiki pages that detail the specific code practices
-and field procedures that our team uses,
+The references and resources get into the details
+of how you will realistically implement these tasks:
+from the DIME Wiki pages that detail specific code conventions
+and field procedures that our team considers best practices,
 to the theoretical papers that will help you figure out
-how to handle the unique cases you will undoubtedly encounter,
-we hope you will keep the book on your desk
+how to handle the unique cases you will undoubtedly encounter.
+We hope you will keep the book on your desk
 (or the PDF on your desktop)
 and come back to it anytime you need more information.
 We wish you all the best in your work
-and will love to hear any input you have on ours!
+and will love to hear any input you have on ours!
+
+\url{https://github.com/worldbank/d4di}

From 927ee4317f601fe2994c8a1f46c14936f7da9509 Mon Sep 17 00:00:00 2001
From: Benjamin Daniels
Date: Fri, 7 Feb 2020 17:05:22 -0500
Subject: [PATCH 549/854] Intros for existing content

---
 chapters/data-collection.tex | 31 ++++++++++++++++++++++++++++---
 1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex
index e461b7fcf..7fb2a07bd 100644
--- a/chapters/data-collection.tex
+++ b/chapters/data-collection.tex
@@ -24,6 +24,20 @@ \subsection{Receiving data from development partners}
 %------------------------------------------------
 \section{Collecting primary data using electronic surveys}
 
+If you are collecting data directly from the research subjects yourself,
+you are most likely designing and fielding an electronic survey.
+These types of data collection technologies
+have greatly accelerated our ability to bring in high-quality data
+using purpose-built survey instruments,
+and therefore improved the precision of research.
+At the same time, electronic surveys create some pitfalls to avoid.
+Programming surveys efficiently requires a very different mindset
+than simply designing them in word processing software,
+and ensuring that they flow correctly and produce data
+that can be used in statistical software requires careful organization.
+This section will outline the major steps and technical considerations
+you will need to follow whenever you field a custom survey instrument.
+
 \subsection{Developing a survey instrument}
 
 A well-designed questionnaire results from careful planning,
@@ -249,9 +263,20 @@ \subsection{Programming electronic questionnaires}
 %------------------------------------------------
 \section{Data quality assurance and data security}
 
-Data quality assurance requires a combination of real-time data checks and back-checks or validation audits.
-Careful field supervision is also essential for a successful survey;
-however, we focus on the first two in this chapter, as they are the most directly data-related.
+Whether you are handling data from a partner or collecting it directly,
+it is important to make sure that data faithfully reflects ground realities.
+Data quality assurance requires a combination of real-time data checks
+and back-checks or validation audits, which often means tracking down
+the people whose information is in the dataset.
+However, since that data also likely contains sensitive or personal information,
+it is important to keep it safe throughout the entire process.
+All sensitive data must be handled in a way
+where there is no risk that anyone who is not approved by an IRB
+for the specific project has the ability to access the data.
+Data can be sensitive for multiple reasons, +but the most common reasons are that it contains personally identifiable information (PII) +or that the partner providing the data does not want it to be released. +This section will detail principles and practices for the verification and handling of these datasets. \subsection{Implementing high frequency quality checks} From a75258517bb3f4e05c26db325ab954959c5091a7 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Fri, 7 Feb 2020 17:29:14 -0500 Subject: [PATCH 550/854] [intro] use '' (two single ') instead of " (regular quotation mark) --- chapters/introduction.tex | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index 1fc1f4c95..d16c14855 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -104,9 +104,9 @@ \section{Writing reproducible code in a collaborative environment} rather it is part of the output itself: a means for communicating how something was done, in a world where the credibility and transparency of data cleaning and analysis is increasingly important. As this is fundamental to the remainder of the book's content, -we provide here a brief introduction to "good" code and standardized practices. +we provide here a brief introduction to ''good'' code and standardized practices. -"Good" code has two elements: +''Good'' code has two elements: - it is correct (doesn't produce any errors along the way) - it is useful and comprehensible to someone who hasn't seen it before (including the author three weeks later) Many researchers have been trained to code correctly. @@ -128,7 +128,7 @@ \section{Writing reproducible code in a collaborative environment} To accomplish that, you should think of code in terms of three major elements: \textbf{structure}, \textbf{syntax}, and \textbf{style}. -We always tell people to "code as if a stranger would read it" +We always tell people to ''code as if a stranger would read it'' (from tomorrow, that stranger will be you). The \textbf{structure} is the environment your code lives in: good structure means that it is easy to find individual pieces of code that correspond to tasks. From cce12b4b323e925a12a335c885bb3a44a76116dd Mon Sep 17 00:00:00 2001 From: Maria Date: Fri, 7 Feb 2020 17:29:51 -0500 Subject: [PATCH 551/854] Ch1 intro --- chapters/handling-data.tex | 65 ++++++++++++++++---------------------- 1 file changed, 28 insertions(+), 37 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 37ffc7f6c..075df38f4 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -1,57 +1,48 @@ %------------------------------------------------ \begin{fullwidth} + Development research does not just \textit{involve} real people -- it also \textit{affects} real people. Policy decisions are made every day using the results of briefs and studies, and these can have wide-reaching consequences on the lives of millions. - As the range and importance of the policy-relevant questions - asked by development researchers grow, - so does the (rightful) scrutiny under which methods and results are placed. - Additionally, research also involves looking deeply into real people's - personal lives, financial conditions, and other sensitive subjects. - The rights and responsibilities involved in having such access - to personal information are a core responsibility of collecting personal data. 
- Ethical scrutiny involves two major components: \textbf{data handling} and \textbf{research transparency}. - Performing at a high standard in both means that - consumers of research can have confidence in its conclusions, - and that research participants are appropriately protected. - What we call ethical standards in this chapter are a set of practices - for research quality and data management that address these two principles. - - Neither transparency nor privacy is an ``all-or-nothing'' objective. - We expect that teams will do as much as they can to make their work - conform to modern practices of credibility, transparency, and reproducibility. - Similarly, we expect that teams will ensure the privacy of participants in research - by intelligently assessing and proactively averting risks they might face. - We also expect teams will report what they have and have not done - in order to provide objective measures of a research product's performance in both. + As the range and importance of the policy-relevant questions asked by development researchers grow, + so too does the (rightful) scrutiny under which methods and results are placed. + It is useful to think of research as a public service, + one that requires you to be accountable to both research participants and research consumers. + + On the research participant side, it is essential to respect individual \textbf{privacy} and ensure \textbf{data security}. + Researchers look deeply into real people's personal lives, financial conditions, and other sensitive subjects. + Respecting the respondents' right to privacy, + by intelligently assessing and proactively averting risks they might face, + is a core tenet of research ethics. + + On the consumer side, it is important to protect confidence in development research by following modern practices for \textbf{transparency} and \textbf{reproducibility}. + Across the social sciences, the open science movement has been fueled by discoveries of low-quality research practices, + data and code that are inaccessible to the public, analytical errors in major research papers, + and in some cases even outright fraud. While the development research community has not yet + experienced any major scandals, it has become clear that there are necessary incremental improvements + in the way that code and data are handled as part of research. + + Neither privacy nor transparency is an ``all-or-nothing'' objective. + Most important is to report the transparency and privacy measures taken. Otherwise, reputation is the primary signal for the quality of evidence, and two failures may occur: low-quality studies from reputable sources may be used as evidence when in fact they don't warrant it, and high-quality studies from sources without an international reputation may be ignored. Both these outcomes reduce the quality of evidence overall. - Even more importantly, the only way to determine credibility without transparency - is to judge research solely based on where it is done and by whom, - which concentrates credibility at better-known international institutions and global universities, - at the expense of quality research done by people and organizations directly involved in and affected by it. Simple transparency standards mean that it is easier to judge research quality, - and making high-quality research identifiable also increases its impact. - This section provides some basic guidelines and resources - for using field data ethically and responsibly to publish research findings. 
+ and identifying high-quality research increases its impact. + + In this chapter, we outline a set of practices that help to ensure + research participants are appropriately protected and + research consumers can be confident in the conclusions reached. + \end{fullwidth} %------------------------------------------------ \section{Protecting confidence in development research} -Across the social sciences, the open science movement -has been fueled by discoveries of low-quality research practices, -data and code that are inaccessible to the public, -analytical errors in major research papers, -and in some cases even outright fraud. -While the development research community has not yet -experienced any major scandals, -it has become clear that there are necessary incremental improvements -in the way that code and data are handled as part of research. + Major publishers and funders, most notably the American Economic Association, have taken steps to require that these research components are accurately reported and preserved as outputs in themselves.\sidenote{ From eddf1e16fd2b42d114e48b8314a3de2dcac57127 Mon Sep 17 00:00:00 2001 From: Maria Date: Fri, 7 Feb 2020 17:30:47 -0500 Subject: [PATCH 552/854] Ch1 intro minor formatting fix --- chapters/handling-data.tex | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 075df38f4..b7aa5460a 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -10,13 +10,15 @@ It is useful to think of research as a public service, one that requires you to be accountable to both research participants and research consumers. - On the research participant side, it is essential to respect individual \textbf{privacy} and ensure \textbf{data security}. + On the research participant side, + it is essential to respect individual \textbf{privacy} and ensure \textbf{data security}. Researchers look deeply into real people's personal lives, financial conditions, and other sensitive subjects. Respecting the respondents' right to privacy, by intelligently assessing and proactively averting risks they might face, is a core tenet of research ethics. - On the consumer side, it is important to protect confidence in development research by following modern practices for \textbf{transparency} and \textbf{reproducibility}. + On the consumer side, it is important to protect confidence in development research + by following modern practices for \textbf{transparency} and \textbf{reproducibility}. Across the social sciences, the open science movement has been fueled by discoveries of low-quality research practices, data and code that are inaccessible to the public, analytical errors in major research papers, and in some cases even outright fraud. 
While the development research community has not yet From 0ba1012ea1faa967a7581844b586e2df24ab7bfb Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Fri, 7 Feb 2020 17:37:03 -0500 Subject: [PATCH 553/854] [intro] properly fix quotation signs --- chapters/introduction.tex | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index d16c14855..a58f87181 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -104,9 +104,9 @@ \section{Writing reproducible code in a collaborative environment} rather it is part of the output itself: a means for communicating how something was done, in a world where the credibility and transparency of data cleaning and analysis is increasingly important. As this is fundamental to the remainder of the book's content, -we provide here a brief introduction to ''good'' code and standardized practices. +we provide here a brief introduction to ``good'' code and standardized practices. -''Good'' code has two elements: +``Good'' code has two elements: - it is correct (doesn't produce any errors along the way) - it is useful and comprehensible to someone who hasn't seen it before (including the author three weeks later) Many researchers have been trained to code correctly. @@ -128,7 +128,7 @@ \section{Writing reproducible code in a collaborative environment} To accomplish that, you should think of code in terms of three major elements: \textbf{structure}, \textbf{syntax}, and \textbf{style}. -We always tell people to ''code as if a stranger would read it'' +We always tell people to ``code as if a stranger would read it'' (from tomorrow, that stranger will be you). The \textbf{structure} is the environment your code lives in: good structure means that it is easy to find individual pieces of code that correspond to tasks. From 70c6e02b76f9718ad170345ba2d4902c72857a3f Mon Sep 17 00:00:00 2001 From: Maria Date: Fri, 7 Feb 2020 17:37:12 -0500 Subject: [PATCH 554/854] Ch2 intro --- chapters/planning-data-work.tex | 39 ++++++++++++++------------------- 1 file changed, 17 insertions(+), 22 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 80d67d2a2..982732608 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -1,37 +1,32 @@ % ---------------------------------------------------------------------------------------------- \begin{fullwidth} -Preparation for collaborative data work begins long before you collect any data, -and involves planning both the software tools you will use yourself +Preparation for collaborative data work begins long before you acquire any data, +and involves planning both the software tools you will use and the collaboration platforms and processes for your team. In order to be prepared to work on the data you receive with a group, -you need to plan out the structure of your workflow in advance. +you need to structure your workflow in advance. This means knowing which data sets and output you need at the end of the process, how they will stay organized, what types of data you'll handle, and whether the data will require special handling due to size or privacy considerations. -Identifying these details should help you map out the data needs for your project, -giving you and your team a sense of how information resources should be organized. -It's okay to update this map once the project is underway -- -the point is that everyone knows -- at any given time -- what the plan is. 
+Identifying these details will help you map out the data needs for your project,
+and give you a sense of how information resources should be organized.
+It's okay to update this data map once the project is underway.
+The point is that everyone knows -- at any given time -- what the plan is.
 
-To implement this plan, you will need to prepare collaborative tools and workflows.
+To do data work effectively in a team environment,
+you will need to prepare collaborative tools and workflows.
 Changing software or protocols halfway through a project can be costly and time-consuming,
-so it's important to think ahead about decisions that may seem of little consequence.
-For example, things as simple as sharing services, folder structures, and filenames
-can be extremely painful to alter down the line in any project.
+so it's important to plan ahead.
+Seemingly small decisions such as sharing services, folder structures,
+and filenames can be extremely painful to alter down the line in any project.
 Similarly, making sure to set up a self-documenting discussion platform
-and version control processes
-makes working together on outputs much easier from the very first discussion.
+and a process for version control
+makes working together on outputs much easier from the very first discussion.
+
 This chapter will discuss some tools and processes that
-will help prepare you for collaboration and replication.
-We will provide free, open-source, and platform-agnostic tools wherever possible,
-and point to more detailed instructions when relevant.
-(Stata is the notable exception here due to its current popularity in the field.)
-Most have a learning and adaptation process,
-meaning you will become most comfortable with each tool
-only by using it in real-world work.
-Get to know them well early on,
-so that you do not spend a lot of time learning through trial and error.
+will prepare you to collaborate in a reproducible and transparent manner.
 
 \end{fullwidth}
 
 % ----------------------------------------------------------------------------------------------

From 7d07f968e5183242745ad1e2c0f129698e28311d Mon Sep 17 00:00:00 2001
From: kbjarkefur
Date: Fri, 7 Feb 2020 17:40:47 -0500
Subject: [PATCH 555/854] [intro] consistent quotation sign usage

---
 chapters/introduction.tex | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/introduction.tex b/chapters/introduction.tex
index a58f87181..9a600dbfd 100644
--- a/chapters/introduction.tex
+++ b/chapters/introduction.tex
@@ -6,7 +6,7 @@
 %had to remove cite {\cite{angrist2017economic}} because of full page width
 Economics graduate students of the 2000s expected to work with primarily ``clean'' data from secondhand sources.
 Today, especially in the development subfield, working with raw data-
-whether collected through surveys or acquired from `big' data sources like sensors, satellites, or call data records-
+whether collected through surveys or acquired from ``big'' data sources like sensors, satellites, or call data records-
 is a key skill for researchers and their staff.
 At the same time, the scope and scale of empirical research projects is expanding:
 more people are working on the same data over longer timeframes.

From 6b5e58ff4503c0d84e851241a468f7b294e8918a Mon Sep 17 00:00:00 2001
From: Maria
Date: Fri, 7 Feb 2020 17:41:00 -0500
Subject: [PATCH 556/854] Intro re-write

Added stata blurb that was previously in chapter 2 intro.
to make that work, i switched the order of 'adopting reproducible workflows' and 'writing reproducible code..' --- chapters/introduction.tex | 34 ++++++++++++++++++++++------------ 1 file changed, 22 insertions(+), 12 deletions(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index d16c14855..2b5480339 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -96,6 +96,25 @@ \section{Outline of this book} handling raw data processing and analytical tasks. +\section{Adopting reproducible workflows} +We will provide free, open-source, and platform-agnostic tools wherever possible, +and point to more detailed instructions when relevant. +Stata is the notable exception here due to its current popularity in economics. +Most tools have a learning and adaptation process, +meaning you will become most comfortable with each tool +only by using it in real-world work. +Get to know them well early on, +so that you do not spend a lot of time learning through trial and error. + +While adopting the workflows and mindsets described in this book requires an up-front cost, +it will save you (and your collaborators) a lot of time and hassle very quickly. +In part this is because you will learn how to implement essential practices directly; +in part because you will find tools for the more advanced practices; +and most importantly because you will acquire the mindset of doing research with a high-quality data focus. +We hope you will find this book helpful for accomplishing all of the above, +and that mastery of data helps you make an impact! + + \section{Writing reproducible code in a collaborative environment} Throughout the book, we refer to the importance of good coding practices. These are the foundation of reproducible and credible data work, @@ -145,8 +164,7 @@ \section{Writing reproducible code in a collaborative environment} For some implementation portions where precise code is particularly important, we will provide minimal code examples either in the book or on the DIME Wiki. -All code guidance is software-agnostic, but code examples are provided in Stata -(we offer analogous examples in R as much as possible). +All code guidance is software-agnostic, but code examples are provided in Stata. In the book, code examples will be presented like the following: \codeexample{code.do}{./code/code.do} @@ -158,7 +176,8 @@ \section{Writing reproducible code in a collaborative environment} \texttt{ietoolkit}\sidenote{\url{https://dimewiki.worldbank.org/wiki/ietoolkit}} and \texttt{iefieldkit},\sidenote{\url{https://dimewiki.worldbank.org/wiki/iefieldkit}}, which standardize our core data collection workflows. -We will not explain Stata commands unless the command is rarely used or the feature we are using is outside common use case of that command. +We will not explain Stata commands unless the command is rarely used +or the feature we are using is outside common use case of that command. We will comment the code generously (as you should), but you should reference Stata help-files \texttt{h [command]} whenever you do not understand the command that is being used. @@ -166,15 +185,6 @@ \section{Writing reproducible code in a collaborative environment} Providing some standardization to Stata code style is also a goal of this team, we provide our guidance on this in the DIME Analytics Stata Style Guide Appendix. 
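To illustrate what this looks like in practice, the short block below is written in the commented, self-contained style this section recommends. It is not taken from the book's own code examples -- it uses only the built-in auto dataset that ships with Stata -- but it shows the kind of commenting and structure we have in mind.

    * Load the example dataset that ships with Stata, so this block runs on its own
    sysuse auto, clear

    * Flag fuel-efficient cars, defined here as above-median miles per gallon
    summarize mpg, detail
    generate byte fuel_efficient = (mpg > r(p50)) if !missing(mpg)
    label variable fuel_efficient "Above-median miles per gallon"

    * Check the result; see 'help tabulate' if any option used here is unfamiliar
    tabulate fuel_efficient, missing

Each step states what it does and why, so a reader who has never seen the block before can follow it without guessing.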
-\section{Adopting reproducible workflows} -While adopting the workflows and mindsets described in this book requires an up-front cost, -it will save you (and your collaborators) a lot of time and hassle very quickly. -In part this is because you will learn how to implement essential practices directly; -in part because you will find tools for the more advanced practices; -and most importantly because you will acquire the mindset of doing research with a high-quality data focus. -We hope you will find this book helpful for accomplishing all of the above, -and that mastery of data helps you make an impact! -\textbf{-- The DIME Analytics Team} \mainmatter From 66e59e0b09422f5de018fdb1a5cfedf1d995c34e Mon Sep 17 00:00:00 2001 From: Maria Date: Fri, 7 Feb 2020 17:45:32 -0500 Subject: [PATCH 557/854] Intro - changed wording of stranger --- chapters/introduction.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index 99aade995..f2a3c7ba3 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -148,7 +148,7 @@ \section{Writing reproducible code in a collaborative environment} To accomplish that, you should think of code in terms of three major elements: \textbf{structure}, \textbf{syntax}, and \textbf{style}. We always tell people to ``code as if a stranger would read it'' -(from tomorrow, that stranger will be you). +(from tomorrow, that stranger could be you!). The \textbf{structure} is the environment your code lives in: good structure means that it is easy to find individual pieces of code that correspond to tasks. Good structure also means that functional blocks are sufficiently independent from each other From 266ab5e03d086737554039a3ffeddacb347a29fb Mon Sep 17 00:00:00 2001 From: Maria Date: Fri, 7 Feb 2020 17:48:06 -0500 Subject: [PATCH 558/854] Intro - code snippets edited language as not all chapters have snippets --- chapters/introduction.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index f2a3c7ba3..1f1b17aad 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -27,7 +27,7 @@ The central premise of this book is that data work is a ``social process'', in which many people need to have the same idea about what is to be done, and when and where and by whom, so that they can collaborate effectively on large, long-term research projects. -It aims to be a highly practical resource: each chapter offers code snippets, links to checklists and other practical tools, +It aims to be a highly practical resource: we provide code snippets, links to checklists and other practical tools, and references to primary resources that allow the reader to immediately put recommended processes into practice. From eb7d6e8d64082711d7fc04d2288f4ba542c4f9d1 Mon Sep 17 00:00:00 2001 From: Maria Date: Fri, 7 Feb 2020 17:48:52 -0500 Subject: [PATCH 559/854] Update chapters/introduction.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- chapters/introduction.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index f2a3c7ba3..2cd2d0bbe 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -54,7 +54,7 @@ \section{Doing credible research at scale} DIME also provides advisory services to 30 multilateral and bilateral development agencies. 
Finally, DIME invests in public goods to improve the quality and reproducibility of development research around the world. -DIME Analytics was created take advantage of the concentration and scale of research at DIME to develop and test solutions, +DIME Analytics was created to take advantage of the concentration and scale of research at DIME to develop and test solutions, to ensure high quality of data collection and research across the DIME portfolio, and to make public training and tools available to the larger community of development researchers. Data for Development Impact compiles the ideas, best practices and software tools Analytics From 96388c71d9f1957eb876ced9b32e9462f62fb743 Mon Sep 17 00:00:00 2001 From: Maria Date: Fri, 7 Feb 2020 17:49:10 -0500 Subject: [PATCH 560/854] Update chapters/introduction.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- chapters/introduction.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index 2cd2d0bbe..9b2cf6eed 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -39,7 +39,7 @@ \section{Doing credible research at scale} The team responsible for this book is known as \textbf{DIME Analytics}.\sidenote{ \url{http://www.worldbank.org/en/research/dime/data-and-analytics}} -The DIME Analytics team works within the \textbf{Development Impact Evaluation (DIME)} Department \sidenote{ +The DIME Analytics team works within the \textbf{Development Impact Evaluation (DIME)} Department\sidenote{ \url{http://www.worldbank.org/en/research/dime}} at the World Bank's \textbf{Development Economics (DEC) Vice Presidency}.\sidenote{ \url{https://www.worldbank.org/en/about/unit/unit-dec}} From 8d9fc7693adb84326f69634ccd7e76ff2f8de9f1 Mon Sep 17 00:00:00 2001 From: Maria Date: Fri, 7 Feb 2020 17:52:41 -0500 Subject: [PATCH 561/854] Update chapters/introduction.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- chapters/introduction.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index b412f8495..627884e85 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -1,5 +1,5 @@ \begin{fullwidth} -Welcome to Data for Development Impact. +Welcome to \textit{Data for Development Impact}. This book is intended to teach all users of development data how to handle data effectively, efficiently, and ethically. An empirical revolution has changed the face of research economics rapidly over the last decade. From 56aa7e4b4b5ce90ad2e23338299fcdcc0afacbaa Mon Sep 17 00:00:00 2001 From: Maria Date: Fri, 7 Feb 2020 17:57:11 -0500 Subject: [PATCH 562/854] Ch2 intro --- chapters/planning-data-work.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 982732608..408743050 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -24,8 +24,8 @@ and process for version control; this makes working together on outputs much easier from the very first discussion. -This chapter will discuss some tools and processes that -will prepare you to collaborate in a reproducible and transparent manner. 
+This chapter will guide you on preparing a collaborative work environment, +and structuring your data work to be well-organized and clearly documented. \end{fullwidth} From bfcfa0ad19e8c992d0e753a79f1a516fb8bfe8fe Mon Sep 17 00:00:00 2001 From: Luiza Date: Fri, 7 Feb 2020 17:57:35 -0500 Subject: [PATCH 563/854] [conclusion] update link to book landing page --- chapters/conclusion.tex | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/chapters/conclusion.tex b/chapters/conclusion.tex index 281483d1a..a11b01949 100644 --- a/chapters/conclusion.tex +++ b/chapters/conclusion.tex @@ -40,6 +40,5 @@ (or the PDF on your desktop) and come back to it anytime you need more information. We wish you all the best in your work -and will love to hear any input you have on ours! - -\url{https://github.com/worldbank/d4di} +and will love to hear any input you have on ours!\sidenote{ +You can share your comments and suggestion on this book through \url{https://worldbank.github.io/d4di/}.} From 1731a78a679dec159857ddd7561eda8bc6b906a4 Mon Sep 17 00:00:00 2001 From: Maria Date: Fri, 7 Feb 2020 18:10:59 -0500 Subject: [PATCH 564/854] Ch2 intro --- chapters/research-design.tex | 29 ++++++++++++++++------------- 1 file changed, 16 insertions(+), 13 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 329eef0f6..34d2a7757 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -3,23 +3,20 @@ \begin{fullwidth} Research design is the process of defining the methods and data that will be used to answer a specific research question. -You don't need to be an expert in this, -and there are lots of good resources out there -that focus on designing interventions and evaluations -as well as on econometric approaches. -Therefore, without going into technical detail, -this section will present a brief overview -of the most common methods that are used in development research, -particularly those that are widespread in program evaluation. -These ``causal inference'' methods will turn up in nearly every project, -so you will need to have a broad knowledge of how the methods in your project -are used in order to manage data and code appropriately. +You don't need to be an expert in research design to do effective data work, +but it is essential that you understand the design of the study you are working on, +and how the design affects data work. +Without going into too much technical detail, +as there are many excellent resources on impact evaluation design, +this section presents a brief overview +of the most common ``causal inference'' methods, +focusing on implications for data structure and analysis. The intent of this chapter is for you to obtain an understanding of the way in which each method constructs treatment and control groups, the data structures needed to estimate the corresponding effects, -and some available code tools designed for each method (the list, of course, is not exhaustive). +and specific code tools designed for each method (the list, of course, is not exhaustive). -Thinking through your design before starting data work is important for several reasons. +Thinking through research design before starting data work is important for several reasons. If you do not know how to calculate the correct estimator for your study, you will not be able to assess the statistical power of your research design. You will also be unable to make decisions in the field @@ -36,6 +33,12 @@ in response to an unexpected event. 
Intuitive knowledge of your project's chosen approach will make you much more effective at the analytical part of your work. + +This chapter first covers causal inference methods. +Next we discuss how to measure treatment effects and structure data for specific methods, +including: cross-sectional randomized control trials, difference-in-difference designs, +regression discontinuity, instrumental variables, matching, and synthetic controls. + \end{fullwidth} %----------------------------------------------------------------------------------------------- From d60a41c45a52012b4c668b22ab427f6e1bf0a9e8 Mon Sep 17 00:00:00 2001 From: Maria Date: Fri, 7 Feb 2020 18:11:26 -0500 Subject: [PATCH 565/854] Revert "Ch2 intro" This reverts commit 1731a78a679dec159857ddd7561eda8bc6b906a4. --- chapters/research-design.tex | 29 +++++++++++++---------------- 1 file changed, 13 insertions(+), 16 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 34d2a7757..329eef0f6 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -3,20 +3,23 @@ \begin{fullwidth} Research design is the process of defining the methods and data that will be used to answer a specific research question. -You don't need to be an expert in research design to do effective data work, -but it is essential that you understand the design of the study you are working on, -and how the design affects data work. -Without going into too much technical detail, -as there are many excellent resources on impact evaluation design, -this section presents a brief overview -of the most common ``causal inference'' methods, -focusing on implications for data structure and analysis. +You don't need to be an expert in this, +and there are lots of good resources out there +that focus on designing interventions and evaluations +as well as on econometric approaches. +Therefore, without going into technical detail, +this section will present a brief overview +of the most common methods that are used in development research, +particularly those that are widespread in program evaluation. +These ``causal inference'' methods will turn up in nearly every project, +so you will need to have a broad knowledge of how the methods in your project +are used in order to manage data and code appropriately. The intent of this chapter is for you to obtain an understanding of the way in which each method constructs treatment and control groups, the data structures needed to estimate the corresponding effects, -and specific code tools designed for each method (the list, of course, is not exhaustive). +and some available code tools designed for each method (the list, of course, is not exhaustive). -Thinking through research design before starting data work is important for several reasons. +Thinking through your design before starting data work is important for several reasons. If you do not know how to calculate the correct estimator for your study, you will not be able to assess the statistical power of your research design. You will also be unable to make decisions in the field @@ -33,12 +36,6 @@ in response to an unexpected event. Intuitive knowledge of your project's chosen approach will make you much more effective at the analytical part of your work. - -This chapter first covers causal inference methods. 
-Next we discuss how to measure treatment effects and structure data for specific methods, -including: cross-sectional randomized control trials, difference-in-difference designs, -regression discontinuity, instrumental variables, matching, and synthetic controls. - \end{fullwidth} %----------------------------------------------------------------------------------------------- From 74f0e2c1216c341fc1e26c477fc13c42a130dafc Mon Sep 17 00:00:00 2001 From: Maria Date: Fri, 7 Feb 2020 18:17:17 -0500 Subject: [PATCH 566/854] Update chapters/introduction.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- chapters/introduction.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index 627884e85..253f4d2e2 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -57,7 +57,7 @@ \section{Doing credible research at scale} DIME Analytics was created to take advantage of the concentration and scale of research at DIME to develop and test solutions, to ensure high quality of data collection and research across the DIME portfolio, and to make public training and tools available to the larger community of development researchers. -Data for Development Impact compiles the ideas, best practices and software tools Analytics +\textit{Data for Development Impact} compiles the ideas, best practices and software tools Analytics has developed while supporting DIME's global impact evaluation portfolio. The \textbf{DIME Wiki} is one of our flagship products, a free online collection of our resources and best practices.\sidenote{ \url{http://dimewiki.worldbank.org/}} From 4c09c9efaa4807e4550ccee8569c5034e0b94bd7 Mon Sep 17 00:00:00 2001 From: Maria Date: Fri, 7 Feb 2020 18:17:55 -0500 Subject: [PATCH 567/854] Update chapters/introduction.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- chapters/introduction.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index 253f4d2e2..82c159ace 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -76,7 +76,7 @@ \section{Outline of this book} The book progresses through the typical workflow of an empirical research project. We start with ethical principles to guide empirical research, focusing on research transparency and the right to privacy. -The second chapter discusses the importance of planning data work at the outset of the research project- +The second chapter discusses the importance of planning data work at the outset of the research project - long before any data is acquired - and provide suggestions for collaborative workflows and tools. Next, we turn to common research designs for \textbf{causal inference}{\sidenote{causal inference: identifying the change in outcome \textit{caused} by a particular intervention}}, and consider their implications for data structure. 
The fourth chapter covers how to implement sampling and randomization to ensure research credibility, From 2eb7f8c8748a19fc1b65741a2994c60ab7099b49 Mon Sep 17 00:00:00 2001 From: Maria Date: Fri, 7 Feb 2020 18:18:20 -0500 Subject: [PATCH 568/854] Update chapters/introduction.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- chapters/introduction.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index 82c159ace..fb1dd4eb7 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -129,7 +129,7 @@ \section{Writing reproducible code in a collaborative environment} - it is correct (doesn't produce any errors along the way) - it is useful and comprehensible to someone who hasn't seen it before (including the author three weeks later) Many researchers have been trained to code correctly. -However, when your code runs on your computer and you get the correct results, you are only half-done writing \underline{good} code. +However, when your code runs on your computer and you get the correct results, you are only half-done writing \textit{good} code. Good code is easy to read and replicate, making it easier to spot mistakes. Good code reduces noise due to sampling, randomization, and cleaning errors. Good code can easily be reviewed by others before it's published and replicated afterwards. From 4264adc5fc0413f40dff1e4ba3dfa281d1238ee6 Mon Sep 17 00:00:00 2001 From: Maria Date: Fri, 7 Feb 2020 18:18:37 -0500 Subject: [PATCH 569/854] Update chapters/introduction.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- chapters/introduction.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index fb1dd4eb7..d8543b4ab 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -93,7 +93,7 @@ \section{Outline of this book} \textbf{field coordinators (FCs)} who are responsible for the implementation of the study on the ground; and \textbf{research assistants (RAs)} who are responsible for -handling raw data processing and analytical tasks. +handling data processing and analytical tasks. \section{Adopting reproducible workflows} From e3e94e40554c67f0b451a9339510538d8fae2f02 Mon Sep 17 00:00:00 2001 From: Maria Date: Fri, 7 Feb 2020 18:27:43 -0500 Subject: [PATCH 570/854] Intro re-write fixed bullet points --- chapters/introduction.tex | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index 627884e85..88aca47fc 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -126,8 +126,11 @@ \section{Writing reproducible code in a collaborative environment} we provide here a brief introduction to ``good'' code and standardized practices. ``Good'' code has two elements: -- it is correct (doesn't produce any errors along the way) -- it is useful and comprehensible to someone who hasn't seen it before (including the author three weeks later) +\begin{itemize} +\item it is correct (doesn't produce any errors along the way) +\item it is useful and comprehensible to someone who hasn't seen it before (or even yourself a few weeks, months or years later) +\end{itemize} + Many researchers have been trained to code correctly. 
However, when your code runs on your computer and you get the correct results, you are only half-done writing \underline{good} code. Good code is easy to read and replicate, making it easier to spot mistakes. From ed01f5f75fb8b5a0795b3c474e6f0452569c7c96 Mon Sep 17 00:00:00 2001 From: Maria Date: Fri, 7 Feb 2020 18:29:52 -0500 Subject: [PATCH 571/854] Intro re-write mentioned book as example of DIME public good --- chapters/introduction.tex | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index c603a054c..81e5572d6 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -52,7 +52,7 @@ \section{Doing credible research at scale} US\$180 million research budget to shape the design and implementation of US\$18 billion in development finance. DIME also provides advisory services to 30 multilateral and bilateral development agencies. -Finally, DIME invests in public goods to improve the quality and reproducibility of development research around the world. +Finally, DIME invests in public goods (such as this book) to improve the quality and reproducibility of development research around the world. DIME Analytics was created to take advantage of the concentration and scale of research at DIME to develop and test solutions, to ensure high quality of data collection and research across the DIME portfolio, @@ -78,7 +78,9 @@ \section{Outline of this book} focusing on research transparency and the right to privacy. The second chapter discusses the importance of planning data work at the outset of the research project - long before any data is acquired - and provide suggestions for collaborative workflows and tools. -Next, we turn to common research designs for \textbf{causal inference}{\sidenote{causal inference: identifying the change in outcome \textit{caused} by a particular intervention}}, and consider their implications for data structure. +Next, we turn to common research designs for +\textbf{causal inference}{\sidenote{causal inference: identifying the change in outcome +\textit{caused} by a particular intervention}}, and consider their implications for data structure. The fourth chapter covers how to implement sampling and randomization to ensure research credibility, and includes details on power calculation and randomization inference. The fifth chapter provides guidance on high quality primary data collection, particularly for projects that use surveys. From 31de73d76eb8c1b88127ec51fe0604c5ab441e85 Mon Sep 17 00:00:00 2001 From: Luiza Date: Fri, 7 Feb 2020 19:09:37 -0500 Subject: [PATCH 572/854] [ch7] restructure of publishing data subsection --- chapters/publication.tex | 34 ++++++++++++++++------------------ 1 file changed, 16 insertions(+), 18 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 9f8cf4c81..1fb3830fa 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -273,13 +273,21 @@ \subsection{Publishing data for replication} to investigate what other results might be obtained from the same population, and test alternative approaches to other questions. Therefore you should make clear in your study -where and how data are stored, and how and under what circumstances it might be accessed. +where and how data are stored, and how and under what circumstances they may be accessed. 
You do not always have to publish the data yourself, -and in some cases you are legally not allowed to, -but what matters is that the data is published -(with or without access restrictions) -and that you cite or otherwise directly reference all data, -even data that you cannot release. +and in some cases you are legally not allowed to. +Even if you cannot release data immediately or publicly, +there are often options to catalog or archive the data without open publication. +These may take the form of metadata catalogs or embargoed releases. +Such setups allow you to hold an archival version of your data +which your publication can reference, +as well as provide information about the contents of the datasets +and how future users might request permission to access them +(even if you are not the person who can grant that permission). +They can also provide for timed future releases of datasets +once the need for exclusive access has ended. + +What matters is for you to be able to cite or otherwise directly reference the data used. When your raw data is owned by someone else, or for any other reason you are not able to publish it, in many cases you will have the right to release @@ -304,16 +312,6 @@ \subsection{Publishing data for replication} depending on how it was collected, and the best time to resolve any questions about these rights is at the time that data collection or transfer agreements are signed. -Even if you cannot release data immediately or publicly, -there are often options to catalog or archive the data without open publication. -These may take the form of metadata catalogs or embargoed releases. -Such setups allow you to hold an archival version of your data -which your publication can reference, -as well as provide information about the contents of the datasets -and how future users might request permission to access them -(even if you are not the person who can grant that permission). -They can also provide for timed future releases of datasets -once the need for exclusive access has ended. Data publication should release the dataset in a widely recognized format. While software-specific datasets are acceptable accompaniments to the code @@ -325,10 +323,10 @@ \subsection{Publishing data for replication} the data collection instrument or survey questionnaire so that readers can understand which data components are collected directly in the field and which are derived. -You should publish both a clean version of the data +If possible, you should publish both a clean version of the data which corresponds exactly to the original database or questionnaire as well as the constructed or derived dataset used for analysis. -Wherever possible, you should also release the code +You should also release the code that constructs any derived measures, particularly where definitions may vary, so that others can learn from your work and adapt it as they like. 
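To make the release package described above concrete, a minimal Stata sketch of the kind of files to publish might look like the following. All file and variable names here (household_clean.dta, the asset ownership indicators, the release/ folder) are hypothetical placeholders rather than anything from the book's own materials; the point is simply that both the cleaned and the constructed dataset are exported in a widely recognized format, and that the derived measure is defined in code.

* Sketch: publishing a clean and a constructed dataset (hypothetical names)
* Load the cleaned dataset that corresponds to the original questionnaire
use "data/household_clean.dta", clear
export delimited using "release/household_clean.csv", replace

* Construct a derived indicator, documenting its definition in code
generate asset_index = (has_radio + has_tv + has_fridge) / 3
label variable asset_index "Simple average of three asset-ownership indicators"

* Save and export the constructed dataset used for analysis
save "data/household_constructed.dta", replace
export delimited using "release/household_constructed.csv", replace

Releasing the comma-separated versions alongside the software-specific files keeps the data usable for readers who do not work in the same statistical package.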
From db6e7b9627c16ef4e4c1497cf4149de2d2d75497 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Sat, 8 Feb 2020 11:51:53 -0500 Subject: [PATCH 573/854] Licensing and ownership --- chapters/data-collection.tex | 128 ++++++++++++++++++++++++++++++++++- 1 file changed, 127 insertions(+), 1 deletion(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 7fb2a07bd..0ea6387b6 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -15,12 +15,138 @@ %------------------------------------------------ \section{Collecting primary data with development partners} -\subsection{Who owns data?} +Primary data is the key to most modern development research. +Often, there is simply no source of reliable official statistics +on the inputs or outcomes we are interested in. +Therefore we undertake to create or obtain new data, +typically in patnership with a local agency or organization. +The intention of primary data collection +is to answer a unique question that cannot be approached in any other way, +so it is important to properly collect and handle that data, +especially when it belongs to or describes people. + +\subsection{Data ownership} + +Data ownership is a tricky subject, +as many jurisdictions have differing laws regarding data and information, +and you may even be subject to multiple conflicting regulations. +In some places, data is implicitly owned by the people who it is about. +In others, it is owned by the people who collected it. +In still more, it is highly unclear and there may be varying norms. +The best approach is always to consult with a local partner +and to make explicit agreements (including consent, where applicable) +about who owns any data that is collected. +Particularly where personal data, business data, or government data is involved +-- that is, when people are disclosing information to you +that you could not obtain simply by walking around and looking -- +you should be extremely clear up front about what will be done with data +so that there is no possibility of confusion down the line. + +As with all forms of ownership, +data ownership can include a variety of rights and responsibilities +which can be handled together or separately. +Ownership of data may or may not give you +the right to share that data with other people, +the right to publish or reproduce that data, +or even the right to store data in certain ways or in certain locations. +Again, clarity is key. +If you are not collecting the data directly yourself -- +for example, if a government, company, or agency is doing it for you -- +make sure that you have an explicit agreement with them +about who owns the resulting data +and what the rights and responsibilities of the data collectors are, +including reuse, storage, and retention or destruction of data. \subsection{Data licensing agreements} +Data licensing is the formal act of giving some data rights to others +while retaining ownership of a particular dataset. +Whether or not you are the owner of a dataset you want to analyze, +you can enter into a licensing agreement to access it for research purposes. +Similarly, when you own a dataset, +you may be interested in allowing specific people +or the general public to use it for various reasons. +As a researcher, it is your responsibility to respect the rights +of people who own data and people who are described in it; +but it is also your responsibility to make sure +that information is as available and accessible it can be. 
+These twin responsibilities can and do come into tension, +so it is important to be fully informed about what others are doing +and to fully inform others of what you are doing. +Writing down and agreeing to specific details is a good way of doing that. + +When you are licensing someone else's data for research, +keep in mind that they are not likely to be familiar +with the research process, and therefore may be surprised +at some of the things you want to do if you are not clear up front. +You will typically want the right to create and retain +derivative indicators, and you will want to own that output dataset. +You will want to store, catalog, or publish, in whole or in part, +either the original licensed material or the derived dataset. +Make sure that the license you obtain from the data owner allows these uses, +and that you check in with them if you have any questions +about what you are allowed to do with specific portions of their data. + +When you are licensing your own data for release, +whether it is to a particular individual or to a group, +make sure you take the same considerations. +Would you be okay with someone else publicly releasing that data in full? +Would you be okay with it being stored on servers anywhere in the world, +even ones that are owned by corporations or governments abroad? +Would you expect that users of your data cite you or give you credit, +or would you require them in turn to release +their derivative data or publications under similar licenses as yours? +Whatever your answers are to these questions, +make sure your license or other agreement +specifically details those requirements. + \subsection{Receiving data from development partners} +Data may be recieved from development partners in various ways. +You may conduct a first-hand survey either of them or with them +(more on that in the next section). +You may recieve access to servers or accounts that already exist. +You may recieve a one-time transfer of a block of data, +or you may be given access to a restricted area to extract information. +Talk to an information-technology specialist, +either at your organization or at the parner organization, +to ensure that data is being transferred, recieved, and stored +in a method that conforms to the relevant level of security. +The data owner will determine the appropriate level of security. +Whether or not you are the data owner, you will need to use your judgment +and follow the data protocols that were determined +in the course of your IRB approval to obtain and use the data: +these may be stricter than the requirements of the data provider. + +Another consideration that is important at this stage is proper documentation and cataloging of data. +It is not always clear what pieces of information jointly constitute a ``dataset'', +and many of the sources you recieve data from will not be organized for research. +To help you keep organized and to put some stucture on the materials you will be recieving, +you should always retain the original data as recieved +alongside a copy of the corresponding ownership agreement or license. +You should make a simple ``readme'' document noting the date of reciept, +the source and recipient of the data, and a brief description of what it is. +All too often data produced by systems is provided as vaguely-named spreadsheets, +or transferred as electronic communications with non-specific titles, +and it is not possible to keep track of these kinds of information as data over time. 
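The ``readme'' described above does not need to be elaborate. As one hedged illustration (the folder name, source, and contents below are hypothetical), it can even be written from the statistical software at the moment the data is received, so the record is created at the same time as the files are stored:

* Sketch: record the receipt of raw data from a partner (hypothetical details)
local today : display %tdCCYY-NN-DD date(c(current_date), "DMY")

file open  readme using "data/raw/partner-admin-data/README.txt", write replace
file write readme "Date received : `today'"                                  _n
file write readme "Source        : Partner agency, administrative records"   _n
file write readme "Received by   : Research assistant, via encrypted drive"  _n
file write readme "Contents      : Facility-level service statistics"        _n
file write readme "Agreement     : See signed license stored in this folder" _n
file close readme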
+Eventually, you will want to make sure that you are creating a collection or object +that can be properly submitted to a data catalog and given a reference and citation. + +As soon as the requisite pieces of information are stored together, +think about which ones are the components of what you would call a dataset. +This is, as many things are, more of an art than a science: +you want to keep things together that belong together, +but you also want to keep things apart that belong apart. +There usually won't be a precise way to tell the answer to this question, +so consult with others about what is the appropriate level of aggregation +for the data project you have endeavored to obtain. +This is the object you will think about cataloging, releasing, and licensing +as you move towards the publication part of the research process. +This may require you to re-check with the provider +about what portions are acceptible to license, +particularly if you are combining various datasets +that may provide even more information about specific individuals. + %------------------------------------------------ \section{Collecting primary data using electronic surveys} From 43a4fb5b350a9a0bfb0f652bf0e82a1bc9d4fe3a Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Sat, 8 Feb 2020 11:56:42 -0500 Subject: [PATCH 574/854] Accept suggestion Co-Authored-By: Luiza Andrade --- chapters/research-design.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 17f66876b..21a054162 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -10,7 +10,7 @@ But you do need to understand the intuitive approach of the main methods in order to be able to collect, store, and analyze data effectively. Laying out the research design before starting data work -ensure you will know how to assess the statistical power of your research design +will ensure that you know how to assess the statistical power of your research design and calculate the correct estimate of your results. While you are in the field, understanding the research design will enable you to make decisions in the field From c67c5288081e57787fb946c52381bf44fa279fd8 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Sat, 8 Feb 2020 11:56:56 -0500 Subject: [PATCH 575/854] Accept suggestion Co-Authored-By: Luiza Andrade --- chapters/research-design.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 21a054162..bfb4b9d8e 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -13,7 +13,7 @@ will ensure that you know how to assess the statistical power of your research design and calculate the correct estimate of your results. While you are in the field, understanding the research design -will enable you to make decisions in the field +will enable you to make decisions when you inevitably have to allocate scarce resources between costly tasks like maximizing sample size or ensuring follow-up with specific respondents. 
From 135beb274aa2fe059f86b2bd6a6e27381114168c Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Sat, 8 Feb 2020 11:58:26 -0500 Subject: [PATCH 576/854] Accept suggestion --- chapters/planning-data-work.tex | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 7c44cecfc..5ded8fe92 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -304,13 +304,17 @@ \subsection{Choosing software} \section{Organizing code and data} We assume you are going to do nearly all of your analytical work through code. +Though it is possible to use some statistical software through the user interface +without writing any code, we strongly advise against it. +Writing code creates a record of every task you performed. +It also prevents direct interaction with the data files that could lead to non-reproducible steps. Good code, like a good recipe, allows other people to read and replicate it, -and this functionality is now considered an essential component of a research output. -You may do some exploratory tasks in an ``interactive'' way, +and this functionality is now considered an essential component of any research output. +You may do some exploratory tasks by point-and-click or typing directly into the console, but anything that is included in a research output must be coded up in an organized fashion so that you can release the exact code recipe that goes along with your final results. -But organizing files and folders is not a trivial task. +Still, organizing code and data into files and folders is not a trivial task. What is intuitive to one person rarely comes naturally to another, and searching for files and folders is everybody's least favorite task. As often as not, you come up with the wrong one, From c26820178caf0ccf43cb7e019fb7e596738e4807 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Kristoffer=20Bj=C3=A4rkefur?= Date: Sat, 8 Feb 2020 21:30:11 -0500 Subject: [PATCH 577/854] [ch4] how to set seed in R Co-Authored-By: Luiza Andrade --- chapters/sampling-randomization-power.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index c8b5975c7..631549d44 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -138,7 +138,7 @@ \subsection{Reproducibility in random Stata processes} can draw a uniformly distributed six-digit seed randomly by visiting \url{http://bit.ly/stata-random}. (This link is a just shortcut to request such a random seed on \url{https://www.random.org}.) There are many more seeds possible but this is a large enough set for most purposes.} -In Stata, \texttt{set seed [seed]} will set the generator to that start-point. +In Stata, \texttt{set seed [seed]} will set the generator to that start-point. In R, the \texttt{set.seed} function does the same. To be clear: you should not set a single seed once in the master do-file, but instead you should set a new seed in code right before each random process. 
The most important thing is that each of these seeds is truly random, From e11ff1176219ba2b8a7a331b5dbac7708436bed8 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Kristoffer=20Bj=C3=A4rkefur?= Date: Sat, 8 Feb 2020 21:30:29 -0500 Subject: [PATCH 578/854] [ch4] typo Co-Authored-By: Luiza Andrade --- code/randtreat-strata.do | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/code/randtreat-strata.do b/code/randtreat-strata.do index bc51014d6..1d5dbf75a 100644 --- a/code/randtreat-strata.do +++ b/code/randtreat-strata.do @@ -9,7 +9,7 @@ isid patient, sort // Sort set seed 796683 // Seed - drawn using http://bit.ly/stata-random -* Create strata indicator. The indicator is a categorical varaible with +* Create strata indicator. The indicator is a categorical variable with * one value for each unique combination of gender and age group. egen sex_agegroup = group(sex agegrp) , label label var sex_agegroup "Strata Gender and Age Group" From 905e932fd05166aa057bc49c077dc80507d2288f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Kristoffer=20Bj=C3=A4rkefur?= Date: Sat, 8 Feb 2020 21:32:25 -0500 Subject: [PATCH 579/854] [ch4] clarifcation in code comment Co-Authored-By: Luiza Andrade --- code/randtreat-strata.do | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/code/randtreat-strata.do b/code/randtreat-strata.do index 1d5dbf75a..09e6d24ea 100644 --- a/code/randtreat-strata.do +++ b/code/randtreat-strata.do @@ -10,7 +10,7 @@ set seed 796683 // Seed - drawn using http://bit.ly/stata-random * Create strata indicator. The indicator is a categorical variable with -* one value for each unique combination of gender and age group. +* a different value for each unique combination of gender and age group. egen sex_agegroup = group(sex agegrp) , label label var sex_agegroup "Strata Gender and Age Group" From 226d836397d29cfd17c3e57e46f2cf5fd6d726bf Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Kristoffer=20Bj=C3=A4rkefur?= Date: Sat, 8 Feb 2020 21:35:20 -0500 Subject: [PATCH 580/854] [ch4] randtreat misfit explanation Co-Authored-By: Luiza Andrade --- code/randtreat-strata.do | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/code/randtreat-strata.do b/code/randtreat-strata.do index 09e6d24ea..cb9d2ac74 100644 --- a/code/randtreat-strata.do +++ b/code/randtreat-strata.do @@ -15,7 +15,7 @@ label var sex_agegroup "Strata Gender and Age Group" * Use the user written command randtreat to randomize when the groups -* cannot be evenly distributed into treatment arms. There are 20 +* cannot be evenly distributed into treatment arms. * observations in each strata, and there is no way to evenly distribute * 20 observations in 6 groups. If we assign 3 observation to each * treatment arm we have 2 observations in each strata left. The remaining From 97dee142a359282901acf84faab4e6dc03045731 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Kristoffer=20Bj=C3=A4rkefur?= Date: Sat, 8 Feb 2020 21:35:32 -0500 Subject: [PATCH 581/854] [ch4] randtreat misfit explanation Co-Authored-By: Luiza Andrade --- code/randtreat-strata.do | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/code/randtreat-strata.do b/code/randtreat-strata.do index cb9d2ac74..38e01fa18 100644 --- a/code/randtreat-strata.do +++ b/code/randtreat-strata.do @@ -16,7 +16,7 @@ * Use the user written command randtreat to randomize when the groups * cannot be evenly distributed into treatment arms. 
-* observations in each strata, and there is no way to evenly distribute +* This is the case here, since there are 20 observations in each strata * 20 observations in 6 groups. If we assign 3 observation to each * treatment arm we have 2 observations in each strata left. The remaining * observations are called "misfits". In randtreat we can use the "global" From ae6777997ef432be513c103530998e5f9cd08774 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Kristoffer=20Bj=C3=A4rkefur?= Date: Sat, 8 Feb 2020 21:36:18 -0500 Subject: [PATCH 582/854] [ch4] randtreat misfit explanation Co-Authored-By: Luiza Andrade --- code/randtreat-strata.do | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/code/randtreat-strata.do b/code/randtreat-strata.do index 38e01fa18..3da5c06c4 100644 --- a/code/randtreat-strata.do +++ b/code/randtreat-strata.do @@ -17,12 +17,12 @@ * Use the user written command randtreat to randomize when the groups * cannot be evenly distributed into treatment arms. * This is the case here, since there are 20 observations in each strata -* 20 observations in 6 groups. If we assign 3 observation to each -* treatment arm we have 2 observations in each strata left. The remaining -* observations are called "misfits". In randtreat we can use the "global" +* and 6 treatment arms to distribute them into. +* This will always result in two remaining ("misfits") observations in each group. +* randtreat offers different ways to deal with misfits. In this example, we use the "global" * misfit strategy, meaning that the misfits will be randomized into * treatment groups so that the sizes of the treatment groups are as -* balanced as possible globally (read helpfile for more information). +* balanced as possible globally (read helpfile for details on this and other strategies for misfits). * This way we have 6 treatment groups with exactly 20 observations * in each, and it is randomized which strata that has an extra * observation in each treatment arm. From 26324a45dde7091f927749ad4545c8a19cc5b58e Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Kristoffer=20Bj=C3=A4rkefur?= Date: Sat, 8 Feb 2020 21:38:00 -0500 Subject: [PATCH 583/854] [ch4] code comment typo Co-Authored-By: Luiza Andrade --- code/simple-multi-arm-randomization.do | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/code/simple-multi-arm-randomization.do b/code/simple-multi-arm-randomization.do index 65c77c8b7..946c8bd6e 100644 --- a/code/simple-multi-arm-randomization.do +++ b/code/simple-multi-arm-randomization.do @@ -10,7 +10,7 @@ gen treatment_rand = rnormal() //Generate a random number sort treatment_rand //Sort based on the random number -* See simple-sample.do example for explination of "(_n <= _N * X)". The code +* See simple-sample.do example for an explanation of "(_n <= _N * X)". The code * below randomly selects one third into group 0, one third into group 1 and * one third into group 2. 
Typically 0 represents the control group and 1 and * 2 represents two treatment arms From 3fa164c37c60130a3b33946649d33c926365561f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Kristoffer=20Bj=C3=A4rkefur?= Date: Sat, 8 Feb 2020 21:38:33 -0500 Subject: [PATCH 584/854] [ch4] clarifcation in code comment Co-Authored-By: Luiza Andrade --- code/simple-multi-arm-randomization.do | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/code/simple-multi-arm-randomization.do b/code/simple-multi-arm-randomization.do index 946c8bd6e..65bf6520d 100644 --- a/code/simple-multi-arm-randomization.do +++ b/code/simple-multi-arm-randomization.do @@ -11,7 +11,7 @@ sort treatment_rand //Sort based on the random number * See simple-sample.do example for an explanation of "(_n <= _N * X)". The code -* below randomly selects one third into group 0, one third into group 1 and +* below randomly selects one third of the observations into group 0, one third into group 1 and * one third into group 2. Typically 0 represents the control group and 1 and * 2 represents two treatment arms generate treatment = 0 //Set all observations to 0 From 38c1fd65113f4a7ce42fab2182728efa55b1ef7e Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Sat, 8 Feb 2020 21:43:33 -0500 Subject: [PATCH 585/854] [ch4] randtreat ex. Even line length, use "group" consistently etc --- code/randtreat-strata.do | 23 +++++++++++------------ 1 file changed, 11 insertions(+), 12 deletions(-) diff --git a/code/randtreat-strata.do b/code/randtreat-strata.do index 3da5c06c4..ace5a9419 100644 --- a/code/randtreat-strata.do +++ b/code/randtreat-strata.do @@ -14,18 +14,17 @@ egen sex_agegroup = group(sex agegrp) , label label var sex_agegroup "Strata Gender and Age Group" -* Use the user written command randtreat to randomize when the groups -* cannot be evenly distributed into treatment arms. -* This is the case here, since there are 20 observations in each strata -* and 6 treatment arms to distribute them into. -* This will always result in two remaining ("misfits") observations in each group. -* randtreat offers different ways to deal with misfits. In this example, we use the "global" -* misfit strategy, meaning that the misfits will be randomized into -* treatment groups so that the sizes of the treatment groups are as -* balanced as possible globally (read helpfile for details on this and other strategies for misfits). -* This way we have 6 treatment groups with exactly 20 observations -* in each, and it is randomized which strata that has an extra -* observation in each treatment arm. +* Use the user written command randtreat to randomize when the groups cannot +* be evenly distributed into treatment arms. This is the case here, since +* there are 20 observations in each strata and 6 treatment arms to randomize +* them into. This will always result in two remaining ("misfits") observations +* in each group. randtreat offers different ways to deal with misfits. In this +* example, we use the "global" misfit strategy, meaning that the misfits will +* be randomized into treatment groups so that the sizes of the treatment +* groups are as balanced as possible globally (read the help file for details +* on this and other strategies for misfits). This way we have 6 treatment +* groups with exactly 20 observations in each, and it is randomized which +* group that has an extra observation in each treatment arm. 
randtreat, /// generate(treatment) /// New variable name multiple(6) /// 6 treatment arms From bb3055685c623966def7db0b0888d1a65191a890 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Mon, 10 Feb 2020 10:03:50 -0500 Subject: [PATCH 586/854] [ch1] sentence flow better like this - imho --- chapters/handling-data.tex | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 37ffc7f6c..52f7d449c 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -349,9 +349,10 @@ \subsection{Transmitting and storing data securely} inside that secure environment if multiple users share accounts. However, password-protection alone is not sufficient, because if the underlying data is obtained through a leak the information itself remains usable. -Raw data which contains PII \textit{must} therefore be \textbf{encrypted}\sidenote{ - \textbf{Encryption:} Methods which ensure that files are unreadable even if laptops are stolen, databases are hacked, or any other type of unauthorized access is obtained. - \url{https://dimewiki.worldbank.org/wiki/encryption}} +Raw data which contains PII \textit{must} therefore be \index{encryption}\textbf{encrypted}\sidenote{ +\textbf{Encryption:} Methods which ensure that files are unreadable even if laptops +are stolen, databases are hacked, or unauthorized access to the data is obtained in +any other way. \url{https://dimewiki.worldbank.org/wiki/encryption}} during data collection, storage, and transfer.\index{encryption}\index{data transfer}\index{data storage} The biggest security gap is often in transmitting survey plans to and from staff in the field, since staff with technical specialization are usually in an HQ office. From 8119fbe8b98c993ce4666bf589b855eb211ceb29 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Mon, 10 Feb 2020 10:04:07 -0500 Subject: [PATCH 587/854] [ch1] link to encryption in transit. SCTO is mentioned in that wiki page --- chapters/handling-data.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 52f7d449c..1d35a5d90 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -363,7 +363,7 @@ \subsection{Transmitting and storing data securely} Most modern data collection software has features that, if enabled, make secure transmission straightforward.\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/SurveyCTO_Form_Settings}} + \url{https://dimewiki.worldbank.org/wiki/Encryption\#Encryption\_in\_Transit}} Many also have features that ensure data is encrypted when stored on their servers, although this usually needs to be actively enabled and administered. Proper encryption means that, From 424576ccbb98ec5186542c203962e906f3b584c5 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Mon, 10 Feb 2020 10:04:27 -0500 Subject: [PATCH 588/854] [ch1] link to encryption at rest. 
 no decrypt key for service providers

---
 chapters/handling-data.tex | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex
index 1d35a5d90..9ab104aa6 100644
--- a/chapters/handling-data.tex
+++ b/chapters/handling-data.tex
@@ -365,12 +365,14 @@ \subsection{Transmitting and storing data securely}
 if enabled, make secure transmission straightforward.\sidenote{
  \url{https://dimewiki.worldbank.org/wiki/Encryption\#Encryption\_in\_Transit}}
 Many also have features that ensure data is encrypted when stored on their servers,
-although this usually needs to be actively enabled and administered.
-Proper encryption means that,
-even if the information were to be intercepted or made public,
-the files that would be obtained would be useless to the recipient.
-In security language this person is often referred to as an ``intruder''
-but it is rare that data breaches are malicious or even intentional.
+although this usually needs to be actively enabled and administered.\sidenote{
+ \url{https://dimewiki.worldbank.org/wiki/Encryption\#Encryption\_at\_Rest}}
+Proper encryption means that, even if the files were to be intercepted by a malicious
+``intruder'' or accidentally made public, the information that would be leaked would
+be completely unreadable and unusable. Proper encryption also means that no one who is not
+listed on the IRB may have access to the decryption key, which means that it is usually not
+enough to rely on service providers' on-the-fly encryption, as they need to keep a copy
+of the decryption key to make it automatic.
 
 The easiest way to protect personal information is not to use it.
 It is often very simple to conduct planning and analytical work

From fd158deb773ef455cbc7c7d660ea78d25c62b655 Mon Sep 17 00:00:00 2001
From: kbjarkefur
Date: Mon, 10 Feb 2020 10:04:41 -0500
Subject: [PATCH 589/854] [ch1] make it clear this is not an alternative to encryption

---
 chapters/handling-data.tex | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex
index 9ab104aa6..337ba7e5f 100644
--- a/chapters/handling-data.tex
+++ b/chapters/handling-data.tex
@@ -374,7 +374,7 @@ \subsection{Transmitting and storing data securely}
 enough to rely on service providers' on-the-fly encryption, as they need to keep a copy
 of the decryption key to make it automatic.
 
-The easiest way to protect personal information is not to use it.
+The easiest way to reduce the risk of leaking personal information is to use it as rarely as possible.
 It is often very simple to conduct planning and analytical work
 using a subset of the data that has anonymous identifying ID variables,
 and has had personal characteristics removed from the dataset altogether.

From 1f9e01bae7252e9f9f6e50e848b9c5c6e5e26a37 Mon Sep 17 00:00:00 2001
From: kbjarkefur
Date: Mon, 10 Feb 2020 10:16:46 -0500
Subject: [PATCH 590/854] [ch5] typos

---
 chapters/data-collection.tex | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex
index 0ea6387b6..c8d666e57 100644
--- a/chapters/data-collection.tex
+++ b/chapters/data-collection.tex
@@ -19,7 +19,7 @@ \section{Collecting primary data with development partners}
 Often, there is simply no source of reliable official statistics
 on the inputs or outcomes we are interested in.
 Therefore we undertake to create or obtain new data,
Therefore we undertake to create or obtain new data, -typically in patnership with a local agency or organization. +typically in partnership with a local agency or organization. The intention of primary data collection is to answer a unique question that cannot be approached in any other way, so it is important to properly collect and handle that data, @@ -102,15 +102,15 @@ \subsection{Data licensing agreements} \subsection{Receiving data from development partners} -Data may be recieved from development partners in various ways. +Data may be received from development partners in various ways. You may conduct a first-hand survey either of them or with them (more on that in the next section). -You may recieve access to servers or accounts that already exist. -You may recieve a one-time transfer of a block of data, +You may receive access to servers or accounts that already exist. +You may receive a one-time transfer of a block of data, or you may be given access to a restricted area to extract information. Talk to an information-technology specialist, -either at your organization or at the parner organization, -to ensure that data is being transferred, recieved, and stored +either at your organization or at the partner organization, +to ensure that data is being transferred, received, and stored in a method that conforms to the relevant level of security. The data owner will determine the appropriate level of security. Whether or not you are the data owner, you will need to use your judgment @@ -120,11 +120,11 @@ \subsection{Receiving data from development partners} Another consideration that is important at this stage is proper documentation and cataloging of data. It is not always clear what pieces of information jointly constitute a ``dataset'', -and many of the sources you recieve data from will not be organized for research. -To help you keep organized and to put some stucture on the materials you will be recieving, -you should always retain the original data as recieved +and many of the sources you receive data from will not be organized for research. +To help you keep organized and to put some structure on the materials you will be receiving, +you should always retain the original data as received alongside a copy of the corresponding ownership agreement or license. -You should make a simple ``readme'' document noting the date of reciept, +You should make a simple ``readme'' document noting the date of receipt, the source and recipient of the data, and a brief description of what it is. All too often data produced by systems is provided as vaguely-named spreadsheets, or transferred as electronic communications with non-specific titles, @@ -143,7 +143,7 @@ \subsection{Receiving data from development partners} This is the object you will think about cataloging, releasing, and licensing as you move towards the publication part of the research process. This may require you to re-check with the provider -about what portions are acceptible to license, +about what portions are acceptable to license, particularly if you are combining various datasets that may provide even more information about specific individuals. @@ -186,7 +186,7 @@ \subsection{Developing a survey instrument} and ensures teams have a readable, printable version of their questionnaire. Most importantly, it means the research, not the technology, drives the questionnaire design. 
-We recomment this approach because an easy-to-read paper questionnaire +We recommend this approach because an easy-to-read paper questionnaire is especially useful for training data collection staff, by focusing on the survey content and structure before diving into the technical component. It is much easier for enumerators to understand the range of possible participant responses @@ -288,7 +288,7 @@ \subsection{Designing surveys for electronic deployment} but allows for much more comprehensible top-line statistics and data quality checks. Rigorous field testing is required to ensure that answer categories are comprehensive; however, it is best practice to include an \textit{other, specify} option. -Keep track of those reponses in the first few weeks of fieldwork. +Keep track of those responses in the first few weeks of fieldwork. Adding an answer category for a response frequently showing up as \textit{other} can save time, as it avoids extensive post-coding. @@ -300,7 +300,7 @@ \subsection{Designing surveys for electronic deployment} and detailed descriptions of choice options. This is what you want for the enumerator-respondent interaction, but you should already have analysis-compatible labels programmed in the background -so the resulting data can be rapidly imported in anlytical software. +so the resulting data can be rapidly imported in analytical software. There is some debate over how exactly individual questions should be identified: formats like \texttt{hq\_1} are hard to remember and unpleasant to reorder, but formats like \texttt{hq\_asked\_about\_loans} quickly become cumbersome. @@ -319,7 +319,7 @@ \subsection{Designing surveys for electronic deployment} \subsection{Programming electronic questionnaires} -The starting point for questionnare programming is therefore a complete paper version of the questionnaire, +The starting point for questionnaire programming is therefore a complete paper version of the questionnaire, piloted for content and translated where needed. Doing so reduces version control issues that arise from making significant changes to concurrent paper and electronic survey instruments. @@ -547,7 +547,7 @@ \subsection{Receiving, storing, and sharing data securely} that are used in data collection have a secure logon password and are never left unlocked. Even though your data is therefore usually safe while it is being transmitted, -it is not automatially secure when it is being stored. +it is not automatically secure when it is being stored. \textbf{Encryption at rest} is the only way to ensure that PII data remains private when it is stored on someone else's server on the internet. You must keep your data encrypted on the data collection server whenever PII data is collected. From cefd6dfcecc4c925422afa4895e7b668f8763dd2 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Mon, 10 Feb 2020 10:45:49 -0500 Subject: [PATCH 591/854] [ch5] Add encryption index --- chapters/data-collection.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index c8d666e57..fa4d8434e 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -535,7 +535,7 @@ \subsection{Receiving, storing, and sharing data securely} Research teams must maintain strict protocols for data security at each stage of the process, including data collection, storage, and sharing. 
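As a brief illustration of the earlier point about giving questions analysis-compatible names and labels in the background, the imported data can be made analysis-ready with a few lines such as the following sketch (all variable names here are hypothetical, not part of any particular instrument):

* Sketch: analysis-compatible names and labels after import (hypothetical variables)
label define yesno 0 "No" 1 "Yes"
label values hq_2 yesno                       // choice labels stored behind the scenes

rename hq_2 loan_any                          // short, consistent analysis name
label variable loan_any "Household has at least one outstanding loan"

* Review "other, specify" answers during the first weeks of fieldwork
tabulate loan_source_other if loan_source_other != "", sort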
-In field surveys, most common data collection software will automatically \textbf{encrypt}\sidenote{ +In field surveys, most common data collection software will automatically \index{encryption}\textbf{encrypt}\sidenote{ \textbf{Encryption:} the process of making information unreadable to anyone without access to a specific deciphering key. \url{https://dimewiki.worldbank.org/wiki/Encryption}} From c0ff8779e2d82812c3825e6b2b1fc96ed695a623 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Mon, 10 Feb 2020 11:13:21 -0500 Subject: [PATCH 592/854] [ch5] encryption in transit --- chapters/data-collection.tex | 21 ++++++++++++--------- 1 file changed, 12 insertions(+), 9 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index fa4d8434e..3a0e16b35 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -536,15 +536,18 @@ \subsection{Receiving, storing, and sharing data securely} including data collection, storage, and sharing. In field surveys, most common data collection software will automatically \index{encryption}\textbf{encrypt}\sidenote{ - \textbf{Encryption:} the process of making information unreadable to - anyone without access to a specific deciphering - key. \url{https://dimewiki.worldbank.org/wiki/Encryption}} -all data submitted from the field while in transit (i.e., when uploading or downloading). -If this is implemented by the software, -the data will be encrypted from the time it leaves the device or browser until it reaches the server. -Therefore, as long as you are using an established survey software, this step is largely taken care of. -Of course, the research team must ensure that all computers, tablets, and accounts -that are used in data collection have a secure logon password and are never left unlocked. +\textbf{Encryption:} Methods which ensure that files are unreadable even if laptops +are stolen, databases are hacked, or unauthorized access to the data is obtained in +any other way. \url{https://dimewiki.worldbank.org/wiki/Encryption}} +all data submitted from the field while in transit (i.e., upload or download).\sidenote{ +\url{https://dimewiki.worldbank.org/wiki/Encryption\#Encryption\_in\_Transit}} +If this is implemented by the software you are using, then your +data will be encrypted from the time it leaves the device (in tablet-assisted data +collation) or browser (in web data collection), until it reaches the server. +Therefore, as long as you are using an established survey software, this step is +largely taken care of. However, the research team must ensure that all computers, +tablets, and accounts that are used in data collection have secure a logon +password and are never left unlocked. Even though your data is therefore usually safe while it is being transmitted, it is not automatically secure when it is being stored. From d3b4fa64d265e29b7a38afed90cc9823139fbf62 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Mon, 10 Feb 2020 11:22:58 -0500 Subject: [PATCH 593/854] [ch5] encryption at rest --- chapters/data-collection.tex | 23 +++++++++++------------ 1 file changed, 11 insertions(+), 12 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 3a0e16b35..4e389db7e 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -551,18 +551,17 @@ \subsection{Receiving, storing, and sharing data securely} Even though your data is therefore usually safe while it is being transmitted, it is not automatically secure when it is being stored. 
-\textbf{Encryption at rest} is the only way to ensure -that PII data remains private when it is stored on someone else's server on the internet. -You must keep your data encrypted on the data collection server whenever PII data is collected. -If you do not, the raw data will be accessible by individuals -who are not approved by your IRB agreement, -such as tech support personnel, server administrators, and other third-party staff. -Encryption at rest must be used to make data files completely unusable -without access to a security key specific to that data --- a higher level of security than simple password-protection. -Encryption at rest requires active participation from the user, -and you should be fully aware that if your private encryption key is lost, -there is absolutely no way to recover your data. +\textbf{Encryption at rest}\sidenote{ +\url{https://dimewiki.worldbank.org/wiki/Encryption\#Encryption\_at\_Rest}} +is the only way to ensure that PII data remains private when it is stored on a +server on the internet. You must keep your data encrypted on the data collection server +whenever PII data is collected. If you do not, the raw data will be accessible by +individuals who are not approved by your IRB, such as tech support personnel, server +administrators and other third-party staff. Encryption at rest must be used to make +data files completely unusable without access to a security key specific to that +data -- a higher level of security than password-protection. Encryption at rest +requires active participation from the user, and you should be fully aware that +if your decryption key is lost, there is absolutely no way to recover your data. You should not assume that your data is encrypted by default: because of the careful protocols necessary, for most data collection platforms, From dd8c5236638780a3dab3cbb87a63715f59786621 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Mon, 10 Feb 2020 12:44:48 -0500 Subject: [PATCH 594/854] [ch5] encryption at rest during data collection --- chapters/data-collection.tex | 42 ++++++++++++++++++------------------ 1 file changed, 21 insertions(+), 21 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 4e389db7e..f59fa9b96 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -563,27 +563,27 @@ \subsection{Receiving, storing, and sharing data securely} requires active participation from the user, and you should be fully aware that if your decryption key is lost, there is absolutely no way to recover your data. -You should not assume that your data is encrypted by default: -because of the careful protocols necessary, for most data collection platforms, -encryption at rest needs to be explicitly enabled and operated by the user. -There is no automatic way to implement this protocol, -because the encryption key that is generated -can never pass through the hands of a third party, including the data storage application. -To enable encryption at rest, you must confirm -you know how to operate the encryption system -and understand the consequences if the correct protocols are not followed. -When you enable encryption at rest, the service typically will allow you to download -- once -- -the keyfile pair needed to decrypt the data. -You must download and store this keyfile in a secure location, such as a password manager. -Make sure you store keyfiles with descriptive names to match the survey to which they correspond. 
-The keyfiles must only be accessible to people who are IRB-approved to use PII data. -Any time anyone accesses the data -- either when viewing it in the browser or downloading it to your computer -- they will be asked to provide the keyfile. -If they cannot, the data is inaccessible. -This makes keyfile encryption the recommended storage for any data service -that is not enterprise-grade. -Enterprise-grade storage services typically implement a similar protocol -and are legally and technically configured so that your organization -is able to hold keys safely and allow data access based on verification of your identity. +You should not assume that your data is encrypted at rest by default because of +the careful protocols necessary. In most data collection platforms, encryption at +rest needs to be explicitly enabled and operated by the user. There is no automatic +way to implement this protocol, because the encryption key that is generated may +never pass through the hands of a third party, including the data storage application. +Most survey software implement \textbf{asymmetric encryption}\sidenote{\url{ +https://dimewiki.worldbank.org/wiki/Encryption\#Asymmetric\_Encryption}} where +there are two keys in a public/private key pair. Only the private key can be used to +decrypt the encrypted data, and the public key can only be used to encrypt the data. +It is therefore safe to send the public key to the tablet or the browser used to +collect the data. When you enable encryption, the survey software will allow you to +download -- once -- the public/private keyfile pair needed to decrypt the data. You +upload the public key when you start a new survey, and all data collected using that +public key can only be accessed with the private key from that public/private key +pair. You must store the key pair in a secure location, such as a password manager, as +there is no way to access your data if the private key is lost. Make sure you store +keyfiles with descriptive names to match the survey to which they correspond. Any time +anyone accesses the data -- either when viewing it in the browser or downloading it to +your computer -- they will be asked to provide the keyfile. Only project team members +named in the IRB are allowed access to the private keyfile. + For most analytical needs, you will therefore need to create a copy of the data which has all direct identifiers removed. From cca3ffb84e5c139790d6658ae1b90e99224b5426 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Mon, 10 Feb 2020 12:53:16 -0500 Subject: [PATCH 595/854] [ch5] store data after data collection --- chapters/data-collection.tex | 41 +++++++++++++++++++++++++++--------- 1 file changed, 31 insertions(+), 10 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index f59fa9b96..72d538145 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -585,19 +585,40 @@ \subsection{Receiving, storing, and sharing data securely} named in the IRB are allowed access to the private keyfile. -For most analytical needs, you will therefore need to create -a copy of the data which has all direct identifiers removed. -This working copy can be stored using unencrypted storage methods, -staff who are not IRB-approved can access and use the data, -and it can be shared with other people involved in the research without strict protocols. 
-The following workflow allows you to receive data and store it securely, +For most analytical needs, you typically need a to store the data somewhere else +than the survey software's server, for example your computer or a cloud drive. While +asymmetric encryption is optimal for one-way transfer from the data collection device +to the data collection server, it is not practical once you start interacting with the data. + +Instead we want to use \textbf{symmetric encryption}\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Encryption\#Symmetric\_Encryption}} where we +create a secure encrypted folder, using for example VeraCrypt\sidenote{\url{https://www.veracrypt.fr/}}, +where a single key is used to both encrypt and decrypt the information. Since only one +key is used, the work flow can be simplified, the re-encryption after decrypting can +be done automatically and the same secure folder can be used for multiple files, and +these files can be interacted with and modified like any unencryted file as long as you +have the key. The following workflow allows you to receive data and store it securely, without compromising data security: \begin{enumerate} - \item Download data - \item Store a ``master'' copy of the data into an encrypted location that will remain accessible on disk and be regularly backed up - \item Create a ``gold master'' copy of the raw data in a secure location, such as a long-term cloud storage service or an encrypted physical hard drive stored in a separate location. If you remain lucky, you will never have to access this copy -- you just want to know it is there, safe, if you need it. - + \item Create a secure encrypted folder in your project folder, this should be on + your computer and could be in a shared folder. + \item Download data from the data collection server to that secure folder -- if + you encrypted the data during data collection you will need \textit{both} the + private key used during data collection to be able to download the data, \textit{and} + you will need the key used when created the secure folder to save it there. This + your first copy of your raw data. + \item Then create a secure folder on a pen-drive or a external hard drive, that you + can keep in your office. Copy the data you just downloaded to this second secure + folder. This is your ''master'' copy of your raw data. (Instead of creating a second + secure folder, you can simply copy the first secure folder) + \item Finally, create a third secure folder. Either you can create this on your + computer and upload it to a long-term cloud storage service, or you can create it on + an external hard drive that you then store in a separate location, for example at + another office of your organization. This is your ''golden master'' copy of your raw + data. You should never store the ''golden master'' copy of your raw data in a synced + folder where it is also deleted in the cloud storage if it is deleted on your computer. + (Instead of creating a third secure folder, you can simply copy the first secure folder). 
\end{enumerate} This handling satisfies the \textbf{3-2-1 rule}: there are From eee440493b46b7c954a45051661d7da096546f6e Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Mon, 10 Feb 2020 14:45:03 -0500 Subject: [PATCH 596/854] [ch5] intro to data security chapter --- chapters/data-collection.tex | 61 +++++++++++++++++++++++------------- 1 file changed, 39 insertions(+), 22 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 72d538145..f4a866af7 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -387,22 +387,14 @@ \subsection{Programming electronic questionnaires} The data-focused pilot should be done in advance of enumerator training. %------------------------------------------------ -\section{Data quality assurance and data security} +\section{Data quality assurance} Whether you are handling data from a partner or collecting it directly, it is important to make sure that data faithfully reflects ground realities. Data quality assurance requires a combination of real-time data checks and back-checks or validation audits, which often means tracking down the people whose information is in the dataset. -However, since that data also likely contains sensitive or personal information, -it is important to keep it safe throughout the entire process. -All sensitive data must be handled in a way -where there is no risk that anyone who is not approved by an IRB -for the specific project has the ability to access the data. -Data can be sensitive for multiple reasons, -but the most common reasons are that it contains personally identifiable information (PII) -or that the partner providing the data does not want it to be released. -This section will detail principles and practices for the verification and handling of these datasets. + \subsection{Implementing high frequency quality checks} @@ -520,20 +512,45 @@ \subsection{Conducting back-checks and data validation} \textbf{Audio audits} are a useful means to assess whether enumerators are conducting interviews as expected. Do note, however, that audio audits must be included in the informed consent for the respondents. -\subsection{Receiving, storing, and sharing data securely} +%------------------------------------------------ +\section{Collecting and sharing data securely} + +All sensitive data must be handled in a way where there is no risk that anyone who is +not approved by an Institutional Review Board (IRB)\sidenote{\url{ +https://dimewiki.worldbank.org/wiki/IRB\_Approval}} for the specific project has the +ability to access the data. Data can be sensitive for multiple reasons, but the two most +common reasons are that it contains personally identifiable information (PII)\sidenote{ +\url{https://dimewiki.worldbank.org/wiki/Personally\_Identifiable\_Information\_(PII)}} +or that the partner providing the data does not want it to be released. + +Central to data security is \index{encryption}\textbf{data encryption} which is a group +of methods that ensure that files are unreadable even if laptops are stolen, servers +are hacked, or unauthorized access to the data is obtained in any other way.\sidenote{\url{ +https://dimewiki.worldbank.org/wiki/Encryption}} Proper encryption is rarely just one thing as +the data will travel through many servers, devices and computers from the source of the data +to the final analysis. So encryption should be seen as a system that is only as secure as +its weakest link. 
This section recommends a workflow with as few parts as possible, so that
+it is as easy as possible to make sure the weakest link is still strong enough.
+
+Encrypted data is made readable again using decryption, and decryption requires a password or a key.
+You must never share passwords or keys by email, WhatsApp, or other insecure modes of communication;
+instead you must use a secure password manager\sidenote{\url{https://lastpass.com} or
+\url{https://bitwarden.com}}. In addition to providing a way to securely share passwords, password
+managers also provide a secure location for long-term storage of passwords and keys, regardless of
+whether they are shared or not.
+
+Many data sharing software providers will promote their services by saying they offer
+on-the-fly encryption and decryption. While this is not a bad thing, and it does make your data more secure,
+on-the-fly encryption/decryption by itself is never secure enough, because in order to make it automatic
+the provider needs to keep a copy of the password or key. Since it is unlikely that the software provider is
+included in your IRB protocol, this is not good enough.
+
+It is possible, in some enterprise versions of data sharing software, to set up on-the-fly encryption.
+However, that setup is advanced, and you should never trust it unless you are a cybersecurity expert,
+or a cybersecurity expert within your organization has specified what it can be used for. In all
+other cases you should follow the steps laid out in this section.
 
-Primary data collection, whether in surveys or from partners,
-almost always includes \textbf{personally-identifiable information (PII)}\sidenote{
-	\url{https://dimewiki.worldbank.org/wiki/Personally_Identifiable_Information_(PII)}}
-from the people who are described in the dataset.
-PII must be handled with great care at all points in the data collection and management process, in order to comply with ethical and legal requirements
-and to avoid breaches of confidentiality.
-Access to PII must be restricted exclusively to team members
-who are granted that permission by the applicable Institutional Review Board
-or the data licensing agreement with the partner agency.
-Research teams must maintain strict protocols for data security at each stage of the process,
-including data collection, storage, and sharing.

 In field surveys, most common data collection software will automatically
 \index{encryption}\textbf{encrypt}\sidenote{
 \textbf{Encryption:} Methods which ensure that files are unreadable even if laptops

From 75df3a82cf2b6e9c9c79da595f0f379147638707 Mon Sep 17 00:00:00 2001
From: kbjarkefur
Date: Mon, 10 Feb 2020 14:46:18 -0500
Subject: [PATCH 597/854] [ch5] data security after data collection

---
 chapters/data-collection.tex | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex
index f4a866af7..fa871eae3 100644
--- a/chapters/data-collection.tex
+++ b/chapters/data-collection.tex
@@ -624,7 +624,7 @@ \section{Collecting and sharing data securely}
 	you encrypted the data during data collection you will need \textit{both} the
 	private key used during data collection to be able to download the data, \textit{and}
 	you will need the key used when you created the secure folder to save it there. This
-	your first copy of your raw data.
+	is your first copy of your raw data, and the copy you will use in your cleaning and analysis.
 	\item Then create a secure folder on a pen drive or an external hard drive that you
 	can keep in your office.
Copy the data you just downloaded to this second secure
 	folder. This is your ``master'' copy of your raw data. (Instead of creating a second
 	secure folder, you can simply copy the first secure folder.)
@@ -638,13 +638,13 @@ \section{Collecting and sharing data securely}
 	(Instead of creating a third secure folder, you can simply copy the first secure folder).
 \end{enumerate}
 
-This handling satisfies the \textbf{3-2-1 rule}: there are
+\noindent This handling satisfies the \textbf{3-2-1 rule}: there are
 two on-site copies of the data and one off-site copy, so the data can never
-be lost in case of hardware failure.\sidenote{
-	\url{https://www.backblaze.com/blog/the-3-2-1-backup-strategy/}}
-In addition, you should ensure that all teams take basic precautions to ensure the security of data, as most problems are due to human error.
-Ideally, the machine hard drives themselves should also be encrypted,
-as well as any external hard drives or flash drives used.
+be lost in case of hardware failure.\sidenote{\url{
+https://www.backblaze.com/blog/the-3-2-1-backup-strategy/}} However, you still
+need to keep track of your encryption keys, as without them your data is lost.
+If you remain lucky, you will never have to access your ``master'' or ``golden master''
+copies -- you just want to know they are there, safe, if you need them.
 All files sent to the field containing PII data, such as sampling lists, must be encrypted.
 You must never share passwords by email; rather, use a secure password manager.
 This significantly mitigates the risk in case there is a security breach
 such as loss, theft, hacking, or a virus, with little impact on day-to-day utilization.

From b7458c02d3e9fbd3b1a194210659d3a553153727 Mon Sep 17 00:00:00 2001
From: kbjarkefur
Date: Mon, 10 Feb 2020 14:47:18 -0500
Subject: [PATCH 598/854] [ch 5] share data securely in team

---
 chapters/data-collection.tex | 30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex
index fa871eae3..8ec4e4ca2 100644
--- a/chapters/data-collection.tex
+++ b/chapters/data-collection.tex
@@ -647,8 +647,34 @@ \section{Collecting and sharing data securely}
 copies -- you just want to know they are there, safe, if you need them.
 All files sent to the field containing PII data, such as sampling lists, must be encrypted.
 You must never share passwords by email; rather, use a secure password manager.
-This significantly mitigates the risk in case there is a security breach
-such as loss, theft, hacking, or a virus, with little impact on day-to-day utilization.
+You and your team will use your first copy of the raw data as the starting point for data
+cleaning and analysis. This raw dataset must remain encrypted at all times if it
+includes PII data, which is almost always the case. As long as the data is properly encrypted,
+using for example VeraCrypt, it can be shared using insecure modes of communication such as
+email or third-party syncing services. While this is safe from a data security perspective,
+this is a burdensome workflow, as anyone accessing the raw data must be listed on the IRB,
+have access to the decryption key, and know how to use that key. Fortunately, there is a way
+to simplify the workflow without compromising data security.
+
+To simplify the workflow, the PII variables should be removed from your data at the earliest
+possible opportunity, creating a de-identified copy of the data. Once the data is de-identified,
+it no longer needs to be encrypted -- therefore you and your team members can share it directly
+without having to encrypt it and handle decryption keys.
The next chapter will discuss how to
+de-identify your data. If PII variables are directly required for the analysis itself, it will
+be necessary to keep at least a subset of the data encrypted through the data analysis process.
+
+The data security standards that apply when receiving PII data obviously also apply when sending
+PII data. A common example where this is often forgotten is when sending survey information,
+such as sampling lists, to the field partner. This data is by all definitions also PII data and
+must be encrypted. A sampling list can often be used to reverse-identify a de-identified dataset,
+so if you were to share it using an insecure method, then that would be your weakest link that
+could break all the other steps you have taken to ensure the privacy of the respondents.
+
+In some survey software, you can use the same encryption that allows you to receive data securely
+from the field to also send data, such as a sampling list, to the field. But if you are not sure
+how that is done, or whether it can be done at all, in the survey software you are using, then you should create a
+secure folder using, for example, VeraCrypt, and share that secure folder with the field team.
+Remember that you must always share passwords and keys in a secure way, such as through a password manager.
 
 To simplify workflow, it is best to remove PII variables from your data
 at the earliest possible opportunity, and save a de-identified copy of the data.

From 64367f87562d296f36e35956e97b61bbcee511d3 Mon Sep 17 00:00:00 2001
From: kbjarkefur
Date: Mon, 10 Feb 2020 14:47:57 -0500
Subject: [PATCH 599/854] [ch5] move de-identification to the end

---
 chapters/data-collection.tex | 26 ++++++++++++--------------
 1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex
index 8ec4e4ca2..bfefde3b8 100644
--- a/chapters/data-collection.tex
+++ b/chapters/data-collection.tex
@@ -676,10 +676,11 @@ \section{Collecting and sharing data securely}
 secure folder using, for example, VeraCrypt, and share that secure folder with the field team.
 Remember that you must always share passwords and keys in a secure way, such as through a password manager.
 
-To simplify workflow, it is best to remove PII variables from your data
-at the earliest possible opportunity, and save a de-identified copy of the data.
-Once the data is de-identified, it no longer needs to be encrypted
--- therefore you can interact with it directly without having to provide the keyfile.
+
+%------------------------------------------------
+
+\section{de-identification - for luiza to move - either back to this chapter or to other chapters}
+
 We recommend de-identification in two stages:
 an initial process to remove direct identifiers to create a working de-identified dataset,
 and a final process to remove all possible identifiers to create a publishable dataset.
 The \textbf{initial de-identification} should happen directly after the encrypted data is downloaded to disk.
 At this time, for each variable that contains PII, ask: will this variable be needed for analysis?
 If not, the variable should be dropped.
 Examples include respondent names, enumerator names, interview dates, and respondent phone numbers.
 If the variable is needed for analysis, ask:
 can I encode or otherwise construct a variable to use for the analysis that masks the PII,
 and drop the original variable?
 Examples include geocoordinates (after constructing measures of distance or area, drop the specific location), and names for social network analysis (can be encoded to unique numeric IDs).
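To make this concrete, a minimal sketch of such an initial de-identification script is shown below.
It is written in Stata, and every variable and file name in it (the respondent and enumerator names,
the phone and GPS variables, and the file paths) is a hypothetical placeholder rather than a fixed
convention; adapt them to your own questionnaire and folder structure.

\begin{verbatim}
* Minimal sketch of an initial de-identification script (hypothetical names).
* Run this while the encrypted folder holding the raw data is mounted.
use "raw_survey.dta", clear

* Flag candidate PII variables by searching variable names and labels
lookfor name phone gps address

* Keep an anonymous numeric ID for respondents instead of their names
* (egen group() creates IDs without storing the names as value labels)
egen respondent_id = group(respondent_name)

* Drop direct identifiers that are not needed for analysis
drop respondent_name enumerator_name phone_number gps_latitude gps_longitude

* Save the working, de-identified copy for day-to-day use
save "survey_deidentified.dta", replace
\end{verbatim}

The same pattern -- construct or encode what you need, then drop the original identifiers --
applies to any other PII variable flagged at the questionnaire design stage.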
+ Flagging all potentially identifying variables in the questionnaire design stage, as recommended above, simplifies the initial de-identification. You already have the list of variables to assess, and ideally have already assessed those against the analysis plan. If so, all you need to do is write a script to drop the variables that are not required for analysis, - encode or otherwise mask those that are required, and save a working version of the data. +encode or otherwise mask those that are required, and save a working version of the data. The \textbf{final de-identification} is a more involved process, with the objective of creating a dataset for publication that cannot be manipulated or linked to identify any individual research participant. You must remove all direct and indirect identifiers, and assess the risk of statistical disclosure.\sidenote{ - \textbf{Disclosure risk:} the likelihood that a released data record can be associated with an individual or organization.} + \textbf{Disclosure risk:} the likelihood that a released data record can be associated with an individual or organization.} \index{statistical disclosure} There will almost always be a trade-off between accuracy and privacy. For publicly disclosed data, you should favor privacy. There are a number of useful tools for de-identification: PII scanners for Stata\sidenote{ - \url{https://github.com/J-PAL/stata_PII_scan}} + \url{https://github.com/J-PAL/stata_PII_scan}} or R\sidenote{ - \url{https://github.com/J-PAL/PII-Scan}}, + \url{https://github.com/J-PAL/PII-Scan}}, and tools for statistical disclosure control.\sidenote{ - \url{https://sdcpractice.readthedocs.io/en/latest/}} + \url{https://sdcpractice.readthedocs.io/en/latest/}} In cases where PII data is required for analysis, we recommend embargoing the sensitive variables when publishing the data. Access to the embargoed data could be granted for the purposes of study replication, if approved by an IRB. @@ -732,6 +732,4 @@ \section{Collecting and sharing data securely} -- a \textbf{tracking dataset} -- that records all events in which survey substitutions and loss to follow-up occurred in the field and how they were implemented and resolved. With the raw data securely stored and backed up, -and a de-identified dataset to work with, you are ready to move to data cleaning and analysis. - -%------------------------------------------------ +and a de-identified dataset to work with, you are ready to move to data cleaning and analysis. \ No newline at end of file From 143c8c6066831fe5946803960d3fea7380cd9f92 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Mon, 10 Feb 2020 14:48:25 -0500 Subject: [PATCH 600/854] [ch5] data security section and sub-section headers --- chapters/data-collection.tex | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index bfefde3b8..88b94026a 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -550,13 +550,10 @@ \section{Collecting and sharing data securely} or a cyber security expert within your organization have specified what it can be used for. In all other cases you should follow the steps laid out in this section. 
-in order to comply with ethical and legal requirements
+\subsection{Data security during data collection}
 
-In field surveys, most common data collection software will automatically \index{encryption}\textbf{encrypt}\sidenote{
-\textbf{Encryption:} Methods which ensure that files are unreadable even if laptops
-are stolen, databases are hacked, or unauthorized access to the data is obtained in
-any other way. \url{https://dimewiki.worldbank.org/wiki/Encryption}}
-all data submitted from the field while in transit (i.e., upload or download).\sidenote{
+In field surveys, most common data collection software will automatically encrypt
+all data in transit (i.e., upload from field or download from server).\sidenote{
 \url{https://dimewiki.worldbank.org/wiki/Encryption\#Encryption\_in\_Transit}}
 If this is implemented by the software you are using, then your
 data will be encrypted from the time it leaves the device (in tablet-assisted data
@@ -590,10 +587,12 @@ \section{Collecting and sharing data securely}
 there are two keys in a public/private key pair. Only the private key can be used to
 decrypt the encrypted data, and the public key can only be used to encrypt the data.
 It is therefore safe to send the public key to the tablet or the browser used to
-collect the data. When you enable encryption, the survey software will allow you to
+collect the data.
+
+When you enable encryption, the survey software will allow you to
 create and download -- once -- the public/private keyfile pair needed to decrypt the
 data. You upload the public key when you start a new survey, and all data collected using that
-public key can only be accessed with the private key from that public/private key
+public key can only be accessed with the private key from that specific public/private key
 pair. You must store the key pair in a secure location, such as a password manager, as
 there is no way to access your data if the private key is lost. Make sure you store
 keyfiles with descriptive names to match the survey to which they correspond. Any time
@@ -601,6 +600,7 @@ \section{Collecting and sharing data securely}
 your computer -- they will be asked to provide the keyfile. Only project team members
 named in the IRB are allowed access to the private keyfile.
 
+\subsection{Data security after data collection}
 
 For most analytical needs, you typically need to store the data somewhere other
 than the survey software's server, for example your computer or a cloud drive. While
@@ -608,7 +608,7 @@ \section{Collecting and sharing data securely}
 Instead, we want to use \textbf{symmetric encryption}\sidenote{
- \url{https://dimewiki.worldbank.org/wiki/Encryption\#Symmetric\_Encryption}} where we
+\url{https://dimewiki.worldbank.org/wiki/Encryption\#Symmetric\_Encryption}} where we
 create a secure encrypted folder, using for example VeraCrypt\sidenote{\url{https://www.veracrypt.fr/}},
 where a single key is used to both encrypt and decrypt the information. Since only one
 key is used, the workflow can be simplified: re-encryption after decryption can
 be done automatically, the same secure folder can be used for multiple files, and
 these files can be opened and modified like any unencrypted file as long as you
 have the key. The following workflow allows you to receive data and store it securely,
 without compromising data security:
 \begin{enumerate}
@@ -624,8 +624,8 @@ \section{Collecting and sharing data securely}
 need to keep track of your encryption keys, as without them your data is lost.
 If you remain lucky, you will never have to access your ``master'' or ``golden master''
 copies -- you just want to know they are there, safe, if you need them.
-All files sent to the field containing PII data, such as sampling lists, must be encrypted.
-You must never share passwords by email; rather, use a secure password manager.
+\subsection{Secure data sharing}
 You and your team will use your first copy of the raw data as the starting point for data
 cleaning and analysis. This raw dataset must remain encrypted at all times if it
 includes PII data, which is almost always the case. As long as the data is properly encrypted,
 using for example VeraCrypt, it can be shared using insecure modes of communication such as
 email or third-party syncing services. While this is safe from a data security perspective,
 this is a burdensome workflow, as anyone accessing the raw data must be listed on the IRB,
 have access to the decryption key, and know how to use that key. Fortunately, there is a way
 to simplify the workflow without compromising data security.

From 7a6e05ec8ba427e871d92c6502956536656c7260 Mon Sep 17 00:00:00 2001
From: Luiza
Date: Mon, 10 Feb 2020 15:54:41 -0500
Subject: [PATCH 601/854] [ch5] edits to line breaks only

---
 chapters/data-collection.tex | 243 ++++++++++++++++++++---------------
 1 file changed, 137 insertions(+), 106 deletions(-)

diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex
index 88b94026a..7d74f330c 100644
--- a/chapters/data-collection.tex
+++ b/chapters/data-collection.tex
@@ -517,50 +517,59 @@ \section{Collecting and sharing data securely}
 
 All sensitive data must be handled in a way where there is no risk that anyone who is
 not approved by an Institutional Review Board (IRB)\sidenote{\url{
-https://dimewiki.worldbank.org/wiki/IRB\_Approval}} for the specific project has the
-ability to access the data. Data can be sensitive for multiple reasons, but the two most
+https://dimewiki.worldbank.org/wiki/IRB\_Approval}}
+for the specific project has the
+ability to access the data.
+ Data can be sensitive for multiple reasons, but the two most
 common reasons are that it contains personally identifiable information (PII)\sidenote{
 \url{https://dimewiki.worldbank.org/wiki/Personally\_Identifiable\_Information\_(PII)}}
 or that the partner providing the data does not want it to be released.
 
 Central to data security is \index{encryption}\textbf{data encryption} which is a group
 of methods that ensure that files are unreadable even if laptops are stolen, servers
-are hacked, or unauthorized access to the data is obtained in any other way.\sidenote{\url{
-https://dimewiki.worldbank.org/wiki/Encryption}} Proper encryption is rarely just one thing as
-the data will travel through many servers, devices and computers from the source of the data
-to the final analysis. So encryption should be seen as a system that is only as secure as
-its weakest link. This section recommends a workflow with as few parts as possible, so that
-it is easy as possible to make sure the weakest link is still strong enough.
+are hacked, or unauthorized access to the data is obtained in any other way.
+\sidenote{\url{https://dimewiki.worldbank.org/wiki/Encryption}}
+Proper encryption is rarely a single step, as the data will travel through many servers,
+devices, and computers from the source of the data to the final analysis.
+So encryption should be seen as a system that is only as secure as its weakest link.
+This section recommends a workflow with as few parts as possible,
+so that it is as easy as possible to make sure the weakest link is still strong enough.
 
 Encrypted data is made readable again using decryption, and decryption requires a password or a key.
 You must never share passwords or keys by email, WhatsApp, or other insecure modes of communication;
 instead you must use a secure password manager\sidenote{\url{
 https://lastpass.com} or \url{https://bitwarden.com}}.
In addition to providing a way to securely share passwords,
password managers also provide a secure location for long-term storage of passwords and keys, regardless of
whether they are shared or not.

Many data sharing software providers will promote their services by saying they offer
on-the-fly encryption and decryption.
While this is not a bad thing, and it does make your data more secure,
on-the-fly encryption/decryption by itself is never secure enough,
because in order to make it automatic
the provider needs to keep a copy of the password or key.
Since it is unlikely that the software provider is
included in your IRB protocol, this is not good enough.

It is possible, in some enterprise versions of data sharing software, to set up on-the-fly encryption.
However, that setup is advanced, and you should never trust it unless you are a cybersecurity expert,
or a cybersecurity expert within your organization has specified what it can be used for.
In all other cases you should follow the steps laid out in this section.

\subsection{Data security during data collection}

In field surveys, most common data collection software will automatically encrypt
all data in transit (i.e., upload from field or download from server).\sidenote{
\url{https://dimewiki.worldbank.org/wiki/Encryption\#Encryption\_in\_Transit}}
If this is implemented by the software you are using,
then your data will be encrypted from the time it leaves the device
(in tablet-assisted data collection) or browser (in web data collection),
until it reaches the server.
Therefore, as long as you are using established survey software,
this step is largely taken care of.
However, the research team must ensure that all computers, tablets,
and accounts that are used in data collection have a secure logon
password and are never left unlocked.

Even though your data is therefore usually safe while it is being transmitted,
\textbf{Encryption at rest}\sidenote{
\url{https://dimewiki.worldbank.org/wiki/Encryption\#Encryption\_at\_Rest}}
is the only way to ensure that PII data remains private when it is stored on a
server on the internet.
You must keep your data encrypted on the data collection server whenever PII data is collected.
If you do not, the raw data will be accessible by
individuals who are not approved by your IRB,
such as tech support personnel,
server administrators, and other third-party staff.
Encryption at rest must be used to make
data files completely unusable without access to a security key specific to that
data -- a higher level of security than password protection.
Encryption at rest requires active participation from the user,
and you should be fully aware that if your decryption key is lost,
there is absolutely no way to recover your data.

You should not assume that your data is encrypted at rest by default, because of
the careful protocols it requires.
In most data collection platforms,
encryption at rest needs to be explicitly enabled and operated by the user.
There is no automatic way to implement this protocol,
because the encryption key that is generated must
never pass through the hands of a third party,
including the data storage application.

Most survey software implements \textbf{asymmetric encryption}\sidenote{\url{
https://dimewiki.worldbank.org/wiki/Encryption\#Asymmetric\_Encryption}}
where there are two keys in a public/private key pair.
Only the private key can be used to decrypt the encrypted data,
and the public key can only be used to encrypt the data.
It is therefore safe to send the public key to the tablet or the browser used to collect the data.

When you enable encryption, the survey software will allow you to create and
download -- once -- the public/private keyfile pair needed to decrypt the data.
You upload the public key when you start a new survey, and all data collected using that
public key can only be accessed with the private key from that specific public/private key pair.
You must store the key pair in a secure location, such as a password manager,
as there is no way to access your data if the private key is lost.
Make sure you store keyfiles with descriptive names to match the survey to which they correspond.
Any time anyone accesses the data --
either when viewing it in the browser or downloading it to your computer --
they will be asked to provide the keyfile.
Only project team members -named in the IRB are allowed access to the private keyfile. +download -- once -- the public/private keyfile pair needed to decrypt the data. +You upload the public key when you start a new survey, and all data collected using that +public key can only be accessed with the private key from that specific public/private key pair. +You must store the key pair in a secure location, such as a password manager, +as there is no way to access your data if the private key is lost. +Make sure you store keyfiles with descriptive names to match the survey to which they correspond. +Any time anyone accesses the data -- +either when viewing it in the browser or downloading it to your computer -- +they will be asked to provide the keyfile. +Only project team members named in the IRB are allowed access to the private keyfile. \subsection{Data security after data collection} For most analytical needs, you typically need a to store the data somewhere else -than the survey software's server, for example your computer or a cloud drive. While -asymmetric encryption is optimal for one-way transfer from the data collection device -to the data collection server, it is not practical once you start interacting with the data. +than the survey software's server, for example your computer or a cloud drive. +While asymmetric encryption is optimal for one-way transfer from the data collection device +to the data collection server, +it is not practical once you start interacting with the data. Instead we want to use \textbf{symmetric encryption}\sidenote{ -\url{https://dimewiki.worldbank.org/wiki/Encryption\#Symmetric\_Encryption}} where we -create a secure encrypted folder, using for example VeraCrypt\sidenote{\url{https://www.veracrypt.fr/}}, -where a single key is used to both encrypt and decrypt the information. Since only one -key is used, the work flow can be simplified, the re-encryption after decrypting can -be done automatically and the same secure folder can be used for multiple files, and -these files can be interacted with and modified like any unencryted file as long as you -have the key. The following workflow allows you to receive data and store it securely, +\url{https://dimewiki.worldbank.org/wiki/Encryption\#Symmetric\_Encryption}} +where we create a secure encrypted folder, +using for example VeraCrypt\sidenote{\url{https://www.veracrypt.fr/}}, +where a single key is used to both encrypt and decrypt the information. +Since only one key is used, the work flow can be simplified, +the re-encryption after decrypting can be done automatically and the same secure folder can be used for multiple files, +and these files can be interacted with and modified like any unencryted file as long as you have the key. +The following workflow allows you to receive data and store it securely, without compromising data security: \begin{enumerate} - \item Create a secure encrypted folder in your project folder, this should be on - your computer and could be in a shared folder. - \item Download data from the data collection server to that secure folder -- if - you encrypted the data during data collection you will need \textit{both} the - private key used during data collection to be able to download the data, \textit{and} - you will need the key used when created the secure folder to save it there. This - your first copy of your raw data, and the copy you will used in your cleaning and analysis. 
	\item Then create a secure folder on a pen drive or an external hard drive,
	that you can keep in your office.
	Copy the data you just downloaded to this second secure folder.
	This is your ``master'' copy of your raw data.
	(Instead of creating a second secure folder, you can simply copy the first secure folder.)
	\item Finally, create a third secure folder.
	Either you can create this on your computer and upload it to a long-term cloud storage service,
	or you can create it on an external hard drive that you then store in a separate location,
	for example at another office of your organization.
	This is your ``golden master'' copy of your raw data.
	You should never store the ``golden master'' copy of your raw data in a synced
	folder, where it will also be deleted from the cloud storage if it is deleted on your computer.
	(Instead of creating a third secure folder, you can simply copy the first secure folder.)
\end{enumerate}

\noindent This handling satisfies the \textbf{3-2-1 rule}:
there are two on-site copies of the data and one off-site copy,
so the data can never be lost in case of hardware failure.\sidenote{\url{
https://www.backblaze.com/blog/the-3-2-1-backup-strategy/}}
However, you still need to keep track of your encryption keys, as without them your data is lost.
If you remain lucky, you will never have to access your ``master'' or ``golden master'' copies --
you just want to know they are there, safe, if you need them.

\subsection{Secure data sharing}
You and your team will use your first copy of the raw data as the starting point for data
cleaning and analysis.
This raw dataset must remain encrypted at all times if it includes PII data,
which is almost always the case.
As long as the data is properly encrypted, using for example VeraCrypt,
it can be shared using insecure modes of communication such as email or third-party syncing services.
While this is safe from a data security perspective,
this is a burdensome workflow, as anyone accessing the raw data must be listed on the IRB,
have access to the decryption key, and know how to use that key.
Fortunately, there is a way to simplify the workflow without compromising data security.

To simplify the workflow,
the PII variables should be removed from your data at the earliest
possible opportunity, creating a de-identified copy of the data.
Once the data is de-identified,
it no longer needs to be encrypted --
therefore you and your team members can share it directly
without having to encrypt it and handle decryption keys.
The next chapter will discuss how to de-identify your data.
If PII variables are directly required for the analysis itself,
it will be necessary to keep at least a subset of the data encrypted through the data analysis process.

The data security standards that apply when receiving PII data obviously also apply when sending PII data.
A common example where this is often forgotten is when sending survey information,
such as sampling lists, to the field partner.
This data is by all definitions also PII data and must be encrypted.
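As a stylized sketch of that workflow, the Stata snippet below writes a field sampling list only
into a mounted VeraCrypt volume; the drive letter \texttt{Q:} and all variable and file names are
hypothetical assumptions, not fixed conventions of any particular survey software.

\begin{verbatim}
* Hypothetical sketch: export a field sampling list only into the secure folder.
* "Q:/" is assumed to be the drive letter of a mounted VeraCrypt volume;
* nothing containing PII is written outside of it.
use "master_sample.dta", clear
keep household_id respondent_name village phone_number

* Written inside the encrypted volume, never to a regular synced project folder
export delimited using "Q:/field-files/sampling_list_round2.csv", replace
\end{verbatim}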
A sampling list can often be used to reverse-identify a de-identified dataset,
so if you were to share it using an insecure method,
then that would be your weakest link that could break all the other steps
you have taken to ensure the privacy of the respondents.

In some survey software, you can use the same encryption that allows you to receive data securely
from the field to also send data, such as a sampling list, to the field.
But if you are not sure how that is done, or whether it can be done at all,
in the survey software you are using,
then you should create a secure folder using, for example,
VeraCrypt, and share that secure folder with the field team.
Remember that you must always share passwords and keys in a secure way, such as through a password manager.

From c531bb9a5eebf1d498d761a8ff3f2f669622a640 Mon Sep 17 00:00:00 2001
From: Maria
Date: Mon, 10 Feb 2020 16:37:33 -0500
Subject: [PATCH 602/854] Update chapters/introduction.tex

Co-Authored-By: Luiza Andrade
---
 chapters/introduction.tex | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/introduction.tex b/chapters/introduction.tex
index 81e5572d6..fe3abd10c 100644
--- a/chapters/introduction.tex
+++ b/chapters/introduction.tex
@@ -5,7 +5,7 @@
 An empirical revolution has changed the face of research economics rapidly over the last decade.
 %had to remove cite {\cite{angrist2017economic}} because of full page width
 Economics graduate students of the 2000s expected to work with primarily ``clean'' data from secondhand sources.
-Today, especially in the development subfield, working with raw data-
+Today, especially in the development subfield, working with raw data --
 whether collected through surveys or acquired from ``big'' data sources like sensors, satellites, or call data records-
 is a key skill for researchers and their staff.
 At the same time, the scope and scale of empirical research projects is expanding:

From c678a3ed8976ba981f0110a53220fe79dbf14e19 Mon Sep 17 00:00:00 2001
From: Maria
Date: Mon, 10 Feb 2020 16:37:52 -0500
Subject: [PATCH 603/854] Update chapters/introduction.tex

Co-Authored-By: Luiza Andrade
---
 chapters/introduction.tex | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/introduction.tex b/chapters/introduction.tex
index fe3abd10c..e5b758d9c 100644
--- a/chapters/introduction.tex
+++ b/chapters/introduction.tex
@@ -6,7 +6,7 @@
 %had to remove cite {\cite{angrist2017economic}} because of full page width
 Economics graduate students of the 2000s expected to work with primarily ``clean'' data from secondhand sources.
 Today, especially in the development subfield, working with raw data --
-whether collected through surveys or acquired from ``big'' data sources like sensors, satellites, or call data records-
+whether collected through surveys or acquired from ``big'' data sources like sensors, satellites, or call data records --
 is a key skill for researchers and their staff.
 At the same time, the scope and scale of empirical research projects is expanding:
 more people are working on the same data over longer timeframes.
From fe08432bd4418dafc744c495f37422f19ee09f52 Mon Sep 17 00:00:00 2001 From: Maria Date: Mon, 10 Feb 2020 16:38:16 -0500 Subject: [PATCH 604/854] Update chapters/introduction.tex Co-Authored-By: Luiza Andrade --- chapters/introduction.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index e5b758d9c..170234768 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -101,7 +101,7 @@ \section{Outline of this book} \section{Adopting reproducible workflows} We will provide free, open-source, and platform-agnostic tools wherever possible, and point to more detailed instructions when relevant. -Stata is the notable exception here due to its current popularity in economics. +Stata is the notable exception here due to its current popularity in development economics. Most tools have a learning and adaptation process, meaning you will become most comfortable with each tool only by using it in real-world work. From 7a2ea9dd02c777dfcb91836c56205e232c6400e4 Mon Sep 17 00:00:00 2001 From: Maria Date: Mon, 10 Feb 2020 16:38:27 -0500 Subject: [PATCH 605/854] Update chapters/introduction.tex Co-Authored-By: Luiza Andrade --- chapters/introduction.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index 170234768..fcc48289a 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -180,7 +180,7 @@ \section{Writing reproducible code in a collaborative environment} In particular, we point to two suites of Stata commands developed by DIME Analytics, \texttt{ietoolkit}\sidenote{\url{https://dimewiki.worldbank.org/wiki/ietoolkit}} and \texttt{iefieldkit},\sidenote{\url{https://dimewiki.worldbank.org/wiki/iefieldkit}}, -which standardize our core data collection workflows. +which standardize our core data collection, management, and analysis workflows. We will not explain Stata commands unless the command is rarely used or the feature we are using is outside common use case of that command. We will comment the code generously (as you should), From f48b1e2301e8e16edbca6e7e7b23fcfa7bf1a3cc Mon Sep 17 00:00:00 2001 From: Luiza Date: Mon, 10 Feb 2020 16:39:25 -0500 Subject: [PATCH 606/854] [ch5] moved de-identification to other chapters --- chapters/data-analysis.tex | 20 ++++++++++++++ chapters/data-collection.tex | 53 +++++++----------------------------- chapters/handling-data.tex | 45 +++++++++--------------------- chapters/publication.tex | 42 ++++++++++++++++++++++++++++ 4 files changed, 85 insertions(+), 75 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index d45c579f2..e5a548f35 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -176,6 +176,26 @@ \subsection{De-identification} However, if sensitive information is strictly needed for analysis, the data must be encrypted while performing the tasks described in this chapter. + + +The \textbf{initial de-identification} should happen directly after the encrypted data is downloaded to disk. +At this time, for each variable that contains PII, ask: will this variable be needed for analysis? +If not, the variable should be dropped. +Examples include respondent names, enumerator names, interview dates, and respondent phone numbers. +If the variable is needed for analysis, ask: +can I encode or otherwise construct a variable to use for the analysis that masks the PII, +and drop the original variable? 
+Examples include geocoordinates (after constructing measures of distance or area, drop the specific location), and names for social network analysis (can be encoded to unique numeric IDs). + + +Flagging all potentially identifying variables in the questionnaire design stage, +as recommended above, simplifies the initial de-identification. +You already have the list of variables to assess, +and ideally have already assessed those against the analysis plan. +If so, all you need to do is write a script to drop the variables that are not required for analysis, +encode or otherwise mask those that are required, and save a working version of the data. + + \subsection{Correction of data entry errors} There are two main cases when the raw data will be modified during data cleaning. diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 7d74f330c..9c043febc 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -708,47 +708,7 @@ \subsection{Secure data sharing} Remember that you must always share passwords and keys in a secure way like password managers. -%------------------------------------------------ - -\section{de-identification - for luiza to move - either back to this chapter or to other chapters} - -We recommend de-identification in two stages: -an initial process to remove direct identifiers to create a working de-identified dataset, -and a final process to remove all possible identifiers to create a publishable dataset. -The \textbf{initial de-identification} should happen directly after the encrypted data is downloaded to disk. -At this time, for each variable that contains PII, ask: will this variable be needed for analysis? -If not, the variable should be dropped. -Examples include respondent names, enumerator names, interview dates, and respondent phone numbers. -If the variable is needed for analysis, ask: -can I encode or otherwise construct a variable to use for the analysis that masks the PII, -and drop the original variable? -Examples include geocoordinates (after constructing measures of distance or area, drop the specific location), and names for social network analysis (can be encoded to unique numeric IDs). - - -Flagging all potentially identifying variables in the questionnaire design stage, -as recommended above, simplifies the initial de-identification. -You already have the list of variables to assess, -and ideally have already assessed those against the analysis plan. -If so, all you need to do is write a script to drop the variables that are not required for analysis, -encode or otherwise mask those that are required, and save a working version of the data. - -The \textbf{final de-identification} is a more involved process, -with the objective of creating a dataset for publication -that cannot be manipulated or linked to identify any individual research participant. -You must remove all direct and indirect identifiers, and assess the risk of statistical disclosure.\sidenote{ - \textbf{Disclosure risk:} the likelihood that a released data record can be associated with an individual or organization.} -\index{statistical disclosure} -There will almost always be a trade-off between accuracy and privacy. -For publicly disclosed data, you should favor privacy. 
-There are a number of useful tools for de-identification: PII scanners for Stata\sidenote{ - \url{https://github.com/J-PAL/stata_PII_scan}} -or R\sidenote{ - \url{https://github.com/J-PAL/PII-Scan}}, -and tools for statistical disclosure control.\sidenote{ - \url{https://sdcpractice.readthedocs.io/en/latest/}} -In cases where PII data is required for analysis, -we recommend embargoing the sensitive variables when publishing the data. -Access to the embargoed data could be granted for the purposes of study replication, if approved by an IRB. +\section{Finalizing data collection} When all data collection is complete, the survey team should prepare a final field report, which should report reasons for any deviations between the original sample and the dataset collected. @@ -762,5 +722,12 @@ \section{de-identification - for luiza to move - either back to this chapter or This information should be stored as a dataset in its own right -- a \textbf{tracking dataset} -- that records all events in which survey substitutions and loss to follow-up occurred in the field and how they were implemented and resolved. -With the raw data securely stored and backed up, -and a de-identified dataset to work with, you are ready to move to data cleaning and analysis. \ No newline at end of file + +At this point, the raw data securely stored and backed up. +It can now be transformed into your final analysis data set, +through the steps described in the next chapter. +Once the data collection is over, +you typically will no longer need to interact with the identified data. +So you should create a working version of it that you can safely interact with. +This is described in the next chapter as the first task in the data cleaning process, +but it's useful to get it started as soon as encrypted data is downloaded to disk. \ No newline at end of file diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 337ba7e5f..af09c6cb1 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -431,42 +431,23 @@ \subsection{De-identifying and anonymizing information} such as treatment statuses and weights, then removing identifiers. Therefore, once data is securely collected and stored, -the first thing you will generally do is \textbf{de-identify} it.\sidenote{ +the first thing you will generally do is \textbf{de-identify} it, +that is, to remove direct identifiers of the individuals in the dataset.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/De-identification}} \index{de-identification} -(We will provide more detail on this in the chapter on data collection.) -This will create a working de-identified copy -that can safely be shared among collaborators. -De-identified data should avoid, for example, -you being sent back to every household -to alert them that someone dropped all their personal information -on a public bus and we don't know who has it. -This simply means creating a copy of the data -that contains no personally-identifiable information. -This data should be an exact copy of the raw data, -except it would be okay if it were for some reason publicly released.\cite{matthews2011data} - Note, however, that it is in practice impossible to \textbf{anonymize} data. There is always some statistical chance that an individual's identity will be re-linked to the data collected about them -- even if that data has had all directly identifying information removed -- by using some other data that becomes identifying when analyzed together. 
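To illustrate why such re-linking is possible, the stylized Stata sketch below shows how an
auxiliary file such as a sampling list can be merged back onto a de-identified dataset; every
file and variable name here is a hypothetical placeholder, not a reference to any real study.

\begin{verbatim}
* Stylized illustration of re-identification risk (hypothetical names).
* A de-identified dataset and an auxiliary file that share an ID -- or even a
* combination of quasi-identifiers -- can be re-linked with a single merge.
use "survey_deidentified.dta", clear
merge 1:1 household_id using "sampling_list.dta", keep(match) nogenerate

* The merged data now contains names and phone numbers again, which is why
* auxiliary files must be protected as carefully as the raw data itself.
\end{verbatim}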
-There are a number of tools developed to help researchers de-identify data -and which you should use as appropriate at that stage of data collection. -These include \texttt{PII\_detection}\sidenote{\url{https://github.com/PovertyAction/PII\_detection}} from IPA, -\texttt{PII-scan}\sidenote{\url{https://github.com/J-PAL/PII-Scan}} from JPAL, -and \texttt{sdcMicro}\sidenote{\url{https://sdcpractice.readthedocs.io/en/latest/sdcMicro.html}} from the World Bank. -\index{anonymization} -The \texttt{sdcMicro} tool, in particular, has a feature -that allows you to assess the uniqueness of your data observations, -and simple measures of the identifiability of records from that. -Additional options to protect privacy in data that will become public exist, -and you should expect and intend to release your datasets at some point. -One option is to add noise to data, as the US Census Bureau has proposed,\cite{abowd2018us} -as it makes the trade-off between data accuracy and privacy explicit. -But there are no established norms for such ``differential privacy'' approaches: -most approaches fundamentally rely on judging ``how harmful'' information disclosure would be. -The fact remains that there is always a balance between information release (and therefore transparency) -and privacy protection, and that you should engage with it actively and explicitly. -The best thing you can do is make a complete record of the steps that have been taken -so that the process can be reviewed, revised, and updated as necessary. +For this reason, we recommend de-identification in two stages. +The \textbf{initial de-identification} process strips the data of direct identifiers +to create a working de-identified dataset that +can be \textit{within the research team} without the need for encryption. +The \textbf{final de-identification} process involves +making a decision about the trade-off between +risk of disclosure and utility of the data +before publicly releasing a dataset.\sidenote{ + \url{https://sdcpractice.readthedocs.io/en/latest/SDC\_intro.html\#need-for-sdc}} +We will provide more detail about the process and tools available +for initial and final de-identification in chapters 6 and 7, respectively. \ No newline at end of file diff --git a/chapters/publication.tex b/chapters/publication.tex index 9f8cf4c81..a33484c95 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -453,3 +453,45 @@ \subsection{Releasing a replication package} Do not use Dropbox or Google Drive for this purpose: many organizations do not allow access to these tools, and that includes blocking staff from accessing your material. + +%--------------------------------------- REFIT! +The \textbf{final de-identification} is a more involved process, +with the objective of creating a dataset for publication +that cannot be manipulated or linked to identify any individual research participant. +You must remove all direct and indirect identifiers, and assess the risk of statistical disclosure.\sidenote{ + \textbf{Disclosure risk:} the likelihood that a released data record can be associated with an individual or organization.} +\index{statistical disclosure} +There will almost always be a trade-off between accuracy and privacy. +For publicly disclosed data, you should favor privacy. 
+There are a number of useful tools for de-identification: PII scanners for Stata\sidenote{ + \url{https://github.com/J-PAL/stata_PII_scan}} +or R\sidenote{ + \url{https://github.com/J-PAL/PII-Scan}}, +and tools for statistical disclosure control.\sidenote{ + \url{https://sdcpractice.readthedocs.io/en/latest/}} +In cases where PII data is required for analysis, +we recommend embargoing the sensitive variables when publishing the data. +Access to the embargoed data could be granted for the purposes of study replication, if approved by an IRB. + + + +There are a number of tools developed to help researchers de-identify data +and which you should use as appropriate at that stage of data collection. +These include \texttt{PII\_detection}\sidenote{\url{https://github.com/PovertyAction/PII\_detection}} from IPA, +\texttt{PII-scan}\sidenote{\url{https://github.com/J-PAL/PII-Scan}} from JPAL, +and \texttt{sdcMicro}\sidenote{\url{https://sdcpractice.readthedocs.io/en/latest/sdcMicro.html}} from the World Bank. +\index{anonymization} +The \texttt{sdcMicro} tool, in particular, has a feature +that allows you to assess the uniqueness of your data observations, +and simple measures of the identifiability of records from that. +Additional options to protect privacy in data that will become public exist, +and you should expect and intend to release your datasets at some point. +One option is to add noise to data, as the US Census Bureau has proposed,\cite{abowd2018us} +as it makes the trade-off between data accuracy and privacy explicit. +But there are no established norms for such ``differential privacy'' approaches: +most approaches fundamentally rely on judging ``how harmful'' information disclosure would be. +The fact remains that there is always a balance between information release (and therefore transparency) +and privacy protection, and that you should engage with it actively and explicitly. +The best thing you can do is make a complete record of the steps that have been taken +so that the process can be reviewed, revised, and updated as necessary. + From 4e6209cdde9625fbcd8bc4710a4d09f21dc173cf Mon Sep 17 00:00:00 2001 From: Maria Date: Mon, 10 Feb 2020 16:40:05 -0500 Subject: [PATCH 607/854] Update chapters/introduction.tex --- chapters/introduction.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index fcc48289a..a69fc7e48 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -187,7 +187,7 @@ \section{Writing reproducible code in a collaborative environment} but you should reference Stata help-files \texttt{h [command]} whenever you do not understand the command that is being used. We hope that these snippets will provide a foundation for your code style. -Providing some standardization to Stata code style is also a goal of this team, +Providing some standardization to Stata code style is also a goal of this team; we provide our guidance on this in the DIME Analytics Stata Style Guide Appendix. From 68971b861f6e37dcd13af03909355740cf6bf1fb Mon Sep 17 00:00:00 2001 From: Luiza Date: Mon, 10 Feb 2020 16:40:25 -0500 Subject: [PATCH 608/854] Revert "[ch5] moved de-identification to other chapters" This reverts commit f48b1e2301e8e16edbca6e7e7b23fcfa7bf1a3cc. 
--- chapters/data-analysis.tex | 20 -------------- chapters/data-collection.tex | 53 +++++++++++++++++++++++++++++------- chapters/handling-data.tex | 45 +++++++++++++++++++++--------- chapters/publication.tex | 42 ---------------------------- 4 files changed, 75 insertions(+), 85 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index e5a548f35..d45c579f2 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -176,26 +176,6 @@ \subsection{De-identification} However, if sensitive information is strictly needed for analysis, the data must be encrypted while performing the tasks described in this chapter. - - -The \textbf{initial de-identification} should happen directly after the encrypted data is downloaded to disk. -At this time, for each variable that contains PII, ask: will this variable be needed for analysis? -If not, the variable should be dropped. -Examples include respondent names, enumerator names, interview dates, and respondent phone numbers. -If the variable is needed for analysis, ask: -can I encode or otherwise construct a variable to use for the analysis that masks the PII, -and drop the original variable? -Examples include geocoordinates (after constructing measures of distance or area, drop the specific location), and names for social network analysis (can be encoded to unique numeric IDs). - - -Flagging all potentially identifying variables in the questionnaire design stage, -as recommended above, simplifies the initial de-identification. -You already have the list of variables to assess, -and ideally have already assessed those against the analysis plan. -If so, all you need to do is write a script to drop the variables that are not required for analysis, -encode or otherwise mask those that are required, and save a working version of the data. - - \subsection{Correction of data entry errors} There are two main cases when the raw data will be modified during data cleaning. diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 9c043febc..7d74f330c 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -708,7 +708,47 @@ \subsection{Secure data sharing} Remember that you must always share passwords and keys in a secure way like password managers. -\section{Finalizing data collection} +%------------------------------------------------ + +\section{de-identification - for luiza to move - either back to this chapter or to other chapters} + +We recommend de-identification in two stages: +an initial process to remove direct identifiers to create a working de-identified dataset, +and a final process to remove all possible identifiers to create a publishable dataset. +The \textbf{initial de-identification} should happen directly after the encrypted data is downloaded to disk. +At this time, for each variable that contains PII, ask: will this variable be needed for analysis? +If not, the variable should be dropped. +Examples include respondent names, enumerator names, interview dates, and respondent phone numbers. +If the variable is needed for analysis, ask: +can I encode or otherwise construct a variable to use for the analysis that masks the PII, +and drop the original variable? +Examples include geocoordinates (after constructing measures of distance or area, drop the specific location), and names for social network analysis (can be encoded to unique numeric IDs). 
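To make this masking step concrete, the sketch below shows what such an initial de-identification script could look like. It is only an illustration in Python (pandas), not the workflow of any specific package, and every file name, variable name, and reference coordinate in it is a hypothetical placeholder.

\begin{verbatim}
# Illustrative sketch of an initial de-identification script.
# All file names, variable names, and coordinates are hypothetical.
import numpy as np
import pandas as pd

raw = pd.read_csv("raw_survey_export.csv")  # decrypted working copy

# Drop direct identifiers that are not needed for analysis.
raw = raw.drop(columns=["respondent_name", "enumerator_name",
                        "interview_date", "phone_number"])

# Encode names used for social network analysis as anonymous numeric IDs.
raw["contact_id"], _ = pd.factorize(raw["contact_name"])
raw = raw.drop(columns=["contact_name"])

# Convert geocoordinates into a distance measure, then drop the location.
ref_lat, ref_lon = np.radians(6.45), np.radians(3.39)  # reference point
lat, lon = np.radians(raw["gps_lat"]), np.radians(raw["gps_lon"])
x = (lon - ref_lon) * np.cos((lat + ref_lat) / 2)
y = lat - ref_lat
raw["dist_to_ref_km"] = 6371 * np.sqrt(x**2 + y**2)
raw = raw.drop(columns=["gps_lat", "gps_lon"])

raw.to_csv("working_deidentified.csv", index=False)
\end{verbatim}

The specific transformations matter less than the pattern: each identifying variable is either dropped outright or replaced by a masked version before the working copy is saved.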
+ + +Flagging all potentially identifying variables in the questionnaire design stage, +as recommended above, simplifies the initial de-identification. +You already have the list of variables to assess, +and ideally have already assessed those against the analysis plan. +If so, all you need to do is write a script to drop the variables that are not required for analysis, +encode or otherwise mask those that are required, and save a working version of the data. + +The \textbf{final de-identification} is a more involved process, +with the objective of creating a dataset for publication +that cannot be manipulated or linked to identify any individual research participant. +You must remove all direct and indirect identifiers, and assess the risk of statistical disclosure.\sidenote{ + \textbf{Disclosure risk:} the likelihood that a released data record can be associated with an individual or organization.} +\index{statistical disclosure} +There will almost always be a trade-off between accuracy and privacy. +For publicly disclosed data, you should favor privacy. +There are a number of useful tools for de-identification: PII scanners for Stata\sidenote{ + \url{https://github.com/J-PAL/stata_PII_scan}} +or R\sidenote{ + \url{https://github.com/J-PAL/PII-Scan}}, +and tools for statistical disclosure control.\sidenote{ + \url{https://sdcpractice.readthedocs.io/en/latest/}} +In cases where PII data is required for analysis, +we recommend embargoing the sensitive variables when publishing the data. +Access to the embargoed data could be granted for the purposes of study replication, if approved by an IRB. When all data collection is complete, the survey team should prepare a final field report, which should report reasons for any deviations between the original sample and the dataset collected. @@ -722,12 +762,5 @@ \section{Finalizing data collection} This information should be stored as a dataset in its own right -- a \textbf{tracking dataset} -- that records all events in which survey substitutions and loss to follow-up occurred in the field and how they were implemented and resolved. - -At this point, the raw data securely stored and backed up. -It can now be transformed into your final analysis data set, -through the steps described in the next chapter. -Once the data collection is over, -you typically will no longer need to interact with the identified data. -So you should create a working version of it that you can safely interact with. -This is described in the next chapter as the first task in the data cleaning process, -but it's useful to get it started as soon as encrypted data is downloaded to disk. \ No newline at end of file +With the raw data securely stored and backed up, +and a de-identified dataset to work with, you are ready to move to data cleaning and analysis. \ No newline at end of file diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index af09c6cb1..337ba7e5f 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -431,23 +431,42 @@ \subsection{De-identifying and anonymizing information} such as treatment statuses and weights, then removing identifiers. 
Therefore, once data is securely collected and stored, -the first thing you will generally do is \textbf{de-identify} it, -that is, to remove direct identifiers of the individuals in the dataset.\sidenote{ +the first thing you will generally do is \textbf{de-identify} it.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/De-identification}} \index{de-identification} +(We will provide more detail on this in the chapter on data collection.) +This will create a working de-identified copy +that can safely be shared among collaborators. +De-identified data should avoid, for example, +you being sent back to every household +to alert them that someone dropped all their personal information +on a public bus and we don't know who has it. +This simply means creating a copy of the data +that contains no personally-identifiable information. +This data should be an exact copy of the raw data, +except it would be okay if it were for some reason publicly released.\cite{matthews2011data} + Note, however, that it is in practice impossible to \textbf{anonymize} data. There is always some statistical chance that an individual's identity will be re-linked to the data collected about them -- even if that data has had all directly identifying information removed -- by using some other data that becomes identifying when analyzed together. -For this reason, we recommend de-identification in two stages. -The \textbf{initial de-identification} process strips the data of direct identifiers -to create a working de-identified dataset that -can be \textit{within the research team} without the need for encryption. -The \textbf{final de-identification} process involves -making a decision about the trade-off between -risk of disclosure and utility of the data -before publicly releasing a dataset.\sidenote{ - \url{https://sdcpractice.readthedocs.io/en/latest/SDC\_intro.html\#need-for-sdc}} -We will provide more detail about the process and tools available -for initial and final de-identification in chapters 6 and 7, respectively. \ No newline at end of file +There are a number of tools developed to help researchers de-identify data +and which you should use as appropriate at that stage of data collection. +These include \texttt{PII\_detection}\sidenote{\url{https://github.com/PovertyAction/PII\_detection}} from IPA, +\texttt{PII-scan}\sidenote{\url{https://github.com/J-PAL/PII-Scan}} from JPAL, +and \texttt{sdcMicro}\sidenote{\url{https://sdcpractice.readthedocs.io/en/latest/sdcMicro.html}} from the World Bank. +\index{anonymization} +The \texttt{sdcMicro} tool, in particular, has a feature +that allows you to assess the uniqueness of your data observations, +and simple measures of the identifiability of records from that. +Additional options to protect privacy in data that will become public exist, +and you should expect and intend to release your datasets at some point. +One option is to add noise to data, as the US Census Bureau has proposed,\cite{abowd2018us} +as it makes the trade-off between data accuracy and privacy explicit. +But there are no established norms for such ``differential privacy'' approaches: +most approaches fundamentally rely on judging ``how harmful'' information disclosure would be. +The fact remains that there is always a balance between information release (and therefore transparency) +and privacy protection, and that you should engage with it actively and explicitly. 
+The best thing you can do is make a complete record of the steps that have been taken +so that the process can be reviewed, revised, and updated as necessary. diff --git a/chapters/publication.tex b/chapters/publication.tex index a33484c95..9f8cf4c81 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -453,45 +453,3 @@ \subsection{Releasing a replication package} Do not use Dropbox or Google Drive for this purpose: many organizations do not allow access to these tools, and that includes blocking staff from accessing your material. - -%--------------------------------------- REFIT! -The \textbf{final de-identification} is a more involved process, -with the objective of creating a dataset for publication -that cannot be manipulated or linked to identify any individual research participant. -You must remove all direct and indirect identifiers, and assess the risk of statistical disclosure.\sidenote{ - \textbf{Disclosure risk:} the likelihood that a released data record can be associated with an individual or organization.} -\index{statistical disclosure} -There will almost always be a trade-off between accuracy and privacy. -For publicly disclosed data, you should favor privacy. -There are a number of useful tools for de-identification: PII scanners for Stata\sidenote{ - \url{https://github.com/J-PAL/stata_PII_scan}} -or R\sidenote{ - \url{https://github.com/J-PAL/PII-Scan}}, -and tools for statistical disclosure control.\sidenote{ - \url{https://sdcpractice.readthedocs.io/en/latest/}} -In cases where PII data is required for analysis, -we recommend embargoing the sensitive variables when publishing the data. -Access to the embargoed data could be granted for the purposes of study replication, if approved by an IRB. - - - -There are a number of tools developed to help researchers de-identify data -and which you should use as appropriate at that stage of data collection. -These include \texttt{PII\_detection}\sidenote{\url{https://github.com/PovertyAction/PII\_detection}} from IPA, -\texttt{PII-scan}\sidenote{\url{https://github.com/J-PAL/PII-Scan}} from JPAL, -and \texttt{sdcMicro}\sidenote{\url{https://sdcpractice.readthedocs.io/en/latest/sdcMicro.html}} from the World Bank. -\index{anonymization} -The \texttt{sdcMicro} tool, in particular, has a feature -that allows you to assess the uniqueness of your data observations, -and simple measures of the identifiability of records from that. -Additional options to protect privacy in data that will become public exist, -and you should expect and intend to release your datasets at some point. -One option is to add noise to data, as the US Census Bureau has proposed,\cite{abowd2018us} -as it makes the trade-off between data accuracy and privacy explicit. -But there are no established norms for such ``differential privacy'' approaches: -most approaches fundamentally rely on judging ``how harmful'' information disclosure would be. -The fact remains that there is always a balance between information release (and therefore transparency) -and privacy protection, and that you should engage with it actively and explicitly. -The best thing you can do is make a complete record of the steps that have been taken -so that the process can be reviewed, revised, and updated as necessary. 
- From 0d9a6a8ca796a25e2bd81a3d58c50517f4d20295 Mon Sep 17 00:00:00 2001 From: Maria Date: Mon, 10 Feb 2020 16:41:24 -0500 Subject: [PATCH 609/854] Update chapters/introduction.tex --- chapters/introduction.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index a69fc7e48..225427916 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -73,7 +73,7 @@ \section{Doing credible research at scale} \section{Outline of this book} -The book progresses through the typical workflow of an empirical research project. +The book covers each stage of an empirical research project, from design to publication. We start with ethical principles to guide empirical research, focusing on research transparency and the right to privacy. The second chapter discusses the importance of planning data work at the outset of the research project - From 2dafda8c8ac76a27ea080f2b19ff4c73e2290ae1 Mon Sep 17 00:00:00 2001 From: Luiza Date: Mon, 10 Feb 2020 16:45:14 -0500 Subject: [PATCH 610/854] [ch5] removed de-identification --- chapters/data-collection.tex | 53 +++++++----------------------------- chapters/handling-data.tex | 45 +++++++++--------------------- 2 files changed, 23 insertions(+), 75 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 7d74f330c..9c043febc 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -708,47 +708,7 @@ \subsection{Secure data sharing} Remember that you must always share passwords and keys in a secure way like password managers. -%------------------------------------------------ - -\section{de-identification - for luiza to move - either back to this chapter or to other chapters} - -We recommend de-identification in two stages: -an initial process to remove direct identifiers to create a working de-identified dataset, -and a final process to remove all possible identifiers to create a publishable dataset. -The \textbf{initial de-identification} should happen directly after the encrypted data is downloaded to disk. -At this time, for each variable that contains PII, ask: will this variable be needed for analysis? -If not, the variable should be dropped. -Examples include respondent names, enumerator names, interview dates, and respondent phone numbers. -If the variable is needed for analysis, ask: -can I encode or otherwise construct a variable to use for the analysis that masks the PII, -and drop the original variable? -Examples include geocoordinates (after constructing measures of distance or area, drop the specific location), and names for social network analysis (can be encoded to unique numeric IDs). - - -Flagging all potentially identifying variables in the questionnaire design stage, -as recommended above, simplifies the initial de-identification. -You already have the list of variables to assess, -and ideally have already assessed those against the analysis plan. -If so, all you need to do is write a script to drop the variables that are not required for analysis, -encode or otherwise mask those that are required, and save a working version of the data. - -The \textbf{final de-identification} is a more involved process, -with the objective of creating a dataset for publication -that cannot be manipulated or linked to identify any individual research participant. 
-You must remove all direct and indirect identifiers, and assess the risk of statistical disclosure.\sidenote{ - \textbf{Disclosure risk:} the likelihood that a released data record can be associated with an individual or organization.} -\index{statistical disclosure} -There will almost always be a trade-off between accuracy and privacy. -For publicly disclosed data, you should favor privacy. -There are a number of useful tools for de-identification: PII scanners for Stata\sidenote{ - \url{https://github.com/J-PAL/stata_PII_scan}} -or R\sidenote{ - \url{https://github.com/J-PAL/PII-Scan}}, -and tools for statistical disclosure control.\sidenote{ - \url{https://sdcpractice.readthedocs.io/en/latest/}} -In cases where PII data is required for analysis, -we recommend embargoing the sensitive variables when publishing the data. -Access to the embargoed data could be granted for the purposes of study replication, if approved by an IRB. +\section{Finalizing data collection} When all data collection is complete, the survey team should prepare a final field report, which should report reasons for any deviations between the original sample and the dataset collected. @@ -762,5 +722,12 @@ \section{de-identification - for luiza to move - either back to this chapter or This information should be stored as a dataset in its own right -- a \textbf{tracking dataset} -- that records all events in which survey substitutions and loss to follow-up occurred in the field and how they were implemented and resolved. -With the raw data securely stored and backed up, -and a de-identified dataset to work with, you are ready to move to data cleaning and analysis. \ No newline at end of file + +At this point, the raw data securely stored and backed up. +It can now be transformed into your final analysis data set, +through the steps described in the next chapter. +Once the data collection is over, +you typically will no longer need to interact with the identified data. +So you should create a working version of it that you can safely interact with. +This is described in the next chapter as the first task in the data cleaning process, +but it's useful to get it started as soon as encrypted data is downloaded to disk. \ No newline at end of file diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 337ba7e5f..a8fff50b0 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -431,42 +431,23 @@ \subsection{De-identifying and anonymizing information} such as treatment statuses and weights, then removing identifiers. Therefore, once data is securely collected and stored, -the first thing you will generally do is \textbf{de-identify} it.\sidenote{ +the first thing you will generally do is \textbf{de-identify} it, +that is, to remove direct identifiers of the individuals in the dataset.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/De-identification}} \index{de-identification} -(We will provide more detail on this in the chapter on data collection.) -This will create a working de-identified copy -that can safely be shared among collaborators. -De-identified data should avoid, for example, -you being sent back to every household -to alert them that someone dropped all their personal information -on a public bus and we don't know who has it. -This simply means creating a copy of the data -that contains no personally-identifiable information. 
-This data should be an exact copy of the raw data,
-except it would be okay if it were for some reason publicly released.\cite{matthews2011data}
-
 Note, however, that it is in practice impossible to \textbf{anonymize} data.
 There is always some statistical chance that an individual's identity
 will be re-linked to the data collected about them
 -- even if that data has had all directly identifying information removed --
 by using some other data that becomes identifying when analyzed together.
-There are a number of tools developed to help researchers de-identify data
-and which you should use as appropriate at that stage of data collection.
-These include \texttt{PII\_detection}\sidenote{\url{https://github.com/PovertyAction/PII\_detection}} from IPA,
-\texttt{PII-scan}\sidenote{\url{https://github.com/J-PAL/PII-Scan}} from JPAL,
-and \texttt{sdcMicro}\sidenote{\url{https://sdcpractice.readthedocs.io/en/latest/sdcMicro.html}} from the World Bank.
-\index{anonymization}
-The \texttt{sdcMicro} tool, in particular, has a feature
-that allows you to assess the uniqueness of your data observations,
-and simple measures of the identifiability of records from that.
-Additional options to protect privacy in data that will become public exist,
-and you should expect and intend to release your datasets at some point.
-One option is to add noise to data, as the US Census Bureau has proposed,\cite{abowd2018us}
-as it makes the trade-off between data accuracy and privacy explicit.
-But there are no established norms for such ``differential privacy'' approaches:
-most approaches fundamentally rely on judging ``how harmful'' information disclosure would be.
-The fact remains that there is always a balance between information release (and therefore transparency)
-and privacy protection, and that you should engage with it actively and explicitly.
-The best thing you can do is make a complete record of the steps that have been taken
-so that the process can be reviewed, revised, and updated as necessary.
+For this reason, we recommend de-identification in two stages.
+The \textbf{initial de-identification} process strips the data of direct identifiers
+to create a working de-identified dataset that
+can be shared \textit{within the research team} without the need for encryption.
+The \textbf{final de-identification} process involves
+making a decision about the trade-off between
+risk of disclosure and utility of the data
+before publicly releasing a dataset.\sidenote{
+ \url{https://sdcpractice.readthedocs.io/en/latest/SDC\_intro.html\#need-for-sdc}}
+We will provide more detail about the process and tools available
+for initial and final de-identification in chapters 6 and 7, respectively.
\ No newline at end of file

From 93d734ccccd62329a765f4c743473920fb7669f1 Mon Sep 17 00:00:00 2001
From: Maria 
Date: Mon, 10 Feb 2020 16:46:20 -0500
Subject: [PATCH 611/854] Update chapters/introduction.tex

---
 chapters/introduction.tex | 1 -
 1 file changed, 1 deletion(-)

diff --git a/chapters/introduction.tex b/chapters/introduction.tex
index 225427916..46eacfa20 100644
--- a/chapters/introduction.tex
+++ b/chapters/introduction.tex
@@ -4,7 +4,6 @@
 how to handle data effectively, efficiently, and ethically.
 An empirical revolution has changed the face of research economics rapidly over the last decade.
 %had to remove cite {\cite{angrist2017economic}} because of full page width
-Economics graduate students of the 2000s expected to work with primarily ``clean'' data from secondhand sources.
Today, especially in the development subfield, working with raw data -- whether collected through surveys or acquired from ``big'' data sources like sensors, satellites, or call data records -- is a key skill for researchers and their staff. From d72ef2b151f8e40218d09ffb149c9035e2c793f9 Mon Sep 17 00:00:00 2001 From: Luiza Date: Mon, 10 Feb 2020 17:37:20 -0500 Subject: [PATCH 612/854] [ch1] removed citation I don't see how this link relates to the text --- chapters/handling-data.tex | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 37ffc7f6c..b89f3d232 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -77,8 +77,7 @@ \section{Protecting confidence in development research} \subsection{Research reproducibility} Reproducible research means that the actual analytical processes you used are executable by others.\cite{dafoe2014science} -(We use ``reproducibility'' to refer to the precise analytical code in a specific study.\sidenote{ - \url{http://datacolada.org/76}}) +(We use ``reproducibility'' to refer to the precise analytical code in a specific study. All your code files involving data cleaning, construction and analysis should be public, unless they contain identifying information. Nobody should have to guess what exactly comprises a given index, From 3c7fd8402920bfe717d60f5d33099a4d199edf43 Mon Sep 17 00:00:00 2001 From: Luiza Date: Mon, 10 Feb 2020 17:37:42 -0500 Subject: [PATCH 613/854] [ch1] reference to the empirical revolution --- chapters/handling-data.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index b89f3d232..e2165aae4 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -57,7 +57,7 @@ \section{Protecting confidence in development research} are accurately reported and preserved as outputs in themselves.\sidenote{ \url{https://www.aeaweb.org/journals/policies/data-code/}} -The empirical revolution in development research +The empirical revolution in development research\cite{angrist2017economic} has therefore led to increased public scrutiny of the reliability of research.\cite{rogers_2017}\index{transparency}\index{credibility}\index{reproducibility} Three major components make up this scrutiny: \textbf{reproducibility}\cite{duvendack2017meant}, \textbf{transparency},\cite{christensen2018transparency} and \textbf{credibility}.\cite{ioannidis2017power} Development researchers should take these concerns seriously. From b0b2b7b3a45aff438abb54553c6e4d34f25efd74 Mon Sep 17 00:00:00 2001 From: Luiza Date: Mon, 10 Feb 2020 17:43:38 -0500 Subject: [PATCH 614/854] [ch1] clarified confusing sentence --- chapters/handling-data.tex | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index e2165aae4..07c2be53d 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -77,7 +77,9 @@ \section{Protecting confidence in development research} \subsection{Research reproducibility} Reproducible research means that the actual analytical processes you used are executable by others.\cite{dafoe2014science} -(We use ``reproducibility'' to refer to the precise analytical code in a specific study. 
+(We use ``reproducible'' and ``replicable'' interchangeably in this book, +though there is much discussion about the use and definition of these concepts.\sidenote{ +\url{https://www.nap.edu/resource/25303/R&R.pdf}}) All your code files involving data cleaning, construction and analysis should be public, unless they contain identifying information. Nobody should have to guess what exactly comprises a given index, From 2193cf11f75ed194671a820f71fe96fe45608404 Mon Sep 17 00:00:00 2001 From: Luiza Date: Mon, 10 Feb 2020 17:47:42 -0500 Subject: [PATCH 615/854] [ch1] moving citations to first appearance of term --- chapters/handling-data.tex | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 07c2be53d..1fb734e8d 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -133,8 +133,11 @@ \subsection{Research transparency} and, as we hope to convince you, make the process easier for themselves, because it requires methodical organization that is labor-saving and efficient over the complete course of a project. -Tools like pre-registration, pre-analysis plans, and -\textbf{Registered Reports}\sidenote{ +Tools like \textbf{pre-registration}\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Pre-Registration}}, +\textbf{pre-analysis plans}\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Pre-Analysis_Plan}}, +and \textbf{registered reports}\sidenote{ \url{https://blogs.worldbank.org/impactevaluations/registered-reports-piloting-pre-results-review-process-journal-development-economics}} can help with this process where they are available.\index{pre-registration}\index{pre-analysis plans}\index{Registered Reports} By pre-specifying a large portion of the research design,\sidenote{ @@ -192,15 +195,13 @@ \subsection{Research credibility} Is the research design sufficiently powered through its sampling and randomization? Were the key research outcomes pre-specified or chosen ex-post? How sensitive are the results to changes in specifications or definitions? 
-Tools such as \textbf{pre-analysis plans}\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Pre-Analysis_Plan}} +Tools such as \textbf{pre-analysis plans} can be used to assuage these concerns for experimental evaluations \index{pre-analysis plan} by fully specifying some set of analysis intended to be conducted, but they may feel like ``golden handcuffs'' for other types of research.\cite{olken2015promises} Regardless of whether or not a formal pre-analysis plan is utilized, -all experimental and observational studies should be \textbf{pre-registered}\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Pre-Registration}} +all experimental and observational studies should be pre-registered simply to create a record of the fact that the study was undertaken.\sidenote{\url{http://datacolada.org/12}} This is increasingly required by publishers and can be done very quickly using the \textbf{AEA} database\sidenote{\url{https://www.socialscienceregistry.org/}}, From 07ace5618d61f73ba9f2468b1644523a05b3a262 Mon Sep 17 00:00:00 2001 From: Luiza Date: Mon, 10 Feb 2020 17:50:06 -0500 Subject: [PATCH 616/854] [c1] sentence clarification --- chapters/handling-data.tex | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 1fb734e8d..38ff6c835 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -155,8 +155,9 @@ \subsection{Research transparency} Documenting a project in detail greatly increases transparency. Many disciplines have a tradition of keeping a ``lab notebook'', -and adapting and expanding this process for the development -of lab-style working groups in development is a critical step. +and adapting and expanding this process to create a +lab-style workflow in the development field is a +critical step towards more transparent practices. This means explicitly noting decisions as they are made, and explaining the process behind the decision-making. Documentation on data processing and additional hypotheses tested From 188b8d1a8415b3f271d669655278fc809ef0c512 Mon Sep 17 00:00:00 2001 From: Luiza Date: Mon, 10 Feb 2020 17:53:40 -0500 Subject: [PATCH 617/854] [c1] repetitive content on PAP --- chapters/handling-data.tex | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 38ff6c835..026195ff0 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -147,7 +147,7 @@ \subsection{Research transparency} This is meant to combat the ``file-drawer problem'',\cite{simonsohn2014p} and ensure that researchers are transparent in the additional sense that all the results obtained from registered studies are actually published. -In no way should this be viewed as binding the hands of the researcher: +In no way should this be viewed as binding the hands of the researcher:\cite{olken2015promises} anything outside the original plan is just as interesting and valuable as it would have been if the the plan was never published; but having pre-committed to any particular inquiry makes its results @@ -181,7 +181,7 @@ \subsection{Research transparency} \textbf{GitHub}\sidenote{\url{https://github.com}} provides a transparent documentation system\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Getting_started_with_GitHub}},\index{task management}\index{GitHub} in addition to version histories and wiki pages. 
-Such services offers multiple different ways
+Such services offer multiple different ways
 to record the decision process leading to changes and additions,
 track and register discussions, and manage tasks.
 These are flexible tools that can be adapted to different team and project dynamics.
@@ -196,11 +196,9 @@ \subsection{Research credibility}
 Is the research design sufficiently powered through its sampling and randomization?
 Were the key research outcomes pre-specified or chosen ex-post?
 How sensitive are the results to changes in specifications or definitions?
-Tools such as \textbf{pre-analysis plans}
-can be used to assuage these concerns for experimental evaluations
+Pre-analysis plans can be used to assuage these concerns for experimental evaluations
 \index{pre-analysis plan}
-by fully specifying some set of analysis intended to be conducted,
-but they may feel like ``golden handcuffs'' for other types of research.\cite{olken2015promises}
+by fully specifying some set of analysis intended to be conducted.
 Regardless of whether or not a formal pre-analysis plan is utilized,
 all experimental and observational studies should be pre-registered
 simply to create a record of the fact that the study was undertaken.\sidenote{\url{http://datacolada.org/12}}

From 002bb79701bc1bcc1adb2a04a29be00506ef90c1 Mon Sep 17 00:00:00 2001
From: Luiza 
Date: Mon, 10 Feb 2020 18:42:06 -0500
Subject: [PATCH 618/854] [ch1] changing language as in #370

---
 chapters/handling-data.tex | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex
index 026195ff0..6867e2072 100644
--- a/chapters/handling-data.tex
+++ b/chapters/handling-data.tex
@@ -350,16 +350,14 @@ \subsection{Transmitting and storing data securely}
 inside that secure environment if multiple users share accounts.
 However, password-protection alone is not sufficient,
 because if the underlying data is obtained through a leak the information itself remains usable.
-Raw data which contains PII \textit{must} therefore be \textbf{encrypted}\sidenote{
+Data sets that contain confidential information
+\textit{must} therefore be \textbf{encrypted}\sidenote{
 \textbf{Encryption:} Methods which ensure that files are unreadable even if laptops are stolen, databases are hacked, or any other type of unauthorized access is obtained. \url{https://dimewiki.worldbank.org/wiki/encryption}}
 during data collection, storage, and transfer.\index{encryption}\index{data transfer}\index{data storage}
 The biggest security gap is often in transmitting survey plans to and from staff in the field,
 since staff with technical specialization are usually in an HQ office.
 To protect information in transit to field staff, some key steps are:
-(a) to ensure that all devices that store PII data have hard drive encryption and password-protection;
-(b) that no PII information is sent over e-mail, WhatsApp, etc. without encrypting the information first;
-and (c) all field staff receive adequate training on the privacy standards applicable to their work.
 
 Most modern data collection software has features that,
 if enabled, make secure transmission straightforward.\sidenote{
@@ -371,6 +369,11 @@ \subsection{Transmitting and storing data securely}
 the files that would be obtained would be useless to the recipient.
 In security language this person is often referred to as an ``intruder''
 but it is rare that data breaches are malicious or even intentional.
+(a) ensure that all devices that store confidential data +have hard drive encryption and password-protection; +(b) never send confidential data over e-mail, WhatsApp, etc. +without encrypting the information first; and +(c) train all field staff on the adequate privacy standards applicable to their work. The easiest way to protect personal information is not to use it. It is often very simple to conduct planning and analytical work From 445f46042a895b1b98d5a0fee13a375e8d914202 Mon Sep 17 00:00:00 2001 From: Luiza Date: Mon, 10 Feb 2020 18:42:35 -0500 Subject: [PATCH 619/854] [ch1] minor revisions --- chapters/handling-data.tex | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 6867e2072..c68a7477e 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -98,7 +98,7 @@ \subsection{Research reproducibility} \index{GitHub} Such services can show things like modifications made in response to referee comments, by having tagged version histories at each major revision. -These services can also use issue trackers and abandoned work branches +They also allow you to use issue trackers and abandoned work branches to document the research paths and questions you may have tried to answer as a resource to others who have similar questions. @@ -121,9 +121,9 @@ \subsection{Research reproducibility} \subsection{Research transparency} Transparent research will expose not only the code, -but all the other research processes involved in developing the analytical approach.\sidenote{ +but all research processes involved in developing the analytical approach.\sidenote{ \url{http://www.princeton.edu/~mjs3/open_and_reproducible_opr_2017.pdf}} -This means that readers be able to judge for themselves if the research was done well +This means that readers are able to judge for themselves if the research was done well and the decision-making process was sound. If the research is well-structured, and all of the relevant documentation\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Data_Documentation}} @@ -209,8 +209,7 @@ \subsection{Research credibility} or the \textbf{OSF} registry\sidenote{\url{https://osf.io/registries}} as appropriate. \index{pre-registration} -Common research standards from journals, funders, and others feature both ex -ante +Common research standards from journals and funders feature both ex ante (or ``regulation'') and ex post (or ``verification'') policies.\cite{stodden2013toward} Ex ante policies require that authors bear the burden of ensuring they provide some set of materials before publication @@ -253,7 +252,7 @@ \section{Ensuring privacy and security in research data} that can be used to identify an individual research subject. \url{https://dimewiki.worldbank.org/wiki/De-identification\#Personally\_Identifiable\_Information}}. PII data contains information that can, without any transformation, be used to identify -individual people, households, villages, or firms that were included in \textbf{data collection}. +individual people, households, villages, or firms that were part of data collection. 
\index{data collection} This includes names, addresses, and geolocations, and extends to personal information such as email addresses, phone numbers, and financial information.\index{geodata}\index{de-identification} @@ -289,12 +288,13 @@ \section{Ensuring privacy and security in research data} from recently advanced data rights and regulations, these considerations are critically important. Check with your organization if you have any legal questions; -in general, you are responsible to avoid taking any action that +in general, you are responsible for avoiding any action that knowingly or recklessly ignores these considerations. \subsection{Obtaining ethical approval and consent} -For almost all data collection or research activities that involves PII data, +For almost all data collection and research activities that involves +human subjects or PII data, you will be required to complete some form of \textbf{Institutional Review Board (IRB)} process.\sidenote{ \textbf{Institutional Review Board (IRB):} An institution formally responsible for ensuring that research meets ethical standards.} \index{Institutional Review Board} @@ -342,6 +342,7 @@ \subsection{Transmitting and storing data securely} Secure data storage and transfer are ultimately your personal responsibility.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Data_Security}} +There are several precautions needed to ensure that your data is safe. First, all online and offline accounts -- including personal accounts like computer logins and email -- need to be protected by strong and unique passwords. @@ -355,8 +356,7 @@ \subsection{Transmitting and storing data securely} \textbf{Encryption:} Methods which ensure that files are unreadable even if laptops are stolen, databases are hacked, or any other type of unauthorized access is obtained. \url{https://dimewiki.worldbank.org/wiki/encryption}} during data collection, storage, and transfer.\index{encryption}\index{data transfer}\index{data storage} -The biggest security gap is often in transmitting survey plans to and from staff in the field, -since staff with technical specialization are usually in an HQ office. +The biggest security gap is often in transmitting survey plans to and from staff in the field. To protect information in transit to field staff, some key steps are: Most modern data collection software has features that, @@ -380,9 +380,9 @@ \subsection{Transmitting and storing data securely} using a subset of the data that has anonymous identifying ID variables, and has had personal characteristics removed from the dataset altogether. We encourage this approach, because it is easy. -However, when PII is absolutely necessary for work, -such as geographical location, application of intervention programs, -or planning or submission of survey materials, +However, when PII is absolutely necessary to a task, +such as implementing an intervention +or submitting survey data, you must actively protect those materials in transmission and storage. 
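As one concrete illustration of what that protection can involve, the sketch below encrypts a file before it is shared, using the symmetric Fernet interface from the Python \texttt{cryptography} package. The file names are hypothetical placeholders, and the key itself would still need to be exchanged through a secure channel such as a password manager.

\begin{verbatim}
# Illustrative sketch: encrypt a confidential file before sharing it.
# File names are hypothetical placeholders.
from cryptography.fernet import Fernet

key = Fernet.generate_key()           # store only in a password manager
fernet = Fernet(key)

with open("household_listing.csv", "rb") as f:
    token = fernet.encrypt(f.read())  # unreadable without the key

with open("household_listing.csv.enc", "wb") as f:
    f.write(token)                    # this version is safe to transmit

# A recipient who holds the key reverses the process:
with open("household_listing.csv.enc", "rb") as f:
    original = Fernet(key).decrypt(f.read())
\end{verbatim}

Sending only the encrypted file, with the key shared separately, means that an intercepted message on its own discloses nothing.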
There are plenty of options available to keep your data safe, From 13d42bf60f0cb7d75b8f2910bf18c86c93f0f524 Mon Sep 17 00:00:00 2001 From: Luiza Date: Mon, 10 Feb 2020 18:42:47 -0500 Subject: [PATCH 620/854] [c1] remove duplicated content from chapter 5 --- chapters/handling-data.tex | 11 ----------- 1 file changed, 11 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index c68a7477e..637bb53f6 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -358,17 +358,6 @@ \subsection{Transmitting and storing data securely} during data collection, storage, and transfer.\index{encryption}\index{data transfer}\index{data storage} The biggest security gap is often in transmitting survey plans to and from staff in the field. To protect information in transit to field staff, some key steps are: - -Most modern data collection software has features that, -if enabled, make secure transmission straightforward.\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/SurveyCTO_Form_Settings}} -Many also have features that ensure data is encrypted when stored on their servers, -although this usually needs to be actively enabled and administered. -Proper encryption means that, -even if the information were to be intercepted or made public, -the files that would be obtained would be useless to the recipient. -In security language this person is often referred to as an ``intruder'' -but it is rare that data breaches are malicious or even intentional. (a) ensure that all devices that store confidential data have hard drive encryption and password-protection; (b) never send confidential data over e-mail, WhatsApp, etc. From 1f5fa89f4140defdb0775d0cf760fe818ff30057 Mon Sep 17 00:00:00 2001 From: Luiza Date: Mon, 10 Feb 2020 20:14:03 -0500 Subject: [PATCH 621/854] [ch7] add content on final deidentification --- chapters/publication.tex | 82 +++++++++++++++++++++++++++++++++++++--- 1 file changed, 76 insertions(+), 6 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 1fb3830fa..cc1bd7a6c 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -242,6 +242,72 @@ \subsection{Getting started with \LaTeX\ in the cloud} %------------------------------------------------ +\section{Publishing primary data} + +If your project collected primary data, +releasing the raw dataset is a significant contribution that can be made +in addition to any publication of analysis results. +Publishing raw data can foster collaboration with researchers +interested in the same subjects as your team. +Collaboration can enable your team to fully explore variables and +questions that you may not have time to focus on otherwise, +even though data was collected on them. +There are different options for data publication. +The World Bank's Development Data Hub\sidenote{ + \url{https://data.worldbank.org/}} +includes a Microdata Catalog\sidenote{ +\url{https://microdata.worldbank.org/index.php/home}} +where researchers can publish data and documentation for their projects. +The Harvard Dataverse\sidenote{ + \url{https://dataverse.harvard.edu}} +publishes both data and code, +and also creates a data citation for its entries -- +IPA/J-PAL field experiment repository is especially relevant\sidenote{ + \url{https://www.povertyactionlab.org/blog/9-11-19/new-hub-data-randomized-evaluations}} +for those interested in impact evaluation. + +There will almost always be a trade-off between accuracy and privacy. 
+For publicly disclosed data, you should favor privacy. +Therefore, before publishing data, +you should carefully perform a \textbf{final de-identification}. +Its objective is to create a dataset for publication +that cannot be manipulated or linked to identify any individual research participant. +If you are following the steps outlined in this book, +you have already removed any direct identifiers after collecting the data. +At this stage, however, you should further remove +all indirect identifiers, and assess the risk of statistical disclosure.\sidenote{ + \textbf{Disclosure risk:} the likelihood that a released data record can be associated with an individual or organization.}\index{statistical disclosure} +To the extent required to ensure reasonable privacy, +potentially identifying variables must be further masked or removed. + +There are a number of tools developed to help researchers de-identify data +and which you should use as appropriate at that stage of data collection. +These include \texttt{PII\_detection}\sidenote{ + \url{https://github.com/PovertyAction/PII\_detection}} +from IPA, +\texttt{PII-scan}\sidenote{ + \url{https://github.com/J-PAL/PII-Scan}} +from JPAL, +and \texttt{sdcMicro}\sidenote{ + \url{https://sdcpractice.readthedocs.io/en/latest/sdcMicro.html}} +from the World Bank. +\index{anonymization} +The \texttt{sdcMicro} tool, in particular, has a feature +that allows you to assess the uniqueness of your data observations, +and simple measures of the identifiability of records from that. +Additional options to protect privacy in data that will become public exist, +and you should expect and intend to release your datasets at some point. +One option is to add noise to data, as the US Census Bureau has proposed,\cite{abowd2018us} +as it makes the trade-off between data accuracy and privacy explicit. +But there are no established norms for such ``differential privacy'' approaches: +most approaches fundamentally rely on judging ``how harmful'' information disclosure would be. +The fact remains that there is always a balance between information release (and therefore transparency) +and privacy protection, and that you should engage with it actively and explicitly. +The best thing you can do is make a complete record of the steps that have been taken +so that the process can be reviewed, revised, and updated as necessary. + +%---------------------------------------------------- + \section{Preparing a complete replication package} While we have focused so far on the preparation of written materials for publication, @@ -295,12 +361,6 @@ \subsection{Publishing data for replication} even if it is just the derived indicators you constructed. If you have questions about your rights over original or derived materials, check with the legal team at your organization or at the data provider's. -You should only directly publish data which is fully de-identified -and, to the extent required to ensure reasonable privacy, -potentially identifying characteristics are further masked or removed. -In all other cases, you should contact an appropriate data catalog -to determine what privacy and licensing options are available. - Make sure you have a clear understanding of the rights associated with the data release and communicate them to any future users of the data. 
You must provide a license with any data release.\sidenote{ @@ -331,6 +391,16 @@ \subsection{Publishing data for replication} particularly where definitions may vary, so that others can learn from your work and adapt it as they like. +As in the case of raw primary data, +final analysis data sets that will become public for the purpose of replication +must also be fully de-identified. +In cases where PII data is required for analysis, +we recommend embargoing the sensitive variables when publishing the data. +You should contact an appropriate data catalog +to determine what privacy and licensing options are available. +Access to the embargoed data could be granted for the purposes of study replication, +if approved by an IRB. + \subsection{Publishing code for replication} Before publishing your code, you should edit it for content and clarity From 15b7ec68d08df53ded972c6db3cdbad4096a47fc Mon Sep 17 00:00:00 2001 From: Luiza Date: Mon, 10 Feb 2020 20:21:05 -0500 Subject: [PATCH 622/854] [ch7] minor languase adjustments --- chapters/publication.tex | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 9f8cf4c81..3ebd14729 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -58,8 +58,8 @@ \subsection{Dynamic documents} This means that, whenever outputs are updated, the next time the document is loaded or compiled, it will automatically include all changes made to all outputs without any additional intervention from the user. -This means that updates will never be accidentally excluded, -and it further means that updating results will not become more difficult +This way, updates will never be accidentally excluded, +and updating results will not become more difficult as the number of inputs grows, because they are all managed by a single integrated process. @@ -100,7 +100,7 @@ \subsection{Dynamic documents} \index{\LaTeX} Rather than using a coding language that is built for another purpose or trying to hide the code entirely, -\LaTeX\ is a special code language designed for document preparation and typesetting. +\LaTeX\ is a document preparation and typesetting system with a unique syntax. While this tool has a significant learning curve, its enormous flexibility in terms of operation, collaboration, and output formatting and styling From 7dfe953ed5dc56c6adce37ffc7aa24c7d2f263e5 Mon Sep 17 00:00:00 2001 From: Luiza Date: Mon, 10 Feb 2020 20:24:17 -0500 Subject: [PATCH 623/854] resolves #349 --- chapters/publication.tex | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index cc1bd7a6c..7765d95fb 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -245,9 +245,9 @@ \subsection{Getting started with \LaTeX\ in the cloud} \section{Publishing primary data} If your project collected primary data, -releasing the raw dataset is a significant contribution that can be made +releasing the cleaned dataset is a significant contribution that can be made in addition to any publication of analysis results. -Publishing raw data can foster collaboration with researchers +Publishing data can foster collaboration with researchers interested in the same subjects as your team. 
Collaboration can enable your team to fully explore variables and questions that you may not have time to focus on otherwise, @@ -257,7 +257,11 @@ \section{Publishing primary data} \url{https://data.worldbank.org/}} includes a Microdata Catalog\sidenote{ \url{https://microdata.worldbank.org/index.php/home}} -where researchers can publish data and documentation for their projects. +where researchers can publish data and documentation for their projects.\sidenote{ +\url{https://dimewiki.worldbank.org/wiki/Microdata\_Catalog} +\newline +\url{https://dimewiki.worldbank.org/wiki/Checklist:\_Microdata\_Catalog\_submission} +} The Harvard Dataverse\sidenote{ \url{https://dataverse.harvard.edu}} publishes both data and code, From a3e839914ac4beb5ab4586cd3fd6b4df0482281a Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 11 Feb 2020 00:02:17 -0500 Subject: [PATCH 624/854] [ch6] moved final deidentification content resolved #327 --- chapters/data-analysis.tex | 54 +++++++++++++++++++++++++++++++------- 1 file changed, 44 insertions(+), 10 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index d45c579f2..fde35857b 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -145,36 +145,70 @@ \subsection{De-identification} It should contain only materials that are received directly from the field. They will invariably come in a host of file formats and nearly always contain personally-identifying information.\index{personally-identifying information} These files should be retained in the raw data folder \textit{exactly as they were received}. -Be mindful of where this file is stored. +Be mindful of where they are stored. Maintain a backup copy in a secure offsite location. Every other file is created from the raw data, and therefore can be recreated. -The exception, of course, is the raw data itself, so it should never be edited -directly. +The exception, of course, is the raw data itself, so it should never be edited directly. The rare and only case when the raw data can be edited directly is when it is encoded incorrectly and some non-English character is causing rows or columns to break at the wrong place when the data is imported. In this scenario, you will have to remove the special character manually, save the resulting data set \textit{in a new file} and securely back up \textit{both} the broken and the fixed version of the raw data. -Note that no one who is not listed in the IRB should be able to access its content, not even the company providing file-sharing services. +Note that no one who is not listed in the IRB should be able to access confidential data, +not even the company providing file-sharing services. Check if your organization has guidelines on how to store data securely, as they may offer an institutional solution. If that is not the case, you will need to encrypt the data, especially before sharing it, and make sure that only IRB-listed team members have the encryption key.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Encryption}}. - Secure storage of the raw data means access to it will be restricted even inside the research team.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Security}} + Loading encrypted data frequently can be disruptive to the workflow. To facilitate the handling of the data, remove any personally identifiable information from the data set. This will create a de-identified data set, that can be saved in a non-encrypted folder. 
 De-identification,\sidenote{\url{https://dimewiki.worldbank.org/wiki/De-identification}}
-at this stage, means stripping the data set of direct identifiers such as names, phone numbers, addresses, and geolocations.\sidenote{\url{ https://www.povertyactionlab.org/sites/default/files/resources/J-PAL-guide-to-deidentifying-data.pdf}}
+at this stage, means stripping the data set of direct identifiers.\sidenote{\url{ https://www.povertyactionlab.org/sites/default/files/resources/J-PAL-guide-to-deidentifying-data.pdf}}
+To be able to do so, you will need to go through your data set and
+find all the variables that contain identifying information.
+Flagging all potentially identifying variables in the questionnaire design stage
+simplifies the initial de-identification process.
+If you haven't done that, there are a few tools that can help you with it.
+JPAL's \texttt{PII scan}, as indicated by its name,
+scans variable names and labels for common string patterns associated with identifying information.\sidenote{
+ \url{https://github.com/J-PAL/PII-Scan}}
+The World Bank's \texttt{sdcMicro}
+lists variables that uniquely identify observations,
+as well as allowing for more sophisticated disclosure risk calculations.\sidenote{
+ \url{http://sdctools.github.io/sdcMicro/articles/sdcMicro.html}}
+\texttt{iefielkit}'s \texttt{iecodebook}
+function lists all variables in a data set and exports an Excel sheet
+where you can easily select which variables to keep or drop.\sidenote{
+ \url{https://dimewiki.worldbank.org/wiki/Iecodebook}}
+
+Once you have a list of variables that contain PII,
+assess them against the pre-analysis plan and ask:
+will this variable be needed for analysis?
+If not, the variable should be dropped.
+Examples include respondent names, enumerator names, interview date, respondent phone number.
+If the variable is needed for analysis, ask:
+can I encode or otherwise construct a variable to use for the analysis that masks the PII,
+and drop the original variable?
+This is typically the case for most identifying information.
+Examples include geocoordinates
+(after constructing measures of distance or area,
+drop the specific location),
+and names for social network analysis (can be encoded to unique numeric IDs).
+If PII variables are strictly required for the analysis itself,
+it will be necessary to keep at least a subset of the data encrypted through the data analysis process.
+If the answer is yes to either of these questions,
+all you need to do is write a script to drop the variables that are not required for analysis,
+encode or otherwise mask those that are required,
+and save a working version of the data.
+
 The resulting de-identified data will be the underlying source for all cleaned and constructed data.
+This is the data set that you will interact with directly during the remaining tasks described in this chapter.
 Because identifying information is typically only used during data collection,
 to find and confirm the identity of interviewees,
 de-identification should not affect the usability of the data.
-In fact, most identifying information can be converted into non-identified variables for analysis purposes
-(e.g. GPS coordinates can be translated into distances).
-However, if sensitive information is strictly needed for analysis,
-the data must be encrypted while performing the tasks described in this chapter.
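For intuition about what the name scanning and uniqueness checks described above involve, here is a minimal sketch in Python (pandas). It does not reproduce the interfaces of \texttt{PII-Scan}, \texttt{sdcMicro}, or \texttt{iecodebook}; the file name, search pattern, and quasi-identifier list are all assumptions for illustration.

\begin{verbatim}
# Illustrative sketch only: flag likely identifying variables by name,
# and count records that are unique on a set of quasi-identifiers.
import re
import pandas as pd

df = pd.read_stata("candidate_deidentified.dta")  # hypothetical file

# Rough name-based scan, similar in spirit to a PII scanner.
pattern = re.compile(r"name|phone|address|gps|lat|lon|email|birth", re.I)
flagged = [col for col in df.columns if pattern.search(col)]
print("Variables to review:", flagged)

# Records that are unique on a combination of quasi-identifiers are the
# easiest to re-identify, so their count is a rough risk indicator.
quasi_ids = ["district", "occupation", "age", "household_size"]
sizes = df.groupby(quasi_ids).size()
print((sizes == 1).sum(), "records are unique on", quasi_ids)
\end{verbatim}

Anything flagged by the scan, or contributing to many unique combinations, is a candidate to be dropped, coarsened, or masked as described above.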
\subsection{Correction of data entry errors}

From 9c9104686554fcd20845c4a1de56e7e2dc66bba1 Mon Sep 17 00:00:00 2001
From: Luiza
Date: Tue, 11 Feb 2020 00:10:23 -0500
Subject: [PATCH 625/854] [ch6] crossed content on chapters 5 and 6

---
 chapters/data-analysis.tex | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex
index fde35857b..41fbbfdb1 100644
--- a/chapters/data-analysis.tex
+++ b/chapters/data-analysis.tex
@@ -227,9 +227,13 @@ \subsection{Correction of data entry errors}
 create an automated workflow to
 identify, correct and document occurrences of duplicate entries.
 
-Looking for duplicated entries is usually part of data quality monitoring,
-as is the only other reason to change the raw data during cleaning:
-correcting mistakes in data entry.
+As discussed in the previous chapter,
+looking for duplicated entries is usually part of data quality monitoring,
+and is typically addressed as part of that process.
+So, in practice, you will start writing data cleaning code during data collection.
+The only other case when changes to the raw data are made during cleaning
+is also directly connected to data quality monitoring:
+it's when you need to correct mistakes in data entry.
 During data quality monitoring, you will inevitably encounter data entry mistakes,
 such as typos and inconsistent values.
 These mistakes should be fixed in the cleaned data set,

From 3999aa027501eacc778aeac16d7c881de599e5e1 Mon Sep 17 00:00:00 2001
From: Benjamin Daniels
Date: Tue, 11 Feb 2020 10:38:32 -0500
Subject: [PATCH 626/854] Apply suggestions from code review

Very minor things

---
 chapters/data-analysis.tex | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex
index 41fbbfdb1..2dac73707 100644
--- a/chapters/data-analysis.tex
+++ b/chapters/data-analysis.tex
@@ -179,13 +179,13 @@ \subsection{De-identification}
 lists variables that uniquely identify observations,
 as well as allowing for more sophisticated disclosure risk calculations.\sidenote{
 	\url{http://sdctools.github.io/sdcMicro/articles/sdcMicro.html}}
-\texttt{iefielkit}'s \texttt{iecodebook}
-function lists all variables in a data set and exports an Excel sheet
+The \texttt{iefieldkit} command \texttt{iecodebook}
+lists all variables in a data set and exports an Excel sheet
 where you can easily select which variables to keep or drop.\sidenote{
 	\url{https://dimewiki.worldbank.org/wiki/Iecodebook}}
 
 Once you have a list of variables that contain PII,
-assess them against the pre-analysis plan and ask:
+assess them against the analysis plan and ask:
 will this variable be needed for analysis?
 If not, the variable should be dropped.
 Examples include respondent names, enumerator names, interview date, respondent phone number.

From 3614f6d6828e97c67ece33e4213001fddfb861ba Mon Sep 17 00:00:00 2001
From: Maria
Date: Tue, 11 Feb 2020 11:17:36 -0500
Subject: [PATCH 627/854] Revert "Revert "Ch2 intro""

This reverts commit d60a41c45a52012b4c668b22ab427f6e1bf0a9e8.
--- chapters/research-design.tex | 29 ++++++++++++++++------------- 1 file changed, 16 insertions(+), 13 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 329eef0f6..34d2a7757 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -3,23 +3,20 @@ \begin{fullwidth} Research design is the process of defining the methods and data that will be used to answer a specific research question. -You don't need to be an expert in this, -and there are lots of good resources out there -that focus on designing interventions and evaluations -as well as on econometric approaches. -Therefore, without going into technical detail, -this section will present a brief overview -of the most common methods that are used in development research, -particularly those that are widespread in program evaluation. -These ``causal inference'' methods will turn up in nearly every project, -so you will need to have a broad knowledge of how the methods in your project -are used in order to manage data and code appropriately. +You don't need to be an expert in research design to do effective data work, +but it is essential that you understand the design of the study you are working on, +and how the design affects data work. +Without going into too much technical detail, +as there are many excellent resources on impact evaluation design, +this section presents a brief overview +of the most common ``causal inference'' methods, +focusing on implications for data structure and analysis. The intent of this chapter is for you to obtain an understanding of the way in which each method constructs treatment and control groups, the data structures needed to estimate the corresponding effects, -and some available code tools designed for each method (the list, of course, is not exhaustive). +and specific code tools designed for each method (the list, of course, is not exhaustive). -Thinking through your design before starting data work is important for several reasons. +Thinking through research design before starting data work is important for several reasons. If you do not know how to calculate the correct estimator for your study, you will not be able to assess the statistical power of your research design. You will also be unable to make decisions in the field @@ -36,6 +33,12 @@ in response to an unexpected event. Intuitive knowledge of your project's chosen approach will make you much more effective at the analytical part of your work. + +This chapter first covers causal inference methods. +Next we discuss how to measure treatment effects and structure data for specific methods, +including: cross-sectional randomized control trials, difference-in-difference designs, +regression discontinuity, instrumental variables, matching, and synthetic controls. + \end{fullwidth} %----------------------------------------------------------------------------------------------- From 4205fcedaefb0e8bf4147c57c70da25a41d51bec Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 11 Feb 2020 11:47:26 -0500 Subject: [PATCH 628/854] [ch6] intro reivew The "though" in the next paragraph is not in opposition to anything otherwise. 
--- chapters/data-analysis.tex | 1 + 1 file changed, 1 insertion(+) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index d45c579f2..ba26ef960 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -11,6 +11,7 @@ It is their job to translate the data received from the field into economically meaningful indicators and to analyze them while making sure that code and outputs do not become too difficult to follow or get lost over time. +This can be a complex process. When it comes to code, though, analysis is the easy part, \textit{as long as you have organized your data well}. From 889be12a979df3d83af993fbde4b9887bf138a2f Mon Sep 17 00:00:00 2001 From: Maria Date: Tue, 11 Feb 2020 11:51:28 -0500 Subject: [PATCH 629/854] Ch4 intro --- chapters/sampling-randomization-power.tex | 53 ++++++++++++----------- 1 file changed, 27 insertions(+), 26 deletions(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 47590d7a3..d9387ba15 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -1,27 +1,20 @@ %----------------------------------------------------------------------------------------------- \begin{fullwidth} -Sampling and randomization are two core elements of research design. +Sampling and randomized assignment are two core elements of research design. In experimental methods, sampling and randomization directly determine the set of individuals who are going to be observed and what their status will be for the purpose of effect estimation. Since we only get one chance to implement a given experiment, we need to have a detailed understanding of how these processes work and how to implement them properly. -This allows us to ensure the field reality corresponds well to our experimental design. -In quasi-experimental methods, -sampling determines what populations the study +In quasi-experimental methods, sampling determines what populations the study will be able to make meaningful inferences about, -and randomization analyses simulate counterfactual possibilities -if the events being studied had happened differently. -These analytical dimensions are particularly important in the initial phases of development research -- -typically conducted well before any actual fieldwork occurs -- -and often have implications for feasibility, planning, and budgeting. +and randomization analyses simulate counterfactual possibilities; +what would have happened in the absence of the event. +Demonstrating that sampling and randomization were taken into consideration +before going to field lends credibility to any research study. -Power calculations and randomization inference methods -give us the tools to critically and quantitatively assess different -sampling and randomization designs in light of our theories of change -and to make optimal choices when planning studies. All random processes introduce statistical noise or uncertainty into the final estimates of effect sizes. Choosing one sample from all the possibilities produces some probability of @@ -30,19 +23,20 @@ creating groups that are not good counterfactuals for each other. Power calculation and randomization inference are the main methods by which these probabilities of error are assessed. -Good experimental design has high \textbf{power} -- a low likelihood that statistical noise -will substantially affect estimates of treatment effects. 
+These analytical dimensions are particularly important in the initial phases of development research -- +typically conducted well before any actual fieldwork occurs -- +and often have implications for feasibility, planning, and budgeting. + +In this chapter, we first cover the necessary practices to ensure that random processes are reproducible. +We next turn to how to implement sampling and randomized assignment, +both for simple, uniform probability cases, and more complex designs, +such as those that require clustering or stratification. +We include code examples so the guidance is concrete and applicable. +The last section discusses power calculations and randomization inference, +and how both are important tools to critically and quantitatively assess different +sampling and randomization designs and to make optimal choices when planning studies. + -Not all studies are capable of achieving traditionally high power: -sufficiently precise sampling or treatment assignments may not be available. -This may be especially true for novel or small-scale studies -- -things that have never been tried before may be hard to fund or execute at scale. -What is important is that every study includes reasonable estimates of its power, -so that the evidentiary value of its results can be assessed. -Demonstrating that sampling and randomization were taken into consideration -before going to field lends credibility to any research study. -Using these tools to design the best experiments possible -maximizes the likelihood that reported estimates are accurate. \end{fullwidth} %----------------------------------------------------------------------------------------------- @@ -381,7 +375,7 @@ \section{Power calculation and randomization inference} you can find some possible subsets that have higher-than-average values of some measure; similarly, you can find some that have lower-than-average values. Your sample or randomization will inevitably fall in one of these categories, -and we need to assess the likelihood and magnitude of this occurence.\sidenote{ +and we need to assess the likelihood and magnitude of this occurrence.\sidenote{ \url{https://davegiles.blogspot.com/2019/04/what-is-permutation-test.html}} Power calculation and randomization inference are the two key tools to doing so. @@ -454,6 +448,13 @@ \subsection{Power calculations} simulation ensures you will have understood the key questions well enough to report standard measures of power once your design is decided. +Not all studies are capable of achieving traditionally high power: +sufficiently precise sampling or treatment assignments may not be available. +This may be especially true for novel or small-scale studies -- +things that have never been tried before may be hard to fund or execute at scale. +What is important is that every study includes reasonable estimates of its power, +so that the evidentiary value of its results can be assessed. 
+ \subsection{Randomization inference} Randomization inference is used to analyze the likelihood From e22cf71d823ef725caeaf4ed23b3891a0fbd3653 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 11 Feb 2020 11:56:27 -0500 Subject: [PATCH 630/854] Restart sidenote numbering within each chapter --- chapters/preamble.tex | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/chapters/preamble.tex b/chapters/preamble.tex index 5ba9585d8..71c550d2e 100644 --- a/chapters/preamble.tex +++ b/chapters/preamble.tex @@ -117,6 +117,16 @@ %---------------------------------------------------------------------------------------- +% Reset the sidenote number each chapter +\let\oldchapter\chapter +\def\chapter{% + \setcounter{footnote}{0}% + \oldchapter +} + +%---------------------------------------------------------------------------------------- + + \begin{document} \frontmatter From 527c3e4189c96792a45369badd23d9ea0428492f Mon Sep 17 00:00:00 2001 From: Maria Date: Tue, 11 Feb 2020 12:02:49 -0500 Subject: [PATCH 631/854] Ch 6 intro I did the most substantial re-write here. i took out all language about RAs, as i don't think we want to limit our audience that much. --- chapters/data-analysis.tex | 49 +++++++++++++++++++++++++------------- 1 file changed, 33 insertions(+), 16 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index d45c579f2..dc4a851a2 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -4,23 +4,39 @@ Transforming raw data into a substantial contribution to scientific knowledge requires a mix of subject expertise, programming skills, and statistical and econometric knowledge. -The process of data analysis is, therefore, +The process of data analysis is typically a back-and-forth discussion between people with differing skill sets. -The research assistant usually ends up being the pivot of this discussion. -It is their job to translate the data received from the field into -economically meaningful indicators and to analyze them -while making sure that code and outputs do not become too difficult to follow or get lost over time. - -When it comes to code, though, analysis is the easy part, -\textit{as long as you have organized your data well}. -Of course, there is plenty of complexity behind it: -the econometrics, the theory of change, the measurement methods, and so much more. -But none of those are the subject of this book. -\textit{Instead, this chapter will focus on how to organize your data work so that coding the analysis becomes easy}. -Most of a Research Assistant's time is spent cleaning data and getting it into the right format. -When the practices recommended here are adopted, -analyzing the data is as simple as using a command that is already implemented in a statistical software. +An essential part of the process is translating the +raw data received from the field into economically meaningful indicators. +To effectively do this in a team environment, +data, code and outputs must be well-organized, +with a clear system for version control, +and analytical scripts structured such that any member of the research team can run them. +Putting in time upfront to structure data work well +pays off substantial dividends throughout the process. + +In this chapter, we first cover data management: +how to organize your data work at the start of a project +so that coding the analysis itself is straightforward. 
+This includes setting up folders, organizing tasks, master scripts, +and putting in place a version control system +so that your work is easy for all research team members to follow, +and meets standards for transparency and reproducibility. +Second, we turn to de-identification, +a critical step when working with any personally-identified data. +In the third section, we offer detailed guidance on data cleaning, +from identifying duplicate entries to labeling and annotating raw data, +and how to transparently document the cleaning process. +Section four focuses on how to transform your clean data +into the actual indicators you will need for analysis, +again emphasizing the importance of transparent documentation. +Finally, we turn to analysis itself. +We do not offer instructions on how to conduct specific analyses, +as that is determined by research design; +rather, we discuss how to structure analysis code, +and how to automate common outputs so that your analysis is fully reproducible. + \end{fullwidth} @@ -126,7 +142,8 @@ \subsection{Version control} \section{Data cleaning} -Data cleaning is the first stage of transforming the data you received from the field into data that you can analyze.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning}} +Data cleaning is the first stage of transforming the data you received from the field +into data that you can analyze.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning}} The cleaning process involves (1) making the data set easily usable and understandable, and (2) documenting individual data points and patterns that may bias the analysis. The underlying data structure does not change. From 1501e5fe9f735e027892694db051eaa9847ca9fe Mon Sep 17 00:00:00 2001 From: Maria Date: Tue, 11 Feb 2020 12:07:23 -0500 Subject: [PATCH 632/854] Ch7 intro --- chapters/publication.tex | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 9f8cf4c81..e3cd8ba97 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -10,9 +10,9 @@ fussing with the technical requirements of publication. It is in nobody's interest for a skilled and busy researcher to spend days re-numbering references (and it can take days) -if a small amount of up-front effort could automate the task. +when a small amount of up-front effort can automate the task. In this section we suggest several methods -- -collectively refered to as ``dynamic documents'' -- +collectively referred to as ``dynamic documents'' -- for managing the process of collaboration on any technical product. For most research projects, completing a manuscript is not the end of the task. @@ -23,8 +23,11 @@ and better understand the results you have obtained. Holding code and data to the same standards a written work is a new practice for many researchers. -In this chapter, we provide guidelines that will help you -prepare a functioning and informative replication package. + +In this chapter, we first discuss tools and workflows for collaborating on technical writing. +Next, we turn to publishing data, +noting that the data can itself be a significant contribution in addition to analytical results. +Finally, we provide guidelines that will help you to prepare a functioning and informative replication package. 
In all cases, we note that technology is rapidly evolving and that the specific tools noted here may not remain cutting-edge, but the core principles involved in publication and transparency will endure. From 75b6fa2eb5c94acf4126d6238aed3dee368bd4b7 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Tue, 11 Feb 2020 12:26:00 -0500 Subject: [PATCH 633/854] [ch4 - code] space before comment after // --- code/replicability.do | 8 ++++---- code/simple-multi-arm-randomization.do | 10 +++++----- code/simple-sample.do | 4 ++-- 3 files changed, 11 insertions(+), 11 deletions(-) diff --git a/code/replicability.do b/code/replicability.do index bc52ab1bc..78d93b4ff 100644 --- a/code/replicability.do +++ b/code/replicability.do @@ -12,11 +12,11 @@ set seed 287608 * Demonstrate stability after VERSIONING, SORTING and SEEDING - gen check1 = rnormal() //Create random number - gen check2 = rnormal() //Create a second random number without resetting seed + gen check1 = rnormal() // Create random number + gen check2 = rnormal() // Create a second random number without resetting seed - set seed 287608 //Reset the seed - gen check3 = rnormal() //Create a third random number after resetting seed + set seed 287608 // Reset the seed + gen check3 = rnormal() // Create a third random number after resetting seed * Visualize randomization results. See how check1 and check3 are identical, * but check2 is random relative check1 and check3 diff --git a/code/simple-multi-arm-randomization.do b/code/simple-multi-arm-randomization.do index 65bf6520d..846b5bbb4 100644 --- a/code/simple-multi-arm-randomization.do +++ b/code/simple-multi-arm-randomization.do @@ -7,16 +7,16 @@ * Generate a random number and use it to sort the observation. Then * the order the observations are sorted in is random. - gen treatment_rand = rnormal() //Generate a random number - sort treatment_rand //Sort based on the random number + gen treatment_rand = rnormal() // Generate a random number + sort treatment_rand // Sort based on the random number * See simple-sample.do example for an explanation of "(_n <= _N * X)". The code * below randomly selects one third of the observations into group 0, one third into group 1 and * one third into group 2. Typically 0 represents the control group and 1 and * 2 represents two treatment arms - generate treatment = 0 //Set all observations to 0 - replace treatment = 1 if (_n <= _N * (2/3)) //Set only the first two thirds to 1 - replace treatment = 2 if (_n <= _N * (1/3)) //Set only the first third to 2 + generate treatment = 0 // Set all observations to 0 + replace treatment = 1 if (_n <= _N * (2/3)) // Set only the first two thirds to 1 + replace treatment = 2 if (_n <= _N * (1/3)) // Set only the first third to 2 * Restore the original sort order isid patient, sort diff --git a/code/simple-sample.do b/code/simple-sample.do index 067a86c6d..f5205a7a5 100644 --- a/code/simple-sample.do +++ b/code/simple-sample.do @@ -7,8 +7,8 @@ * Generate a random number and use it to sort the observation. Then * the order the observations are sorted in is random. - gen sample_rand = rnormal() //Generate a random number - sort sample_rand //Sort based on the random number + gen sample_rand = rnormal() // Generate a random number + sort sample_rand // Sort based on the random number * Use the sort order to sample 20% (0.20) of the observations. 
_N in * Stata is the number of observations in the active data set , and _n From 9cfbe439aaeef4235b3b7d7a540797da942959b1 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Tue, 11 Feb 2020 12:29:55 -0500 Subject: [PATCH 634/854] [ch7] remove index.php from link. This can get outdated --- chapters/publication.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 7765d95fb..a1de5972e 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -256,7 +256,7 @@ \section{Publishing primary data} The World Bank's Development Data Hub\sidenote{ \url{https://data.worldbank.org/}} includes a Microdata Catalog\sidenote{ -\url{https://microdata.worldbank.org/index.php/home}} +\url{https://microdata.worldbank.org}} where researchers can publish data and documentation for their projects.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Microdata\_Catalog} \newline From 4cbba2be81a144520522bb23d115580aa8603aa8 Mon Sep 17 00:00:00 2001 From: Maria Date: Tue, 11 Feb 2020 12:31:24 -0500 Subject: [PATCH 635/854] Ch5 re-write first pass on intro --- chapters/data-collection.tex | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 9c043febc..b64534bc4 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -1,19 +1,25 @@ %------------------------------------------------ \begin{fullwidth} -High quality research begins with a thoughtfully-designed, field-tested survey instrument, and a carefully supervised survey. Much of the recent push toward credibility in the social sciences has focused on analytical practices. -We contest that credible research depends, first and foremost, on the quality of the raw data. This chapter covers the data generation workflow, from questionnaire design to field monitoring, for electronic data collection. +We contest that credible research depends, first and foremost, on the quality of the raw data. + +This chapter covers the data acquisition + +We then dive specifically into survey data, providing guidance on the data generation workflow, +from questionnaire design to programming electronic survey instruments and monitoring data quality. +We conclude with a discussion of safe data handling, storage, and sharing. + There are many excellent resources on questionnaire design and field supervision, -but few covering the particularly challenges and opportunities presented by electronic surveys. +but few covering the particular challenges and opportunities presented by electronic surveys. As there are many survey software, and the market is rapidly evolving, we focus on workflows and primary concepts, rather than software-specific tools. -The chapter covers questionnaire design, piloting and programming; monitoring data quality during the survey; and how to ensure confidential data is handled securely from collection to storage and sharing. + \end{fullwidth} %------------------------------------------------ -\section{Collecting primary data with development partners} +\section{Acquiring data} Primary data is the key to most modern development research. 
Often, there is simply no source of reliable official statistics From 2ba22d0c90efa7c441d9b870f5a8174df9d639e4 Mon Sep 17 00:00:00 2001 From: Maria Date: Tue, 11 Feb 2020 12:33:03 -0500 Subject: [PATCH 636/854] Ch1 intro fixed line breaks --- chapters/handling-data.tex | 1 - 1 file changed, 1 deletion(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index b7aa5460a..75ebbc54e 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -16,7 +16,6 @@ Respecting the respondents' right to privacy, by intelligently assessing and proactively averting risks they might face, is a core tenet of research ethics. - On the consumer side, it is important to protect confidence in development research by following modern practices for \textbf{transparency} and \textbf{reproducibility}. Across the social sciences, the open science movement has been fueled by discoveries of low-quality research practices, From fac425f2aa7f4cdf8d23116168c862df5534c619 Mon Sep 17 00:00:00 2001 From: Maria Date: Tue, 11 Feb 2020 12:35:51 -0500 Subject: [PATCH 637/854] Ch2 intro fixing 'in this chapter' reference in first section --- chapters/planning-data-work.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 408743050..05371a7d3 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -58,7 +58,7 @@ \section{Preparing a collaborative work environment} Therefore, there are large efficiency gains over time to thinking in advance about the best way to do these tasks, instead of throwing together a solution when the task arises. -This chapter will outline the main points to discuss within the team, +This section will outline the main points to discuss within the team, and suggest some common solutions for these tasks. % ---------------------------------------------------------------------------------------------- From 9788db354f2369f4171e20131158c59ba02b6a15 Mon Sep 17 00:00:00 2001 From: Maria Date: Tue, 11 Feb 2020 12:43:26 -0500 Subject: [PATCH 638/854] Ch 2 small rewording --- chapters/planning-data-work.tex | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 05371a7d3..e23aebcb4 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -34,22 +34,21 @@ \section{Preparing a collaborative work environment} -Being comfortable using your computer and having the tools you need in reach is key. This section provides a brief introduction to core concepts and tools -that can help you handle the work you will be primarily responsible for. +that can help you to organize your data work in an efficient, collaborative and reproducible manner. Some of these skills may seem elementary, but thinking about simple things from a workflow perspective -can help you make marginal improvements every day you work -that add up to substantial gains over the course of many projects. +can help you make marginal improvements every day you work; +those add up to substantial gains over the course of multiple years and projects. Together, these processes should form a collaborative workflow that will greatly accelerate your team's ability to get tasks done on every project you take on together. Teams often develop their workflows over time, solving new challenges as they arise. -This is good. 
But it is important to recognize
-that there are a number of tasks that will exist for every project,
-and that their corresponding workflows can be agreed on in advance.
+Adaptation is good, of course. But it is important to recognize
+that there are a number of tasks that exist for every project,
+and it is more efficient to agree on the corresponding workflows in advance.
 These include documentation methods, software choices, naming schema,
 organizing folders and outputs, collaborating on code,
 managing revisions to files, and reviewing each other's work.

From 2991ad82a8ba8b0887b32423c9baf9c6093656c0 Mon Sep 17 00:00:00 2001
From: Luiza
Date: Tue, 11 Feb 2020 12:55:27 -0500
Subject: [PATCH 639/854] [ch6] resolves #347

---
 chapters/data-analysis.tex | 405 ++++++++++++++++++++++++------------
 1 file changed, 261 insertions(+), 144 deletions(-)

diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex
index ba26ef960..5d7acd2ba 100644
--- a/chapters/data-analysis.tex
+++ b/chapters/data-analysis.tex
@@ -29,6 +29,7 @@
 %------------------------------------------------
 
 \section{Data management}
+
 The goal of data management is to organize the components of data work
 so it can be traced back and revised without massive effort.
 In our experience, there are four key elements to good data management:
@@ -44,22 +45,27 @@ \section{Data management}
 how each edit affects other files in the project.
 
 \subsection{Folder structure}
-There are many schemes to organize research data.
-Our preferred scheme reflects the task breakdown just discussed.
+
+There are many ways to organize research data.
+Our preferred scheme reflects the task breakdown that will be outlined in this chapter.
 \index{data organization}
-DIME Analytics created the \texttt{iefolder}\sidenote{
+Our team at DIME Analytics developed the \texttt{iefolder}\sidenote{
 \url{https://dimewiki.worldbank.org/wiki/iefolder}}
 package (part of \texttt{ietoolkit}\sidenote{
 \url{https://dimewiki.worldbank.org/wiki/ietoolkit}})
+to automate the creation of a folder following this scheme
 and to standardize folder structures across teams and projects.
-This means that PIs and RAs face very small costs when switching between projects,
-because they are organized in the same way.\sidenote{\url{https://dimewiki.worldbank.org/wiki/DataWork\_Folder}}
+Standardizing folders greatly reduces the costs that PIs and RAs face when switching between projects,
+because they are organized in the same way.\sidenote{
+ \url{https://dimewiki.worldbank.org/wiki/DataWork\_Folder}}
 We created the command based on our experience with primary data,
-but it can be used for different types of data.
-Whatever you team may need in terms of organization,
+but it can be used for different types of data,
+and adapted to fit different needs.
+No matter what your team's preferences are in terms of folder organization,
 the principle of creating one standard remains.
 
-At the first level of the structure created by \texttt{iefolder} are what we call survey round folders.\sidenote{\url{https://dimewiki.worldbank.org/wiki/DataWork\_Survey\_Round}}
+At the first level of the structure created by \texttt{iefolder} are what we call survey round folders.\sidenote{
+ \url{https://dimewiki.worldbank.org/wiki/DataWork\_Survey\_Round}}
 You can think of a ``round'' as one source of data that will be cleaned in the same script.
Inside round folders, there are dedicated folders for @@ -67,16 +73,19 @@ \subsection{Folder structure} There is a folder for raw results, as well as for final outputs. The folders that hold code are organized in parallel to these, so that the progression through the whole project can be followed by anyone new to the team. -Additionally, \texttt{iefolder} creates \textbf{master do-files}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Master\_Do-files}} +Additionally, \texttt{iefolder} creates \textbf{master do-files}\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Master\_Do-files}} so all project code is reflected in a top-level script. \subsection{Task breakdown} -We divide the process of turning raw data into analysis data into three stages: + +We divide the data work process that starts from the raw data +and builds on it to create final analysis outputs into three stages: data cleaning, variable construction, and data analysis. Though they are frequently implemented at the same time, we find that creating separate scripts and data sets prevents mistakes. It will be easier to understand this division as we discuss what each stage comprises. -What you should know by now is that each of these stages has well-defined inputs and outputs. +What you should know for now is that each of these stages has well-defined inputs and outputs. This makes it easier to track tasks across scripts, and avoids duplication of code that could lead to inconsistent results. For each stage, there should be a code folder and a corresponding data set. @@ -93,26 +102,30 @@ \subsection{Task breakdown} It helps to keep the level of the outputs high, and is also a great way to learn and improve your code. \subsection{Master scripts} + Master scripts allow users to execute all the project code from a single file. -They briefly describes what each code, -and maps the files they require and create. -They also connects code and folder structure through globals or objects. +They briefly describe what each code does, +and map the files they require and create. +They also connect code and folder structure through macros or objects. In short, a master script is a human-readable map to the tasks, files and folder structure that comprise a project. Having a master script eliminates the need for complex instructions to replicate results. -Reading the master do-file should be enough for anyone unfamiliar with the project +Reading it should be enough for anyone unfamiliar with the project to understand what are the main tasks, which scripts execute them, and where different files can be found in the project folder. That is, it should contain all the information needed to interact with a project's data work. \subsection{Version control} -Finally, everything that can be version-controlled should be. + +Finally, establishing a version control system is an incredibly useful +and important step for documentation, collaboration and conflict-solving. Version control allows you to effectively track code edits, including the addition and deletion of files. This way you can delete code you no longer need, and still recover it easily if you ever need to get back previous work. +Everything that can be version-controlled should be. Both analysis results and data sets will change with the code. -You should have each of them stored with the code that created it. +Whenever possible, you should track have each of them with the code that created it. 
If you are writing code in Git/GitHub, you can output plain text files such as \texttt{.tex} tables and metadata saved in \texttt{.txt} or \texttt{.csv} to that directory. @@ -125,20 +138,23 @@ \subsection{Version control} %------------------------------------------------ -\section{Data cleaning} +\section{Cleaning data} -Data cleaning is the first stage of transforming the data you received from the field into data that you can analyze.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning}} +Data cleaning is the first stage of transforming the data you received from the field into data that you can analyze.\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning}} The cleaning process involves (1) making the data set easily usable and understandable, and (2) documenting individual data points and patterns that may bias the analysis. The underlying data structure does not change. The cleaned data set should contain only the variables collected in the field. -No modifications to data points are made at this stage, except for corrections of mistaken entries. +No modifications to data points are made at this stage, +except for corrections of mistaken entries. Cleaning is probably the most time consuming of the stages discussed in this chapter. -This is the time when you obtain an extensive understanding of the contents and structure of the data that was collected. +This is the stage where you obtain an extensive understanding of the contents and structure of the data that was collected. Explore your data set using tabulations, summaries, and descriptive plots. -You should use this time to understand the types of responses collected, both within each survey question and across respondents. -Knowing your data set well will make it possible to do analysis. +You should use this time to understand the types of responses collected, +both within each survey question and across respondents. +Knowing your data set well is necessary to analyze it well. \subsection{De-identification} @@ -177,7 +193,7 @@ \subsection{De-identification} However, if sensitive information is strictly needed for analysis, the data must be encrypted while performing the tasks described in this chapter. -\subsection{Correction of data entry errors} +\subsection{Correcting data entry errors} There are two main cases when the raw data will be modified during data cleaning. The first one is when there are duplicated entries in the data. @@ -207,24 +223,31 @@ \subsection{Labeling and annotating the raw data} On average, making corrections to primary data is more time-consuming than when using secondary data. But you should always check for possible issues in any data you are about to use. -The last step of data cleaning, however, will most likely still be necessary. -It consists of labeling and annotating the data, so that its users have all the -information needed to interact with it. +The last step of data cleaning, however, +will most likely be necessary no matter what type of data is involved. +It consists of labeling and annotating the data, +so that its users have all the information needed to interact with it. This is a key step to making the data easy to use, but it can be quite repetitive. 
The \texttt{iecodebook} command suite, also part of \texttt{iefieldkit}, is designed to make some of the most tedious components of this process, -such as renaming, relabeling, and value labeling, much easier.\sidenote{\url{https://dimewiki.worldbank.org/wiki/iecodebook}} +such as renaming, relabeling, and value labeling, much easier.\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/iecodebook}} \index{iecodebook} -We have a few recommendations on how to use this command for data cleaning. -First, we suggest keeping the same variable names in the cleaned data set as in the survey instrument, so it's straightforward to link data points for a variable to the question that originated them. +We have a few recommendations on how to use this command, +and how to approach data cleaning in general. +First, we suggest keeping the same variable names in the cleaned data set as in the survey instrument, +so it's straightforward to link data points for a variable to the question that originated them. Second, don't skip the labeling. -Applying labels makes it easier to understand what the data is showing while exploring the data. -This minimizes the risk of small errors making their way through into the analysis stage. -Variable and value labels should be accurate and concise.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning\#Applying\_Labels}} +Applying labels makes it easier to understand what the data mean as you explore it, +and thus reduces the risk of small errors making their way through into the analysis stage. +Variable and value labels should be accurate and concise.\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning\#Applying\_Labels}} Third, recodes should be used to turn codes for ``Don't know'', ``Refused to answer'', and -other non-responses into extended missing values.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning\#Survey\_Codes\_and\_Missing\_Values}} -String variables need to be encoded, and open-ended responses, categorized or dropped\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning\#Strings}} -(unless you are using qualitative or classification analyses, which are less common). +other non-responses into extended missing values.\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning\#Survey\_Codes\_and\_Missing\_Values}} +String variables need to be encoded, and open-ended responses, categorized or dropped\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning\#Strings}} +(unless you are conducting qualitative or classification analyses). Finally, any additional information collected only for quality monitoring purposes, such as notes and duration fields, can also be dropped. @@ -233,7 +256,8 @@ \subsection{Documenting data cleaning} Throughout the data cleaning process, you will need inputs from the field, including enumerator manuals, survey instruments, supervisor notes, and data quality monitoring reports. 
-These materials are essential for data documentation.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Documentation}} +These materials are essential for data documentation.\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Data\_Documentation}} \index{Documentation} They should be stored in the corresponding ``Documentation'' folder for easy access, as you will probably need them during analysis, @@ -244,9 +268,9 @@ \subsection{Documenting data cleaning} Be very careful not to include sensitive information in documentation that is not securely stored, or that you intend to release as part of a replication package or data publication. -Another important component of data cleaning documentation are the results of +Another important component of data cleaning documentation are the results of data exploration. As clean your data set, take the time to explore the variables in it. -Use tabulations, histograms and density plots to understand the structure of data, +Use tabulations, summary statistics, histograms and density plots to understand the structure of data, and look for potentially problematic patterns such as outliers, missing values and distributions that may be caused by data entry errors. Don't spend time trying to correct data points that were not flagged during data quality monitoring. @@ -268,10 +292,11 @@ \subsection{The cleaned data set} you may want to break the data cleaning into sub-steps, and create intermediate cleaned data sets (for example, one per survey module). -Breaking cleaned data sets into the smallest unit of observation inside a roster -make the cleaning faster and the data easier to handle during construction. +When dealing with complex surveys with multiple nested groups, +is is also useful to have each cleaned data set at the smallest unit of observation inside a roster. +This will make the cleaning faster and the data easier to handle during construction. But having a single cleaned data set will help you with sharing and publishing the data. -To make sure this file doesn't get too big to be handled, +To make sure the cleaned data set file doesn't get too big to be handled, use commands such as \texttt{compress} in Stata to make sure the data is always stored in the most efficient format. Once you have a cleaned, de-identified data set, and documentation to support it, @@ -283,43 +308,57 @@ \subsection{The cleaned data set} This will help you organize your files and create a back up of the data, and some donors require that the data be filed as an intermediate step of the project. -\section{Indicator construction} +\section{Constructing final indicators} -% What is construction ------------------------------------- The second stage in the creation of analysis data is construction. -Constructing variables means processing the data points as provided in the raw data to make them suitable for analysis. +Constructing variables means processing the data points in the raw data to make them suitable for analysis. It is at this stage that the raw data is transformed into analysis data. This is done by creating derived variables (dummies, indices, and interactions, to name a few), -as planned during research design, and using the pre-analysis plan as a guide. +as planned during research design\index{Research design}, +and using the pre-analysis plan as a guide.\index{Pre-analysis plan} To understand why construction is necessary, let's take the example of a household survey's consumption module. 
-For each item in a context-specific bundle, it will ask whether the household consumed any of it over a certain period of time. +For each item in a context-specific bundle, +this module will ask whether the household consumed any of it over a certain period of time. If they did, it will then ask about quantities, units and expenditure for each item. -However, it is difficult to run a meaningful regression on the number of cups of milk and handfuls of beans that a household consumed over a week. +However, it is difficult to run a meaningful regression +on the number of cups of milk and handfuls of beans that a household consumed over a week. You need to manipulate them into something that has \textit{economic} meaning, such as caloric input or food expenditure per adult equivalent. During this process, the data points will typically be reshaped and aggregated -so that level of the data set goes from the unit of observation (one item in the bundle) in the survey to the unit of analysis (the household).\sidenote{\url{https://dimewiki.worldbank.org/wiki/Unit\_of\_Observation}} +so that level of the data set goes from the unit of observation +(one item in the bundle) in the survey to the unit of analysis (the household).\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Unit\_of\_Observation}} \subsection{Why construction?} % From cleaning Construction is done separately from data cleaning for two reasons. -The first one is to clearly differentiate the data originally collected from the result of data processing decisions. +The first one is to clearly differentiate the data originally collected +from the result of data processing decisions. The second is to ensure that variable definition is consistent across data sources. Unlike cleaning, construction can create many outputs from many inputs. Let's take the example of a project that has a baseline and an endline survey. -Unless the two instruments are exactly the same, which is preferable but often not the case, the data cleaning for them will require different steps, and therefore will be done separately. +Unless the two instruments are exactly the same, +which is preferable but often not the case, +the data cleaning for them will require different steps, +and therefore will be done separately. However, you still want the constructed variables to be calculated in the same way, so they are comparable. -To do this, you will at least two cleaning scripts, and a single one for construction -- +To do this, you will at least two cleaning scripts, +and a single one for construction -- we will discuss how to do this in practice in a bit. % From analysis -Ideally, indicator construction should be done right after data cleaning, according to the pre-analysis plan. +Ideally, indicator construction should be done right after data cleaning, +according to the pre-analysis plan.\index{Pre-analysis plan} In practice, however, following this principle is not always easy. -As you analyze the data, different constructed variables will become necessary, as well as subsets and other alterations to the data. -Still, constructing variables in a separate script from the analysis will help you ensure consistency across different outputs. -If every script that creates a table starts by loading a data set, subsetting it and manipulating variables, any edits to construction need to be replicated in all scripts. +As you analyze the data, different constructed variables will become necessary, +as well as subsets and other alterations to the data. 
+Still, constructing variables in a separate script from the analysis +will help you ensure consistency across different outputs. +If every script that creates a table starts by loading a data set, +subsetting it, and manipulating variables, +any edits to construction need to be replicated in all scripts. This increases the chances that at least one of them will have a different sample or variable definition. Therefore, even if construction ends up coming before analysis only in the order the code is run, it's important to think of them as different steps. @@ -327,42 +366,61 @@ \subsection{Why construction?} \subsection{Construction tasks and how to approach them} The first thing that comes to mind when we talk about variable construction is, of course, creating new variables. -Do this by adding new variables to the data set instead of overwriting the original information, and assign functional names to them. +Do this by adding new variables to the data set instead of overwriting the original information, +and assign functional names to them. During cleaning, you want to keep all variables consistent with the survey instrument. But constructed variables were not present in the survey to start with, so making their names consistent with the survey form is not as crucial. -Of course, whenever possible, having variable names that are both intuitive \textit{and} can be linked to the survey is ideal, but if you need to choose, prioritize functionality. -Ordering the data set so that related variables are together and adding notes to each of them as necessary will also make your data set more user-friendly. +Of course, whenever possible, having variable names that are both intuitive +\textit{and} can be linked to the survey is ideal, +but if you need to choose, prioritize functionality. +Ordering the data set so that related variables are together, +and adding notes to each of them as necessary will also make your data set more user-friendly. The most simple case of new variables to be created are aggregate indicators. -For example, you may want to add a household's income from different sources into a single total income variable, or create a dummy for having at least one child in school. +For example, you may want to add a household's income from different sources into a single total income variable, +or create a dummy for having at least one child in school. Jumping to the step where you actually create this variables seems intuitive, -but it can also cause you a lot of problems, as overlooking details may affect your results. -It is important to check and double-check the value-assignments of questions and their scales before constructing new variables based on them. +but it can also cause you a lot of problems, +as overlooking details may affect your results. +It is important to check and double-check the value-assignments of questions, +as well as their scales, before constructing new variables based on them. This is when you will use the knowledge of the data you acquired and the documentation you created during the cleaning step the most. It is often useful to start looking at comparisons and other documentation outside the code editor. -Make sure to standardize units and recode categorical variables so their values are consistent. +Make sure there is consistency across constructed variables. 
It's possible that your questionnaire asked respondents to report some answers as percentages and others as proportions, -or that in one variable 0 means ``no'' and 1 means ``yes'', while in another one the same answers were coded are 1 and 2. -We recommend coding yes/no questions as either 1/0 or TRUE/FALSE, so they can be used numerically as frequencies in means and as dummies in regressions. +or that in one variable 0 means ``no'' and 1 means ``yes'', +while in another one the same answers were coded are 1 and 2. +We recommend coding yes/no questions as either 1/0 or TRUE/FALSE, +so they can be used numerically as frequencies in means and as dummies in regressions. Check that non-binary categorical variables have the same value-assignment, i.e., that labels and levels have the same correspondence across variables that use the same options. -Finally, make sure that any numeric variables you are comparing are converted to the same scale or unit of measure. You cannot add one hectare and twos acres into a meaningful number. +Finally, make sure that any numeric variables you are comparing are converted to the same scale or unit of measure. +You cannot add one hectare and twos acres into a meaningful number. -During construction, you will also need to address some of the issues you identified in the data during data cleaning. +During construction, you will also need to address some of the issues +you identified in the data set as you were cleaning it. The most common of them is the presence of outliers. -How to treat outliers is a research question, but make sure to note what we the decision made by the research team, and how you came to it. -Results can be sensitive to the treatment of outliers, so keeping the original variable in the data set will allow you to test how much it affects the estimates. +How to treat outliers is a research question, +but make sure to note what was the decision made by the research team, +and how you came to it. +Results can be sensitive to the treatment of outliers, +so keeping the original variable in the data set will allow you to test how much it affects the estimates. All these points also apply to imputation of missing values and other distributional patterns. The more complex construction tasks involve changing the structure of the data: adding new observations or variables by merging data sets, and changing the unit of observation through collapses or reshapes. -There are always ways for things to go wrong that we never anticipated, but two issues to pay extra attention to are missing values and dropped observations. -Merging, reshaping and aggregating data sets can change both the total number of observations and the number of observations with missing values. -Make sure to read about how each command treats missing observations and, whenever possible, add automated checks in the script that throw an error message if the result is changing. -If you are subsetting your data, drop observations explicitly, indicating why you are doing that and how the data set changed. +There are always ways for things to go wrong that we never anticipated, +but two issues to pay extra attention to are missing values and dropped observations. +Merging, reshaping and aggregating data sets can change both the total number of observations +and the number of observations with missing values. +Make sure to read about how each command treats missing observations and, +whenever possible, add automated checks in the script that throw an error message if the result is changing. 
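For example, a simple automated check of this kind might look like the following sketch, assuming hypothetical file and variable names.

    * Sketch of automated checks around a merge.
    * Folder globals, file names, and variable names are hypothetical placeholders.

    * Record the number of observations before the merge
    quietly count
    local n_before = r(N)

    * Add household-level aggregates constructed from the plot roster
    merge 1:1 household_id using "${intermediate}/plot_aggregates.dta"

    * Throw an error if any observation failed to match,
    * or if the merge changed the number of observations
    assert _merge == 3
    drop _merge
    quietly count
    assert r(N) == `n_before'

If either \texttt{assert} fails, the script stops with an error, so a change in the structure of the data cannot go unnoticed.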
+If you are subsetting your data, +drop observations explicitly, +indicating why you are doing that and how the data set changed. Finally, primary panel data involves additional timing complexities. It is common to construct indicators soon after receiving data from a new survey round. @@ -373,33 +431,48 @@ \subsection{Construction tasks and how to approach them} Then the first thing you should do is create a panel data set -- \{texttt{iecodebook}'s \texttt{append} subcommand will help you reconcile and append survey rounds. After that, adapt the construction code so it can be used on the panel data set. -Apart from preventing inconsistencies, this process will also save you time and give you an opportunity to review your original code. +Apart from preventing inconsistencies, +this process will also save you time and give you an opportunity to review your original code. \subsection{Documenting indicators construction} -Because data construction involves translating concrete data points to more abstract measurements, it is important to document exactly how each variable is derived or calculated. +Because data construction involves translating concrete data points to more abstract measurements, +it is important to document exactly how each variable is derived or calculated. Adding comments to the code explaining what you are doing and why is a crucial step both to prevent mistakes and to guarantee transparency. -To make sure that these comments can be more easily navigated, it is wise to start writing a variable dictionary as soon as you begin making changes to the data. -Carefully record how specific variables have been combined, recoded, and scaled, and refer to those records in the code. -This can be part of a wider discussion with your team about creating protocols for variable definition, which will guarantee that indicators are defined consistently across projects. -When all your final variables have been created, you can use \texttt{iecodebook}'s \texttt{export} subcommand to list all variables in the data set, +To make sure that these comments can be more easily navigated, +it is wise to start writing a variable dictionary as soon as you begin making changes to the data. +Carefully record how specific variables have been combined, recoded, and scaled, +and refer to those records in the code. +This can be part of a wider discussion with your team about creating protocols for variable definition, +which will guarantee that indicators are defined consistently across projects. +When all your final variables have been created, +you can use \texttt{iecodebook}'s \texttt{export} subcommand to list all variables in the data set, and complement it with the variable definitions you wrote during construction to create a concise meta data document. Documentation is an output of construction as relevant as the code and the data. -Someone unfamiliar with the project should be able to understand the contents of the analysis data sets, the steps taken to create them, and the decision-making process through your documentation. +Someone unfamiliar with the project should be able to understand the contents of the analysis data sets, +the steps taken to create them, +and the decision-making process through your documentation. The construction documentation will complement the reports and notes created during data cleaning. Together, they will form a detailed account of the data processing. 
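One way to keep these definitions attached to the data itself is to record them as variable labels and notes at the moment each indicator is constructed, as in the sketch below; the variable, its definition, and the script name referenced are hypothetical examples.

    * Sketch: documenting a constructed indicator inside the data set.
    * The variable, its definition, and the script name are hypothetical.

    label variable food_exp_ae "Weekly food expenditure per adult equivalent"
    notes food_exp_ae: Sum of item-level expenditures in module C, ///
        divided by the household adult-equivalent scale, with the top ///
        1 percent of values winsorized. See construct-consumption.do.

These notes can then be combined with an exported list of all variables to produce the concise metadata document described above.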
\subsection{Constructed data sets} -The other set of construction outputs, as expected, consists of the data sets that will be used for analysis. +The other set of construction outputs, as expected, +consists of the data sets that will be used for analysis. A constructed data set is built to answer an analysis question. -Since different pieces of analysis may require different samples, or even different units of observation, -you may have one or multiple constructed data sets, depending on how your analysis is structured. +Since different pieces of analysis may require different samples, +or even different units of observation, +you may have one or multiple constructed data sets, +depending on how your analysis is structured. So don't worry if you cannot create a single, ``canonical'' analysis data set. It is common to have many purpose-built analysis datasets. -Think of an agricultural intervention that was randomized across villages and only affected certain plots within each village. -The research team may want to run household-level regressions on income, test for plot-level productivity gains, and check if village characteristics are balanced. -Having three separate datasets for each of these three pieces of analysis will result in much cleaner do files than if they all started from the same file. +Think of an agricultural intervention that was randomized across villages +and only affected certain plots within each village. +The research team may want to run household-level regressions on income, +test for plot-level productivity gains, +and check if village characteristics are balanced. +Having three separate datasets for each of these three pieces of analysis +will result in much cleaner do files than if they all started from the same file. %------------------------------------------------ @@ -422,40 +495,54 @@ \subsection{Organizing analysis code} The analysis stage usually starts with a process we call exploratory data analysis. This is when you are trying different things and looking for patterns in your data. -It progresses into final analysis when your team starts to decide what are the main results, those that will make it into the research output. -The way you deal with code and outputs for exploratory and final analysis is different, and this section will discuss how. -During exploratory data analysis, you will be tempted to write lots of analysis into one big, impressive, start-to-finish script. +It progresses into final analysis when your team starts to decide what are the main results, +those that will make it into the research output. +The way you deal with code and outputs for exploratory and final analysis is different. +During exploratory data analysis, +you will be tempted to write lots of analysis into one big, impressive, start-to-finish script. It subtly encourages poor practices such as not clearing the workspace and not reloading the constructed data set before each analysis task. -It's important to take the time to organize scripts in a clean manner and to avoid mistakes. - -A well-organized analysis script starts with a completely fresh workspace and explicitly loads data before analyzing it. -This encourages data manipulation to be done earlier in the workflow (that is, during construction). -It also and prevents you from accidentally writing pieces of analysis code that depend on one another and requires manual instructions for all required code snippets be run in the right order. -Each script should run completely independently of all other code. 
+To avoid mistakes, it's important to take the time +to organize the code that you want to use again in a clean manner. + +A well-organized analysis script starts with a completely fresh workspace +and explicitly loads data before analyzing it. +This setup encourages data manipulation to be done earlier in the workflow +(that is, during construction). +It also prevents you from accidentally writing pieces of analysis code that depend on one another +and require manual instructions for all necessary chunks of code to be run in the right order. +Each script should run completely independently of all other code, +except for the master script. You can go as far as coding every output in a separate script. -There is nothing wrong with code files being short and simple -- as long as they directly correspond to specific pieces of analysis. -Analysis files should be as simple as possible, so whoever is reading it can focus on the econometrics. -All research decisions should be made very explicit in the code. +There is nothing wrong with code files being short and simple. +In fact, analysis scripts should be as simple as possible, +so whoever is reading them can focus on the econometrics, not the coding. +All research decisions should be very explicit in the code. This includes clustering, sampling, and control variables, to name a few. -If you have multiple analysis data sets, each of them should have a descriptive name about its sample and unit of observation. -As your team comes to a decision about model specification, you can create globals or objects in the master script to use across scripts. +If you have multiple analysis data sets, +each of them should have a descriptive name about its sample and unit of observation. +As your team comes to a decision about model specification, +you can create globals or objects in the master script to use across scripts. This is a good way to make sure specifications are consistent throughout the analysis. -Using pre-specified globals or objects also makes your code more dynamic, so it is easy to update specifications and results without changing every script. -It is completely acceptable to have folders for each task, and compartmentalize each analysis as much as needed. - -\subsection{Exporting outputs} - -To accomplish this, you will need to make sure that you have an effective data management system, including naming, file organization, and version control. -Just like you did with each of the analysis datasets, name each of the individual analysis files descriptively. -Code files such as \path{spatial-diff-in-diff.do}, \path{matching-villages.R}, and \path{summary-statistics.py} +Using pre-specified globals or objects also makes your code more dynamic, +so it is easy to update specifications and results without changing every script. +It is completely acceptable to have folders for each task, +and compartmentalize each analysis as much as needed. + +To accomplish this, you will need to make sure that you have an effective data management system, +including naming, file organization, and version control. +Just like you did with each of the analysis datasets, +name each of the individual analysis files descriptively. +Code files such as \path{spatial-diff-in-diff.do}, +\path{matching-villages.R}, and \path{summary-statistics.py} are clear indicators of what each file is doing, and allow you to find code quickly. If you intend to numerically order the code as they appear in a paper or report, leave this to near publication time.
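A minimal sketch of this approach (the globals and variables below are hypothetical, not a prescribed specification) is:
\begin{verbatim}
* In the master script: define the agreed specification once
global controls   "age education hhsize"
global cluster_by "village_id"

* In each analysis script: start fresh, load the data, reuse the globals
clear all
use "analysis_household.dta", clear
regress income treatment ${controls}, vce(cluster ${cluster_by})
\end{verbatim}
Updating the specification then only requires a change in the master script, and every analysis script picks it up automatically.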
-% Self-promotion ------------------------------------------------ - +\subsection{Visualizing data} +\textbf{Data visualization}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_visualization}} \index{data visualization} +is increasingly popular, and is becoming a field in its own right.\cite{healy2018data,wilke2019fundamentals} Whole books have been written on how to create good data visualizations, so we will not attempt to give you advice on it. Rather, here are a few resources we have found useful. @@ -468,10 +555,12 @@ \subsection{Exporting outputs} Graphics tools like Stata are highly customizable. There is a fair amount of learning curve associated with extremely-fine-grained adjustment, but it is well worth reviewing the graphics manual\sidenote{\url{https://www.stata.com/manuals/g.pdf}} -For an easier way around it, Gray Kimbrough's \textit{Uncluttered Stata Graphs} code is an excellent default replacement for Stata graphics that is easy to install. +For an easier way around it, Gray Kimbrough's \textit{Uncluttered Stata Graphs} +code is an excellent default replacement for Stata graphics that is easy to install. \sidenote{\url{https://graykimbrough.github.io/uncluttered-stata-graphs/}} If you are a R user, the \textit{R Graphics Cookbook}\sidenote{\url{https://r-graphics.org/}} -is a great resource for the most popular visualization package \texttt{ggplot}\sidenote{\url{https://ggplot2.tidyverse.org/}}. +is a great resource for its most popular visualization package, \texttt{ggplot}\sidenote{ + \url{https://ggplot2.tidyverse.org/}}. But there are a variety of other visualization packages, such as \texttt{highcharter}\sidenote{\url{http://jkunst.com/highcharter/}}, \texttt{r2d3}\sidenote{\url{https://rstudio.github.io/r2d3/}}, @@ -480,44 +569,63 @@ \subsection{Exporting outputs} We have no intention of creating an exhaustive list, and this one is certainly missing very good references. But at least it is a place to start. + +We attribute some of the difficulty of creating good data visualizations +to the code needed to create them. +Making a visually compelling graph would already be hard enough if +you didn't have to go through many rounds of googling to understand a command. +The trickiest part of using plot commands is to get the data in the right format. +This is why we created the \textbf{Stata Visual Library}\sidenote{ + \url{https://worldbank.github.io/Stata-IE-Visual-Library/}}, +which has examples of graphs created in Stata and curated by us.\sidenote{ + A similar resource for R is \textit{The R Graph Gallery}. \\\url{https://www.r-graph-gallery.com/}} +The Stata Visual Library includes example data sets to use with each do-file, +so you get a good sense of what your data should look like +before you start writing code to create a visualization. + \section{Exporting analysis outputs} Our team has created a few products to automate common outputs and save you precious research time. The \texttt{ietoolkit} package includes two commands to export nicely formatted tables. -\texttt{iebaltab}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Iebaltab}} creates and exports balance tables to excel or {\LaTeX}. -\texttt{ieddtab}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Ieddtab}} does the same for difference-in-differences regressions. -The \textbf{Stata Visual Library}\sidenote{ \url{https://worldbank.github.io/Stata-IE-Visual-Library/}} -has examples of graphs created in Stata and curated by us.\sidenote{A similar resource for R is \textit{The R Graph Gallery}. \\\url{https://www.r-graph-gallery.com/}} -\textbf{Data visualization}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_visualization}} \index{data visualization} -is increasingly popular, and is becoming a field in its own right.\cite{healy2018data,wilke2019fundamentals} -We attribute some of this to the difficulty of writing code to create them. -Making a visually compelling graph would already be hard enough if you didn't have to go through many rounds of googling to understand a command. -The trickiest part of using plot commands is to get the data in the right format. -This is why the \textbf{Stata Visual Library} includes example data sets to use -with each do-file. +\texttt{iebaltab}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Iebaltab}} +creates and exports balance tables to Excel or {\LaTeX}. +\texttt{ieddtab}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Ieddtab}} +does the same for difference-in-differences regressions. +It also includes a command, \texttt{iegraph}\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Iegraph}}, +to export pre-formatted impact evaluation results graphs. It's ok to not export each and every table and graph created during exploratory analysis. -Final analysis scripts, on the other hand, should export final outputs, which are ready to be included to a paper or report. +Final analysis scripts, on the other hand, should export final outputs, +which are ready to be included in a paper or report. No manual edits, including formatting, should be necessary after exporting final outputs -- -those that require copying and pasting edited outputs, in particular, are absolutely not advisable. -Manual edits are difficult to replicate, and you will inevitably need to make changes to the outputs. +those that require copying and pasting edited outputs, +in particular, are absolutely not advisable. +Manual edits are difficult to replicate, +and you will inevitably need to make changes to the outputs. Automating them will save you time by the end of the process. -However, don't spend too much time formatting tables and graphs until you are ready to publish.\sidenote{For a more detailed discussion on this, including different ways to export tables from Stata, see \url{https://github.com/bbdaniels/stata-tables}} +However, don't spend too much time formatting tables and graphs until you are ready to publish.\sidenote{ + For a more detailed discussion on this, including different ways to export tables from Stata, see \url{https://github.com/bbdaniels/stata-tables}} Polishing final outputs can be a time-consuming process, and you want to it as few times as possible. -We cannot stress this enough: don't ever set a workflow that requires copying and pasting results. +We cannot stress this enough: +don't ever set a workflow that requires copying and pasting results. Copying results from excel to word is error-prone and inefficient. -Copying results from a software console is risk-prone, even more inefficient, and unnecessary. -There are numerous commands to export outputs from both R and Stata to a myriad of formats.\sidenote{Some examples are \href{ http://repec.sowi.unibe.ch/stata/estout/}{\texttt{estout}}, \href{https://www.princeton.edu/~otorres/Outreg2.pdf}{\texttt{outreg2}}, +Copying results from a software console is risk-prone, +even more inefficient, and unnecessary.
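For example, a balance table can be written straight to a {\LaTeX} file from the command line, so no copying and pasting is ever needed. The sketch below is illustrative only: the variable names are hypothetical, and it assumes \texttt{iebaltab}'s standard \texttt{grpvar()} and \texttt{savetex()} options.
\begin{verbatim}
* Balance table for baseline covariates, exported directly to LaTeX
iebaltab income age hhsize, grpvar(treatment) ///
    savetex("balance_table.tex") replace
\end{verbatim}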
+There are numerous commands to export outputs from both R and Stata to a myriad of formats.\sidenote{ + Some examples are \href{ http://repec.sowi.unibe.ch/stata/estout/}{\texttt{estout}}, \href{https://www.princeton.edu/~otorres/Outreg2.pdf}{\texttt{outreg2}}, and \href{https://www.benjaminbdaniels.com/stata-code/outwrite/}{\texttt{outwrite}} in Stata, and \href{https://cran.r-project.org/web/packages/stargazer/vignettes/stargazer.pdf}{\texttt{stargazer}} and \href{https://ggplot2.tidyverse.org/reference/ggsave.html}{\texttt{ggsave}} in R.} Save outputs in accessible and, whenever possible, lightweight formats. Accessible means that it's easy for other people to open them. -In Stata, that would mean always using \texttt{graph export} to save images as \texttt{.jpg}, \texttt{.png}, \texttt{.pdf}, etc., -instead of \texttt{graph save}, which creates a \texttt{.gph} file that can only be opened through a Stata installation. +In Stata, that would mean always using \texttt{graph export} to save images as +\texttt{.jpg}, \texttt{.png}, \texttt{.pdf}, etc., +instead of \texttt{graph save}, +which creates a \texttt{.gph} file that can only be opened through a Stata installation. Some publications require ``lossless'' TIFF of EPS files, which are created by specifying the desired extension. Whichever format you decide to use, remember to always specify the file extension explicitly. For tables there are less options and more consideration to be made. @@ -527,22 +635,31 @@ \section{Exporting analysis outputs} The amount of work needed in a copy-paste workflow increases rapidly with the number of tables and figures included in a research output, and do the chances of having the wrong version a result in your paper or report. - -% Formatting -If you need to create a table with a very particular format, that is not automated by any command you know, consider writing the it manually +If you need to create a table with a very particular format +that is not automated by any command you know, consider writing it manually (Stata's \texttt{filewrite}, for example, allows you to do that). -This will allow you to write a cleaner script that focuses on the econometrics, and not on complicated commands to create and append intermediate matrices. +This will allow you to write a cleaner script that focuses on the econometrics, +and not on complicated commands to create and append intermediate matrices. To avoid cluttering your scripts with formatting and ensure that formatting is consistent across outputs, define formatting options in an R object or a Stata global and call them when needed. -% Output content + Keep in mind that final outputs should be self-standing. This means it should be easy to read and understand them with only the information they contain. -Make sure labels and notes cover all relevant information, such as sample, unit of observation, unit of measurement and variable definition.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Checklist:\_Reviewing\_Graphs} \\ \url{https://dimewiki.worldbank.org/wiki/Checklist:\_Submit\_Table}} - -If you follow the steps outlined in this chapter, most of the data work involved in the last step of the research process -- publication -- will already be done. -If you used de-identified data for analysis, publishing the cleaned data set in a trusted repository will allow you to cite your data. -Some of the documentation produced during cleaning and construction can be published even if your data is too sensitive to be published.
-Your analysis code will be organized in a reproducible way, so will need to do release a replication package is a last round of code review. -This will allow you to focus on what matters: writing up your results into a compelling story. +Make sure labels and notes cover all relevant information, such as sample, +unit of observation, unit of measurement and variable definition.\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Checklist:\_Reviewing\_Graphs} \\ + \url{https://dimewiki.worldbank.org/wiki/Checklist:\_Submit\_Table}} + +If you follow the steps outlined in this chapter, +most of the data work involved in the last step of the research process +-- publication -- will already be done. +If you used de-identified data for analysis, +publishing the cleaned data set in a trusted repository will allow you to cite your data. +Some of the documentation produced during cleaning and construction can be published +even if your data is too sensitive to be published. +Your analysis code will be organized in a reproducible way, +so all you will need to do to release a replication package is a last round of code review. +This will allow you to focus on what matters: +writing up your results into a compelling story. %------------------------------------------------ From 5bb58785e6cf5ae37ec34e7a477c1e48480a8f75 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 11 Feb 2020 13:48:00 -0500 Subject: [PATCH 640/854] Title change --- manuscript.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/manuscript.tex b/manuscript.tex index a56a64209..2ca8a7d08 100644 --- a/manuscript.tex +++ b/manuscript.tex @@ -68,7 +68,7 @@ \chapter{Chapter 4: Sampling, randomization, and power} % CHAPTER 5 %---------------------------------------------------------------------------------------- -\chapter{Chapter 5: Collecting primary data} +\chapter{Chapter 5: Acquiring development data} \label{ch:5} \input{chapters/data-collection.tex} From db2a37fbb0a693b0365342fa4e089a1bb3ad04da Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 11 Feb 2020 13:55:59 -0500 Subject: [PATCH 641/854] typo --- chapters/data-analysis.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 5d7acd2ba..a7d89e992 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -429,7 +429,7 @@ \subsection{Construction tasks and how to approach them} but the best way to guarantee it won't happen is to create the indicators for all rounds in the same script. Say you constructed variables after baseline, and are now receiving midline data. Then the first thing you should do is create a panel data set --- \{texttt{iecodebook}'s \texttt{append} subcommand will help you reconcile and append survey rounds. +-- \texttt{iecodebook}'s \texttt{append} subcommand will help you reconcile and append survey rounds. After that, adapt the construction code so it can be used on the panel data set. Apart from preventing inconsistencies, this process will also save you time and give you an opportunity to review your original code. From 50e1d8107fbf390d846bcafc9ef4c41b09805d14 Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 11 Feb 2020 13:58:20 -0500 Subject: [PATCH 642/854] [ch6] change "raw data" to "original data points" --- chapters/data-analysis.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 2dac73707..38f9bd110 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -231,7 +231,7 @@ \subsection{Correction of data entry errors} looking for duplicated entries is usually part of data quality monitoring, and is typically addressed as part of that process. So, in practice, you will start writing data cleaning code during data collection. -The other only other case when changes to the raw data are made during cleaning +The only other case when changes to the original data points are made during cleaning is also directly connected to data quality monitoring: it's when you need to correct mistakes in data entry. During data quality monitoring, you will inevitably encounter data entry mistakes, From ff500d2f3c69aba4c2868e0dc4570b2824f2e5d4 Mon Sep 17 00:00:00 2001 From: Luiza Andrade Date: Tue, 11 Feb 2020 14:00:16 -0500 Subject: [PATCH 643/854] Update chapters/data-analysis.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- chapters/data-analysis.tex | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 38f9bd110..1d70b700c 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -188,6 +188,10 @@ \subsection{De-identification} assess them against the analysis plan and ask: will this variable be needed for analysis? If not, the variable should be dropped. +Don't be afraid to drop too many variables the first time, +as you can always go back and remove variables from the list of variables to be dropped, +but you cannot go back in time and drop a PII variable that was leaked +because it was incorrectly kept. Examples include respondent names, enumerator names, interview date, respondent phone number. If the variable is needed for analysis, ask: can I encode or otherwise construct a variable to use for the analysis that masks the PII,
If the answer is yes to either of these questions, From 342e476a9b640378837e8c18c29c5bb019122d58 Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 11 Feb 2020 14:06:38 -0500 Subject: [PATCH 645/854] [ch6] make de-identification its own section --- chapters/data-analysis.tex | 37 +++++++++++++++++++------------------ 1 file changed, 19 insertions(+), 18 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index f14a13ba5..a52824ec0 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -124,24 +124,9 @@ \subsection{Version control} %------------------------------------------------ -\section{Data cleaning} - -Data cleaning is the first stage of transforming the data you received from the field into data that you can analyze.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning}} -The cleaning process involves (1) making the data set easily usable and understandable, -and (2) documenting individual data points and patterns that may bias the analysis. -The underlying data structure does not change. -The cleaned data set should contain only the variables collected in the field. -No modifications to data points are made at this stage, except for corrections of mistaken entries. +\section{De-identification} -Cleaning is probably the most time consuming of the stages discussed in this chapter. -This is the time when you obtain an extensive understanding of the contents and structure of the data that was collected. -Explore your data set using tabulations, summaries, and descriptive plots. -You should use this time to understand the types of responses collected, both within each survey question and across respondents. -Knowing your data set well will make it possible to do analysis. - -\subsection{De-identification} - -The initial input for data cleaning is the raw data. +The starting point for all tasks described in this chapter is the raw data. It should contain only materials that are received directly from the field. They will invariably come in a host of file formats and nearly always contain personally-identifying information.\index{personally-identifying information} These files should be retained in the raw data folder \textit{exactly as they were received}. @@ -214,6 +199,22 @@ \subsection{De-identification} to find and confirm the identity of interviewees, de-identification should not affect the usability of the data. + +\section{Data cleaning} + +Data cleaning is the second stage in the transformation of data you received from the field into data that you can analyze.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning}} +The cleaning process involves (1) making the data set easily usable and understandable, +and (2) documenting individual data points and patterns that may bias the analysis. +The underlying data structure does not change. +The cleaned data set should contain only the variables collected in the field. +No modifications to data points are made at this stage, except for corrections of mistaken entries. + +Cleaning is probably the most time consuming of the stages discussed in this chapter. +This is the time when you obtain an extensive understanding of the contents and structure of the data that was collected. +Explore your data set using tabulations, summaries, and descriptive plots. +You should use this time to understand the types of responses collected, both within each survey question and across respondents. +Knowing your data set well will make it possible to do analysis. 
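A small sketch of this kind of exploration in Stata (with hypothetical variable names) is below; none of these commands modify the data, they only describe it:
\begin{verbatim}
* Quick overview of the contents and structure of the data set
describe
codebook, compact

* Distribution of responses for key variables
summarize income, detail
tabulate water_source, missing
\end{verbatim}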
+ \subsection{Correction of data entry errors} There are two main cases when the raw data will be modified during data cleaning. @@ -327,7 +328,7 @@ \subsection{The cleaned data set} \section{Indicator construction} % What is construction ------------------------------------- -The second stage in the creation of analysis data is construction. +The third stage in the creation of analysis data is construction. Constructing variables means processing the data points as provided in the raw data to make them suitable for analysis. It is at this stage that the raw data is transformed into analysis data. This is done by creating derived variables (dummies, indices, and interactions, to name a few), From 7df7d608720d7ce5a85022ee5706c940eb1a7b37 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 11 Feb 2020 14:08:45 -0500 Subject: [PATCH 646/854] Small tasks from #366 --- chapters/data-collection.tex | 376 ++++++++++++++++++----------------- 1 file changed, 191 insertions(+), 185 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index b64534bc4..9de753f04 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -2,11 +2,11 @@ \begin{fullwidth} Much of the recent push toward credibility in the social sciences has focused on analytical practices. -We contest that credible research depends, first and foremost, on the quality of the raw data. +We contest that credible research depends, first and foremost, on the quality of the raw data. -This chapter covers the data acquisition +This chapter covers the data acquisition -We then dive specifically into survey data, providing guidance on the data generation workflow, +We then dive specifically into survey data, providing guidance on the data generation workflow, from questionnaire design to programming electronic survey instruments and monitoring data quality. We conclude with a discussion of safe data handling, storage, and sharing. @@ -21,12 +21,14 @@ %------------------------------------------------ \section{Acquiring data} -Primary data is the key to most modern development research. +High-quality data is essential to most modern development research. Often, there is simply no source of reliable official statistics on the inputs or outcomes we are interested in. -Therefore we undertake to create or obtain new data, +Therefore we undertake to create or obtain development data +-- including administrative data, secondary data, original records, +field surveys, or other forms of big data -- typically in partnership with a local agency or organization. -The intention of primary data collection +The intention of this mode of data acquisition is to answer a unique question that cannot be approached in any other way, so it is important to properly collect and handle that data, especially when it belongs to or describes people. @@ -395,13 +397,12 @@ \subsection{Programming electronic questionnaires} %------------------------------------------------ \section{Data quality assurance} -Whether you are handling data from a partner or collecting it directly, +Whether you are acquiring data from a partner or collecting it directly, it is important to make sure that data faithfully reflects ground realities. Data quality assurance requires a combination of real-time data checks and back-checks or validation audits, which often means tracking down the people whose information is in the dataset. 
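As a simple illustration of what such a real-time check can look like in Stata (the ID variable and file name here are hypothetical; the next subsection discusses these checks in more detail):
\begin{verbatim}
* Stop with an error if any respondent ID appears more than once
isid respondent_id

* Flag observations that do not match the intended sample
merge 1:1 respondent_id using "master_sample_list.dta"
list respondent_id if _merge == 1   // in the data but not in the sample
list respondent_id if _merge == 2   // sampled but not yet received
\end{verbatim}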
\subsection{Implementing high frequency quality checks} A key advantage of continuous electronic data intake methods, as opposed to one-time data dumps, is the ability to access and analyze the data while the data collection is still ongoing. Any issues that are found can be traced back and corrected in real-time. High-frequency checks (HFCs) should carefully inspect key treatment and outcome variables so that the data quality of core experimental variables is uniformly high, and that additional effort is centered where it is most important. -Data quality checks should be run on the data every time it is received (ideally on a daily basis) -to flag irregularities in survey progress, sample completeness or response quality. +Data quality checks should be run on the data every time it is received from the field or partner +to flag irregularities in the acquisition progress, in sample completeness, or in response quality. \texttt{ipacheck}\sidenote{ \url{https://github.com/PovertyAction/high-frequency-checks}} is a very useful command that automates some of these tasks, regardless of the source of the data. It is important to check continuously that the observations in the data match the intended sample. -Many survey softwares provide some form of case management features +In surveys, the software often provides some form of case management features through which sampled units are directly assigned to individual enumerators. -For data recieved from partners this may be harder to validate, +For data received from partners, this may be harder to validate, since they are the authoritative source of the data, so cross-referencing with other data sources may be necessary to validate data. Even with careful management, it is often the case that raw data includes duplicate or missing entries, which may occur due to data entry errors or failed submissions to data servers. @@ -498,7 +499,7 @@ \subsection{Conducting back-checks and data validation} For back-checks and validation audies, a random subset of the main data is selected, and a subset of information from the full survey is verified through a brief targeted survey with the original respondent -or a cross-referenced data set from another source. +or a cross-referenced data set from another source (if the original data is not a field survey). Design of the back-checks or validations follows the same survey design principles discussed above: you should use the analysis plan or list of key outcomes to establish which subset of variables to prioritize, and similar questions should be used to assess whether the data collected is reliable. Real-time access to the data massively increases the potential utility of validation, and both simplifies and improves the rigor of the associated workflows. -You can use the raw primary data to draw the back-check or validation sample; +You can use the raw data to draw the back-check or validation sample; this ensures that the validation is correctly apportioned across observations. As soon as checking is complete, the comparator data can be tested against the original data to identify areas of concern in real-time. \textbf{Audio audits} are a useful means to assess whether enumerators are conducting interviews as expected. Do note, however, that audio audits must be included in the informed consent for the respondents. +\subsection{Finalizing data collection} + +When all data collection is complete, the survey team should prepare a final field report, +which should report reasons for any deviations between the original sample and the dataset collected.
+Identification and reporting of \textbf{missing data} and \textbf{attrition} +is critical to the interpretation of survey data. +It is important to structure this reporting in a way that not only +groups broad rationales into specific categories +but also collects all the detailed, open-ended responses +to questions the field team can provide for any observations that they were unable to complete. +This reporting should be validated and saved alongside the final raw data, and treated the same way. +This information should be stored as a dataset in its own right +-- a \textbf{tracking dataset} -- that records all events in which survey substitutions +and loss to follow-up occurred in the field and how they were implemented and resolved. + %------------------------------------------------ \section{Collecting and sharing data securely} -All sensitive data must be handled in a way where there is no risk that anyone who is -not approved by an Institutional Review Board (IRB)\sidenote{\url{ -https://dimewiki.worldbank.org/wiki/IRB\_Approval}} -for the specific project has the -ability to access the data. - Data can be sensitive for multiple reasons, but the two most +All sensitive data must be handled in a way where there is no risk that anyone who is +not approved by an Institutional Review Board (IRB)\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/IRB\_Approval}} +for the specific project has the ability to access the data. +Data can be sensitive for multiple reasons, but the two most common reasons are that it contains personally identifiable information (PII)\sidenote{ -\url{https://dimewiki.worldbank.org/wiki/Personally\_Identifiable\_Information\_(PII)}} -or that the partner providing the data does not want it to be released. + \url{https://dimewiki.worldbank.org/wiki/Personally\_Identifiable\_Information\_(PII)}} +or that the partner providing the data does not want it to be released. -Central to data security is \index{encryption}\textbf{data encryption} which is a group +Central to data security is \index{encryption}\textbf{data encryption}, which is a group of methods that ensure that files are unreadable even if laptops are stolen, servers -are hacked, or unauthorized access to the data is obtained in any other way. -\sidenote{\url{https://dimewiki.worldbank.org/wiki/Encryption}} -Proper encryption is rarely just one thing as the data will travel through many servers, -devices and computers from the source of the data to the final analysis. -So encryption should be seen as a system that is only as secure as its weakest link. -This section recommends a workflow with as few parts as possible, -so that it is easy as possible to make sure the weakest link is still strong enough. - -Encrypted data is made readable again using decryption, and decryption requires a password or a key. -You must never share passwords or keys by email, WhatsApp or other insecure modes of communication; -instead you must use a secure password manager\sidenote{\url{ -https://lastpass.com} or \url{https://bitwarden.com}}. -In addition to providing a way to securely share passwords, -password managers also provide a secure location for long term storage for passwords and keys regardless if -they are shared or not. - -Many data sharing software providers you are using will promote their services by saying they have -on-the-fly encryption and decryption. 
-While this is not a bad thing and it makes your data more secure, -on-the-fly encryption/decryption by itself is never secure enough, -as in order to make it automatic -they need to keep a copy of the password or key. -Since it unlikely that that software provider is -included in your IRB, this is not good enough. - -It is possible in some enterprise versions of data sharing software, to set up on-the-fly encryption. -However, that set up is advanced and you should never trust it unless you are a cyber security expert, -or a cyber security expert within your organization have specified what it can be used for. +are hacked, or unauthorized access to the data is obtained in any other way.\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Encryption}} +Proper encryption is rarely just a single method, +as the data will travel through many servers, devices, and computers +from the source of the data to the final analysis. +So encryption should be seen as a system that is only as secure as its weakest link. +This section recommends a workflow with as few parts as possible, +so that it is easy as possible to make sure the weakest link is still sufficiently secure. + +Encrypted data is made readable again using decryption, +and decryption requires a password or a key. +You must never share passwords or keys by email, +WhatsApp or other insecure modes of communication; +instead you must use a secure password manager.\sidenote{ + \url{https://lastpass.com} or \url{https://bitwarden.com}} +In addition to providing a way to securely share passwords, +password managers also provide a secure location +for long term storage for passwords and keys, whether they are shared or not. + +Many data-sharing software providers you are using will promote their services +by saying they have on-the-fly encryption and decryption. +While this is not a bad thing and it makes your data more secure, +on-the-fly encryption is never secure enough. +This is because, in order to make it automatic, +the service provider needs to keep a copy of the password or key. +Since it unlikely that that software provider is included in your IRB, +this is not secure enough. + +It is possible, in some enterprise versions of data sharing software, +to set up appropriately secure on-the-fly encryption. +However, that setup is advanced, and you should never trust it +unless you are a cybersecurity expert, +or a cybersecurity expert within your organization +has specified what it can be used for. In all other cases you should follow the steps laid out in this section. -\subsection{Data security during data collection} +\subsection{Collecting data securely} In field surveys, most common data collection software will automatically encrypt all data in transit (i.e., upload from field or download from server).\sidenote{ -\url{https://dimewiki.worldbank.org/wiki/Encryption\#Encryption\_in\_Transit}} -If this is implemented by the software you are using, -then your data will be encrypted from the time it leaves the device -(in tablet-assisted data collection) or browser (in web data collection), -until it reaches the server. -Therefore, as long as you are using an established survey software, -this step is largely taken care of. 
-However, the research team must ensure that all computers, tablets, -and accounts that are used in data collection have secure a logon + \url{https://dimewiki.worldbank.org/wiki/Encryption\#Encryption\_in\_Transit}} +If this is implemented by the software you are using, +then your data will be encrypted from the time it leaves the device +(in tablet-assisted data collection) or browser (in web data collection), +until it reaches the server. +Therefore, as long as you are using an established survey software, +this step is largely taken care of. +However, the research team must ensure that all computers, tablets, +and accounts that are used in data collection have secure a logon password and are never left unlocked. Even though your data is therefore usually safe while it is being transmitted, it is not automatically secure when it is being stored. \textbf{Encryption at rest}\sidenote{ -\url{https://dimewiki.worldbank.org/wiki/Encryption\#Encryption\_at\_Rest}} + \url{https://dimewiki.worldbank.org/wiki/Encryption\#Encryption\_at\_Rest}} is the only way to ensure that PII data remains private when it is stored on a -server on the internet. -You must keep your data encrypted on the data collection server whenever PII data is collected. -If you do not, the raw data will be accessible by -individuals who are not approved by your IRB, -such as tech support personnel, -server administrators and other third-party staff. +server on the internet. +You must keep your data encrypted on the data collection server whenever PII data is collected. +If you do not, the raw data will be accessible by +individuals who are not approved by your IRB, +such as tech support personnel, +server administrators and other third-party staff. Encryption at rest must be used to make -data files completely unusable without access to a security key specific to that -data -- a higher level of security than password-protection. -Encryption at rest requires active participation from the user, -and you should be fully aware that if your decryption key is lost, +data files completely unusable without access to a security key specific to that +data -- a higher level of security than password-protection. +Encryption at rest requires active participation from the user, +and you should be fully aware that if your decryption key is lost, there is absolutely no way to recover your data. -You should not assume that your data is encrypted at rest by default because of -the careful protocols necessary. -In most data collection platforms, -encryption at rest needs to be explicitly enabled and operated by the user. -There is no automatic way to implement this protocol, -because the encryption key that is generated may -never pass through the hands of a third party, +You should not assume that your data is encrypted at rest by default because of +the careful protocols necessary. +In most data collection platforms, +encryption at rest needs to be explicitly enabled and operated by the user. +There is no automatic way to implement this protocol, +because the encryption key that is generated may +never pass through the hands of a third party, including the data storage application. -Most survey software implement \textbf{asymmetric encryption}\sidenote{\url{ -https://dimewiki.worldbank.org/wiki/Encryption\#Asymmetric\_Encryption}} -where there are two keys in a public/private key pair. 
-Only the private key can be used to decrypt the encrypted data, +Most survey software implement \textbf{asymmetric encryption}\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Encryption\#Asymmetric\_Encryption}} +where there are two keys in a public/private key pair. +Only the private key can be used to decrypt the encrypted data, and the public key can only be used to encrypt the data. -It is therefore safe to send the public key to the tablet or the browser used to collect the data. +It is therefore safe to send the public key +to the tablet or the browser used to collect the data. When you enable encryption, the survey software will allow you to create and -download -- once -- the public/private keyfile pair needed to decrypt the data. +download -- once -- the public/private key pair needed to encrypt and decrypt the data. You upload the public key when you start a new survey, and all data collected using that -public key can only be accessed with the private key from that specific public/private key pair. -You must store the key pair in a secure location, such as a password manager, -as there is no way to access your data if the private key is lost. -Make sure you store keyfiles with descriptive names to match the survey to which they correspond. -Any time anyone accesses the data -- -either when viewing it in the browser or downloading it to your computer -- -they will be asked to provide the keyfile. -Only project team members named in the IRB are allowed access to the private keyfile. - -\subsection{Data security after data collection} - -For most analytical needs, you typically need a to store the data somewhere else -than the survey software's server, for example your computer or a cloud drive. -While asymmetric encryption is optimal for one-way transfer from the data collection device -to the data collection server, +public key can only be accessed with the private key from that specific public/private key pair. +You must store the key pair in a secure location, such as a password manager, +as there is no way to access your data if the private key is lost. +Make sure you store keyfiles with descriptive names to match the survey to which they correspond. +Any time anyone accesses the data -- +either when viewing it in the browser or downloading it to your computer -- +they will be asked to provide the key. +Only project team members named in the IRB are allowed access to the private key. + +\subsection{Storing data securely} + +For most analytical needs, you typically need a to store the data somewhere other +than the survey software's server, for example, on your computer or a cloud drive. +While public/private key encryption is optimal for one-way transfer +from the data collection device to the data collection server, it is not practical once you start interacting with the data. - -Instead we want to use \textbf{symmetric encryption}\sidenote{ -\url{https://dimewiki.worldbank.org/wiki/Encryption\#Symmetric\_Encryption}} -where we create a secure encrypted folder, -using for example VeraCrypt\sidenote{\url{https://www.veracrypt.fr/}}, -where a single key is used to both encrypt and decrypt the information. -Since only one key is used, the work flow can be simplified, -the re-encryption after decrypting can be done automatically and the same secure folder can be used for multiple files, -and these files can be interacted with and modified like any unencryted file as long as you have the key. 
+Instead, we use \textbf{symmetric encryption}\sidenote{ + \url{https://dimewiki.worldbank.org/wiki/Encryption\#Symmetric\_Encryption}} +where we create a secure encrypted folder, +using, for example, VeraCrypt.\sidenote{\url{https://www.veracrypt.fr/}} +Here, a single key is used to both encrypt and decrypt the information. +Since only one key is used, the workflow can be simplified: +the re-encryption after decrypted access can be done automatically, +and the same secure folder can be used for multiple files. +These files can be interacted with and modified like any unencrypted file as long as you have the key. The following workflow allows you to receive data and store it securely, without compromising data security: \begin{enumerate} - \item Create a secure encrypted folder in your project folder, - this should be on your computer and could be in a shared folder. - \item Download data from the data collection server to that secure folder -- - if you encrypted the data during data collection you will need \textit{both} the - private key used during data collection to be able to download the data, - \textit{and} you will need the key used when created the secure folder to save it there. - This your first copy of your raw data, and the copy you will used in your cleaning and analysis. - \item Then create a secure folder on a pen-drive or a external hard drive, - that you can keep in your office. - Copy the data you just downloaded to this second secure folder. - This is your ''master'' copy of your raw data. - (Instead of creating a second secure folder, you can simply copy the first secure folder) - \item Finally, create a third secure folder. - Either you can create this on your computer and upload it to a long-term cloud storage service, - or you can create it on an external hard drive that you then store in a separate location, - for example at another office of your organization. - This is your ''golden master'' copy of your raw data. - You should never store the ''golden master'' copy of your raw data in a synced - folder where it is also deleted in the cloud storage if it is deleted on your computer. - (Instead of creating a third secure folder, you can simply copy the first secure folder). + \item Create a secure encrypted folder in your project folder. + This should be on your computer, and could be in a shared folder. + \item Download data from the data collection server to that secure folder. + If you encrypted the data during data collection, you will need \textit{both} the + private key used during data collection to be able to download the data, + \textit{and} you will need the key used when you created the secure folder to save it there. + This your first copy of your raw data, and the copy you will use in your cleaning and analysis. + \item Create a secure folder on a pen-drive or a external hard drive that you can keep in your office. + Copy the data you just downloaded to this second secure folder. + This is your ``master'' copy of your raw data. + (Instead of creating a second secure folder, you can simply copy the first secure folder.) + \item Finally, create a third secure folder. + Either you can create this on your computer and upload it to a long-term cloud storage service, + or you can create it on an external hard drive that you then store in a separate location, + for example, at another office of your organization. + This is your ``golden master'' copy of your raw data. 
+ You should never store the ``golden master'' copy of your raw data in a synced + folder, where it is also deleted in the cloud storage if it is deleted on your computer. + (Instead of creating a third secure folder, you can simply copy the first secure folder.) \end{enumerate} -\noindent This handling satisfies the \textbf{3-2-1 rule}: -there are two on-site copies of the data and one off-site copy, -so the data can never be lost in case of hardware failure.\sidenote{\url{ -https://www.backblaze.com/blog/the-3-2-1-backup-strategy/}} -However, you still need to keep track of your encryption keys as without them your data is lost. -If you remain lucky, you will never have to access your ``master'' or ``golden master'' copies -- +\noindent This handling satisfies the \textbf{3-2-1 rule}: +there are two on-site copies of the data and one off-site copy, +so the data can never be lost in case of hardware failure.\sidenote{ + \url{https://www.backblaze.com/blog/the-3-2-1-backup-strategy/}} +However, you still need to keep track of your encryption keys as without them your data is lost. +If you remain lucky, you will never have to access your ``master'' or ``golden master'' copies -- you just want to know it is out there, safe, if you need it. -\subsection{Secure data sharing} -You and your team will use your first copy of the raw data as the starting point for data -cleaning analysis of the data. -This raw data set must remain encrypted at all times if it includes PII data, -which is almost always the case. -As long as the data is properly encrypted, using for example VeraCrypt, -it can be shared using insecure modes of communication such as email or third-party syncing services. -While this is safe from a data security perspective, -this is a burdensome workflow as anyone accessing the raw data must be listed on the IRB, -have access to the decryption key and know how to use that key. -Fortunately there is a way to simplify the workflow without compromising data security. - -To simplify the workflow, -the PII variables should be removed from your data at the earliest -possible opportunity creating a de-identified copy of the data. +\subsection{Sharing data securely} +You and your team will use your first copy of the raw data +as the starting point for data cleaning and analysis of the data. +This raw data set must remain encrypted at all times if it includes PII data, +which is almost always the case. +As long as the data is properly encrypted, +it can be shared using insecure modes of communication +such as email or third-party syncing services. +While this is safe from a data security perspective, +this is a burdensome workflow, as anyone accessing the raw data must be listed on the IRB, +have access to the decryption key and know how to use that key. +Fortunately, there is a way to simplify the workflow without compromising data security. + +To simplify the workflow, +the PII variables should be removed from your data at the earliest +possible opportunity creating a de-identified copy of the data. Once the data is de-identified, -it no longer needs to be encrypted -- -therefore you and you team members can share it directly -without having to encrypt it and handle decryption keys. -ext chapter will discuss how to de-identify your data. -If PII variables are directly required for the analysis itself, +it no longer needs to be encrypted -- +you and you team members can share it directly +without having to encrypt it and handle decryption keys. 
+The next chapter will discuss how to de-identify your data. +If PII variables are directly required for the analysis itself, it will be necessary to keep at least a subset of the data encrypted through the data analysis process. -The data security standards that apply when receiving PII data obviously also apply when sending PII data. -A common example where this is often forgotten is when sending survey information, -such as sampling lists, to the field partner. -This data is by all definitions also PII data and must be encrypted. +The data security standards that apply when receiving PII data also apply when transferring PII data. +A common example where this is often forgotten is when sending survey information, +such as sampling lists, to a field partner. +This data is -- by all definitions -- also PII data and must be encrypted. A sampling list can often be used to reverse identify a de-identified data set, -so if you were to share it using an insecure method, -then that would be your weakest link that could break all the other steps +so if you were to share it using an insecure method, +then that would be your weakest link that could render useless all the other steps you have taken to ensure the privacy of the respondents. -In some survey software you can use the same encryption that allows you to receive data securely -from the field, to also send data, such a sampling list, to the field. -But if you are not sure how that is done, or even can be done, -in the survey software you are using, -then you should create a secure folder using, for example, -VeraCrypt and share that secure folder with the field team. +In some survey software, you can use the same encryption that allows you to receive data securely +from the field, to also send data, such a sampling list, to the field. +But if you are not sure how that is done, or even can be done, +in the survey software you are using, +then you should create a secure folder using, for example, +VeraCrypt and share that secure folder with the field team. Remember that you must always share passwords and keys in a secure way like password managers. - -\section{Finalizing data collection} - -When all data collection is complete, the survey team should prepare a final field report, -which should report reasons for any deviations between the original sample and the dataset collected. -Identification and reporting of \textbf{missing data} and \textbf{attrition} -is critical to the interpretation of survey data. -It is important to structure this reporting in a way that not only -groups broad rationales into specific categories -but also collects all the detailed, open-ended responses -to questions the field team can provide for any observations that they were unable to complete. -This reporting should be validated and saved alongside the final raw data, and treated the same way. -This information should be stored as a dataset in its own right --- a \textbf{tracking dataset} -- that records all events in which survey substitutions -and loss to follow-up occurred in the field and how they were implemented and resolved. - At this point, the raw data securely stored and backed up. It can now be transformed into your final analysis data set, through the steps described in the next chapter. -Once the data collection is over, +Once the data collection is over, you typically will no longer need to interact with the identified data. So you should create a working version of it that you can safely interact with. 
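A minimal sketch of creating that working version in Stata is below; the variables dropped are hypothetical examples of direct identifiers:
\begin{verbatim}
* Load the decrypted raw data, remove direct identifiers,
* and save a de-identified working copy for day-to-day use
use "raw_survey_data.dta", clear
drop respondent_name phone_number gps_latitude gps_longitude
save "survey_deidentified.dta", replace
\end{verbatim}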
This is described in the next chapter as the first task in the data cleaning process, -but it's useful to get it started as soon as encrypted data is downloaded to disk. \ No newline at end of file +but it's useful to get it started as soon as encrypted data is downloaded to disk. From adb453739d7bb9a8b9a2aa1ba48efba5e200e60a Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 11 Feb 2020 14:19:43 -0500 Subject: [PATCH 647/854] Chp 1 revision --- chapters/handling-data.tex | 68 +++++++++++++++++--------------------- 1 file changed, 31 insertions(+), 37 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 1fd0ce8c8..ffcc7a861 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -6,37 +6,31 @@ Policy decisions are made every day using the results of briefs and studies, and these can have wide-reaching consequences on the lives of millions. As the range and importance of the policy-relevant questions asked by development researchers grow, - so too does the (rightful) scrutiny under which methods and results are placed. - It is useful to think of research as a public service, - one that requires you to be accountable to both research participants and research consumers. - - On the research participant side, - it is essential to respect individual \textbf{privacy} and ensure \textbf{data security}. + so too does the (rightful) scrutiny under which methods and results are placed. + It is useful to think of research as a public service, + one that requires you to be accountable to both research participants and research consumers. + + On the research participant side, + it is essential to respect individual \textbf{privacy} and ensure \textbf{data security}. Researchers look deeply into real people's personal lives, financial conditions, and other sensitive subjects. - Respecting the respondents' right to privacy, - by intelligently assessing and proactively averting risks they might face, + Respecting the respondents' right to privacy, + by intelligently assessing and proactively averting risks they might face, is a core tenet of research ethics. - On the consumer side, it is important to protect confidence in development research - by following modern practices for \textbf{transparency} and \textbf{reproducibility}. + On the consumer side, it is important to protect confidence in development research + by following modern practices for \textbf{transparency} and \textbf{reproducibility}. Across the social sciences, the open science movement has been fueled by discoveries of low-quality research practices, data and code that are inaccessible to the public, analytical errors in major research papers, and in some cases even outright fraud. While the development research community has not yet - experienced any major scandals, it has become clear that there are necessary incremental improvements + experienced any major scandals, it has become clear that there are necessary incremental improvements in the way that code and data are handled as part of research. - Neither privacy nor transparency is an ``all-or-nothing'' objective. - Most important is to report the transparency and privacy measures taken. - Otherwise, reputation is the primary signal for the quality of evidence, and two failures may occur: - low-quality studies from reputable sources may be used as evidence when in fact they don't warrant it, - and high-quality studies from sources without an international reputation may be ignored. 
- Both these outcomes reduce the quality of evidence overall. - Simple transparency standards mean that it is easier to judge research quality, - and identifying high-quality research increases its impact. - + Neither privacy nor transparency is an ``all-or-nothing'' objective: + the most important thing is to report the transparency and privacy measures you have taken + and always strive to to the best that you are capable of with current technology. In this chapter, we outline a set of practices that help to ensure research participants are appropriately protected and research consumers can be confident in the conclusions reached. - + \end{fullwidth} %------------------------------------------------ @@ -126,9 +120,9 @@ \subsection{Research transparency} because it requires methodical organization that is labor-saving and efficient over the complete course of a project. Tools like \textbf{pre-registration}\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Pre-Registration}}, + \url{https://dimewiki.worldbank.org/wiki/Pre-Registration}}, \textbf{pre-analysis plans}\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Pre-Analysis_Plan}}, + \url{https://dimewiki.worldbank.org/wiki/Pre-Analysis_Plan}}, and \textbf{registered reports}\sidenote{ \url{https://blogs.worldbank.org/impactevaluations/registered-reports-piloting-pre-results-review-process-journal-development-economics}} can help with this process where they are available.\index{pre-registration}\index{pre-analysis plans}\index{Registered Reports} @@ -148,7 +142,7 @@ \subsection{Research transparency} Documenting a project in detail greatly increases transparency. Many disciplines have a tradition of keeping a ``lab notebook'', and adapting and expanding this process to create a -lab-style workflow in the development field is a +lab-style workflow in the development field is a critical step towards more transparent practices. This means explicitly noting decisions as they are made, and explaining the process behind the decision-making. @@ -285,7 +279,7 @@ \section{Ensuring privacy and security in research data} \subsection{Obtaining ethical approval and consent} -For almost all data collection and research activities that involves +For almost all data collection and research activities that involves human subjects or PII data, you will be required to complete some form of \textbf{Institutional Review Board (IRB)} process.\sidenote{ \textbf{Institutional Review Board (IRB):} An institution formally responsible for ensuring that research meets ethical standards.} @@ -350,10 +344,10 @@ \subsection{Transmitting and storing data securely} during data collection, storage, and transfer.\index{encryption}\index{data transfer}\index{data storage} The biggest security gap is often in transmitting survey plans to and from staff in the field. To protect information in transit to field staff, some key steps are: -(a) ensure that all devices that store confidential data +(a) ensure that all devices that store confidential data have hard drive encryption and password-protection; -(b) never send confidential data over e-mail, WhatsApp, etc. -without encrypting the information first; and +(b) never send confidential data over e-mail, WhatsApp, etc. +without encrypting the information first; and (c) train all field staff on the adequate privacy standards applicable to their work. 
Most modern data collection software has features that,
if enabled and adhered to, make secure transmission straightforward.
Many also have features that ensure data is encrypted when stored on their servers,
although this usually needs to be actively enabled and administered.\sidenote{
 \url{https://dimewiki.worldbank.org/wiki/Encryption\#Encryption\_at\_Rest}}
-When files are properly encrypted,
+When files are properly encrypted,
the information they contain will be completely unreadable and unusable
-even if they were to be intercepted my a malicious
-``intruder'' or accidentally made public.
+even if they were to be intercepted by a malicious
+``intruder'' or accidentally made public.
When the proper data security precautions are taken,
no one who is not listed on the IRB may have access to the decryption key.
-This means that is it usually not
-enough to rely service providers' on-the-fly encryption as they need to keep a copy
+This means that it is usually not
+enough to rely on service providers' on-the-fly encryption, as they need to keep a copy
of the decryption key to make it automatic.
The easiest way to reduce the risk of leaking personal information is to use it as rarely as possible.
@@ -440,12 +434,12 @@ \subsection{De-identifying and anonymizing information}
by using some other data that becomes identifying when analyzed together.
For this reason, we recommend de-identification in two stages.
The \textbf{initial de-identification} process strips the data of direct identifiers
-to create a working de-identified dataset that
+to create a working de-identified dataset that
can be shared \textit{within the research team} without the need for encryption.
The \textbf{final de-identification} process involves
-making a decision about the trade-off between
-risk of disclosure and utility of the data
+making a decision about the trade-off between
+risk of disclosure and utility of the data
before publicly releasing a dataset.\sidenote{
 \url{https://sdcpractice.readthedocs.io/en/latest/SDC\_intro.html\#need-for-sdc}}
We will provide more detail about the process and tools available
-for initial and final de-identification in chapters 6 and 7, respectively.
\ No newline at end of file
+for initial and final de-identification in chapters 6 and 7, respectively.

From 5650cb234a7d7044d2dcda86bd292c4dd4a78f25 Mon Sep 17 00:00:00 2001
From: Luiza
Date: Tue, 11 Feb 2020 14:20:03 -0500
Subject: [PATCH 648/854] [ch6] string variables

---
 chapters/data-analysis.tex | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex
index a52824ec0..339dce62f 100644
--- a/chapters/data-analysis.tex
+++ b/chapters/data-analysis.tex
@@ -265,8 +265,11 @@ \subsection{Labeling and annotating the raw data}
 Variable and value labels should be accurate and concise.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning\#Applying\_Labels}}
 Third, recodes should be used to turn codes for ``Don't know'', ``Refused to answer'', and
 other non-responses into extended missing values.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning\#Survey\_Codes\_and\_Missing\_Values}}
-String variables need to be encoded, and open-ended responses, categorized or dropped\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning\#Strings}}
-(unless you are using qualitative or classification analyses, which are less common).
+String variables that correspond to categorical variables need to be encoded.
+Open-ended responses stored as strings usually have a high risk of being identifiers,
+so they should be dropped at this point.
+You can use the encrypted data as an input to a construction script
+that categorizes these responses and merges them to the rest of the dataset.
Finally, any additional information collected only for quality monitoring purposes,
such as notes and duration fields, can also be dropped.

From 626fbca6f7412f8b4360ebf657abcc7185c0d24a Mon Sep 17 00:00:00 2001
From: Benjamin Daniels
Date: Tue, 11 Feb 2020 15:21:29 -0500
Subject: [PATCH 649/854] Chp 5 revision

---
 chapters/data-collection.tex | 35 +++++++++++++++++++++++++++--------
 1 file changed, 27 insertions(+), 8 deletions(-)

diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex
index 9de753f04..f36acc681 100644
--- a/chapters/data-collection.tex
+++ b/chapters/data-collection.tex
@@ -2,19 +2,38 @@
 \begin{fullwidth}
 Much of the recent push toward credibility in the social sciences has focused on analytical practices.
-We contest that credible research depends, first and foremost, on the quality of the raw data.
-
-This chapter covers the data acquisition
+However, credible development research often depends, first and foremost, on the quality of the raw data.
+This is because, when you are collecting the data yourself,
+or it is provided only to you through a unique partnership,
+there is no way for others to validate that it actually reflects the field reality
+and that the indicators you have based your analysis on are meaningful.
+This chapter details the necessary components for a high-quality data acquisition process,
+no matter whether you are receiving large amounts of unique data from partners
+or fielding a small, specialized custom survey.
+It begins with a discussion of some key ethical and legal considerations
+to ensure that you have the right to do research using a specific dataset.
+Particularly when sensitive data is being collected by you
+or shared with you from a program implementer, government, or other partner,
+you need to make sure these permissions are correctly granted and documented,
+so that the ownership and licensing of all information is established
+and the privacy rights of the people it describes are respected.
 We then dive specifically into survey data,
 providing guidance on the data generation workflow,
 from questionnaire design to programming electronic survey instruments and monitoring data quality.
-We conclude with a discussion of safe data handling, storage, and sharing.
-
+While surveys remain popular, the rise of electronic data collection instruments
+means that there are additional workflow considerations needed
+to ensure that your data is accurate and usable in statistical software.
 There are many excellent resources on questionnaire design and field supervision,
 but few covering the particular challenges and opportunities presented by electronic surveys.
-As there are many survey software, and the market is rapidly evolving, we focus on workflows and primary concepts, rather than software-specific tools.
-
-
+As there are many survey software platforms, and the market is rapidly evolving,
+we focus on workflows and primary concepts, rather than software-specific tools.
+We conclude with a discussion of safe handling, storage, and sharing of data.
+Regardless of the type of data you collect, +the secure management of those files is a basic requirement +for satisfying the legal and ethical agreements that have allowed you +to access personal information for research purposes in the first place. +By following these guidelines, you will be able to move on to data analysis, +assured that your data has been obtained at high standards of both quality and security. \end{fullwidth} From 0040f479d4388337bf92522994848d9201c22f72 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 11 Feb 2020 15:23:47 -0500 Subject: [PATCH 650/854] Remove emphasis in intros --- chapters/data-analysis.tex | 284 ++++++++++++++++---------------- chapters/handling-data.tex | 13 +- chapters/planning-data-work.tex | 11 +- chapters/publication.tex | 25 ++- chapters/research-design.tex | 11 +- 5 files changed, 169 insertions(+), 175 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index dc4a851a2..7ae01f360 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -1,65 +1,63 @@ %------------------------------------------------ \begin{fullwidth} -Transforming raw data into a substantial contribution to scientific knowledge -requires a mix of subject expertise, programming skills, -and statistical and econometric knowledge. +Transforming raw data into a substantial contribution to scientific knowledge +requires a mix of subject expertise, programming skills, +and statistical and econometric knowledge. The process of data analysis is typically -a back-and-forth discussion between people -with differing skill sets. -An essential part of the process is translating the -raw data received from the field into economically meaningful indicators. -To effectively do this in a team environment, -data, code and outputs must be well-organized, -with a clear system for version control, -and analytical scripts structured such that any member of the research team can run them. -Putting in time upfront to structure data work well -pays off substantial dividends throughout the process. - -In this chapter, we first cover data management: -how to organize your data work at the start of a project +a back-and-forth discussion between people +with differing skill sets. +An essential part of the process is translating the +raw data received from the field into economically meaningful indicators. +To effectively do this in a team environment, +data, code and outputs must be well-organized, +with a clear system for version control, +and analytical scripts structured such that any member of the research team can run them. +Putting in time upfront to structure data work well +pays off substantial dividends throughout the process. + +In this chapter, we first cover data management: +how to organize your data work at the start of a project so that coding the analysis itself is straightforward. -This includes setting up folders, organizing tasks, master scripts, +This includes setting up folders, organizing tasks, master scripts, and putting in place a version control system so that your work is easy for all research team members to follow, -and meets standards for transparency and reproducibility. -Second, we turn to de-identification, +and meets standards for transparency and reproducibility. +Second, we turn to de-identification, a critical step when working with any personally-identified data. 
In the third section, we offer detailed guidance on data cleaning, -from identifying duplicate entries to labeling and annotating raw data, -and how to transparently document the cleaning process. -Section four focuses on how to transform your clean data -into the actual indicators you will need for analysis, +from identifying duplicate entries to labeling and annotating raw data, +and how to transparently document the cleaning process. +Section four focuses on how to transform your clean data +into the actual indicators you will need for analysis, again emphasizing the importance of transparent documentation. -Finally, we turn to analysis itself. -We do not offer instructions on how to conduct specific analyses, +Finally, we turn to analysis itself. +We do not offer instructions on how to conduct specific analyses, as that is determined by research design; rather, we discuss how to structure analysis code, and how to automate common outputs so that your analysis is fully reproducible. - - \end{fullwidth} %------------------------------------------------ \section{Data management} -The goal of data management is to organize the components of data work +The goal of data management is to organize the components of data work so it can traced back and revised without massive effort. -In our experience, there are four key elements to good data management: -folder structure, task breakdown, master scripts, and version control. +In our experience, there are four key elements to good data management: +folder structure, task breakdown, master scripts, and version control. A good folder structure organizes files so that any material can be found when needed. It reflects a task breakdown into steps with well-defined inputs, tasks, and outputs. This breakdown is applied to code, data sets, and outputs. A master script connects folder structure and code. It is a one-file summary of your whole project. -Finally, version histories and backups enable the team +Finally, version histories and backups enable the team to edit files without fear of losing information. -Smart use of version control also allows you to track +Smart use of version control also allows you to track how each edit affects other files in the project. \subsection{Folder structure} -There are many schemes to organize research data. +There are many schemes to organize research data. Our preferred scheme reflects the task breakdown just discussed. \index{data organization} DIME Analytics created the \texttt{iefolder}\sidenote{ @@ -67,36 +65,36 @@ \subsection{Folder structure} package (part of \texttt{ietoolkit}\sidenote{ \url{https://dimewiki.worldbank.org/wiki/ietoolkit}}) to standardize folder structures across teams and projects. -This means that PIs and RAs face very small costs when switching between projects, +This means that PIs and RAs face very small costs when switching between projects, because they are organized in the same way.\sidenote{\url{https://dimewiki.worldbank.org/wiki/DataWork\_Folder}} We created the command based on our experience with primary data, but it can be used for different types of data. -Whatever you team may need in terms of organization, +Whatever you team may need in terms of organization, the principle of creating one standard remains. 
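To make the idea of a single standard concrete, the sketch below shows one way a master do-file can define the agreed folder structure as globals that every other script refers to.
The root path and folder names here are placeholders for illustration only, not a prescription;
\texttt{iefolder} can generate a comparable structure for you automatically.

\begin{verbatim}
* Minimal sketch: folder globals defined once in the master do-file.
* The root path and folder names below are placeholders -- adapt them
* to whatever standard your team has agreed on.
global projectfolder "C:/Users/yourname/Dropbox/MyProject/DataWork"
global raw           "${projectfolder}/Baseline/DataSets/Raw"
global intermediate  "${projectfolder}/Baseline/DataSets/Intermediate"
global final         "${projectfolder}/Baseline/DataSets/Final"
global outputs       "${projectfolder}/Baseline/Output"

* Every other script refers only to these globals, never to hard-coded paths:
use "${raw}/baseline_raw.dta", clear
\end{verbatim}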
At the first level of the structure created by \texttt{iefolder} are what we call survey round folders.\sidenote{\url{https://dimewiki.worldbank.org/wiki/DataWork\_Survey\_Round}} -You can think of a ``round'' as one source of data, -that will be cleaned in the same script. -Inside round folders, there are dedicated folders for -raw (encrypted) data; de-identified data; cleaned data; and final (constructed) data. -There is a folder for raw results, as well as for final outputs. -The folders that hold code are organized in parallel to these, -so that the progression through the whole project can be followed by anyone new to the team. -Additionally, \texttt{iefolder} creates \textbf{master do-files}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Master\_Do-files}} +You can think of a ``round'' as one source of data, +that will be cleaned in the same script. +Inside round folders, there are dedicated folders for +raw (encrypted) data; de-identified data; cleaned data; and final (constructed) data. +There is a folder for raw results, as well as for final outputs. +The folders that hold code are organized in parallel to these, +so that the progression through the whole project can be followed by anyone new to the team. +Additionally, \texttt{iefolder} creates \textbf{master do-files}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Master\_Do-files}} so all project code is reflected in a top-level script. \subsection{Task breakdown} -We divide the process of turning raw data into analysis data into three stages: -data cleaning, variable construction, and data analysis. -Though they are frequently implemented at the same time, -we find that creating separate scripts and data sets prevents mistakes. -It will be easier to understand this division as we discuss what each stage comprises. -What you should know by now is that each of these stages has well-defined inputs and outputs. -This makes it easier to track tasks across scripts, -and avoids duplication of code that could lead to inconsistent results. -For each stage, there should be a code folder and a corresponding data set. +We divide the process of turning raw data into analysis data into three stages: +data cleaning, variable construction, and data analysis. +Though they are frequently implemented at the same time, +we find that creating separate scripts and data sets prevents mistakes. +It will be easier to understand this division as we discuss what each stage comprises. +What you should know by now is that each of these stages has well-defined inputs and outputs. +This makes it easier to track tasks across scripts, +and avoids duplication of code that could lead to inconsistent results. +For each stage, there should be a code folder and a corresponding data set. The names of codes, data sets and outputs for each stage should be consistent, -making clear how they relate to one another. +making clear how they relate to one another. So, for example, a script called \texttt{clean-section-1} would create a data set called \texttt{cleaned-section-1}. @@ -109,22 +107,22 @@ \subsection{Task breakdown} \subsection{Master scripts} Master scripts allow users to execute all the project code from a single file. -They briefly describes what each code, -and maps the files they require and create. -They also connects code and folder structure through globals or objects. -In short, a master script is a human-readable map to the tasks, -files and folder structure that comprise a project. 
-Having a master script eliminates the need for complex instructions to replicate results. +They briefly describes what each code, +and maps the files they require and create. +They also connects code and folder structure through globals or objects. +In short, a master script is a human-readable map to the tasks, +files and folder structure that comprise a project. +Having a master script eliminates the need for complex instructions to replicate results. Reading the master do-file should be enough for anyone unfamiliar with the project to understand what are the main tasks, which scripts execute them, -and where different files can be found in the project folder. +and where different files can be found in the project folder. That is, it should contain all the information needed to interact with a project's data work. \subsection{Version control} -Finally, everything that can be version-controlled should be. +Finally, everything that can be version-controlled should be. Version control allows you to effectively track code edits, -including the addition and deletion of files. -This way you can delete code you no longer need, +including the addition and deletion of files. +This way you can delete code you no longer need, and still recover it easily if you ever need to get back previous work. Both analysis results and data sets will change with the code. You should have each of them stored with the code that created it. @@ -133,18 +131,18 @@ \subsection{Version control} and metadata saved in \texttt{.txt} or \texttt{.csv} to that directory. Binary files that compile the tables, as well as the complete data sets, on the other hand, -should be stored in your team's shared folder. +should be stored in your team's shared folder. Whenever data cleaning or data construction codes are edited, use the master script to run all the code for your project. -Git will highlight the changes that were in data sets and results that they entail. +Git will highlight the changes that were in data sets and results that they entail. %------------------------------------------------ \section{Data cleaning} -Data cleaning is the first stage of transforming the data you received from the field +Data cleaning is the first stage of transforming the data you received from the field into data that you can analyze.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning}} -The cleaning process involves (1) making the data set easily usable and understandable, +The cleaning process involves (1) making the data set easily usable and understandable, and (2) documenting individual data points and patterns that may bias the analysis. The underlying data structure does not change. The cleaned data set should contain only the variables collected in the field. @@ -162,18 +160,18 @@ \subsection{De-identification} It should contain only materials that are received directly from the field. They will invariably come in a host of file formats and nearly always contain personally-identifying information.\index{personally-identifying information} These files should be retained in the raw data folder \textit{exactly as they were received}. -Be mindful of where this file is stored. +Be mindful of where this file is stored. Maintain a backup copy in a secure offsite location. Every other file is created from the raw data, and therefore can be recreated. -The exception, of course, is the raw data itself, so it should never be edited +The exception, of course, is the raw data itself, so it should never be edited directly. 
The rare and only case when the raw data can be edited directly is when it is encoded incorrectly and some non-English character is causing rows or columns to break at the wrong place -when the data is imported. +when the data is imported. In this scenario, you will have to remove the special character manually, save the resulting data set \textit{in a new file} and securely back up \textit{both} the broken and the fixed version of the raw data. Note that no one who is not listed in the IRB should be able to access its content, not even the company providing file-sharing services. -Check if your organization has guidelines on how to store data securely, as they may offer an institutional solution. +Check if your organization has guidelines on how to store data securely, as they may offer an institutional solution. If that is not the case, you will need to encrypt the data, especially before sharing it, and make sure that only IRB-listed team members have the encryption key.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Encryption}}. @@ -181,16 +179,16 @@ \subsection{De-identification} Secure storage of the raw data means access to it will be restricted even inside the research team.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Security}} Loading encrypted data frequently can be disruptive to the workflow. To facilitate the handling of the data, remove any personally identifiable information from the data set. -This will create a de-identified data set, that can be saved in a non-encrypted folder. +This will create a de-identified data set, that can be saved in a non-encrypted folder. De-identification,\sidenote{\url{https://dimewiki.worldbank.org/wiki/De-identification}} at this stage, means stripping the data set of direct identifiers such as names, phone numbers, addresses, and geolocations.\sidenote{\url{ https://www.povertyactionlab.org/sites/default/files/resources/J-PAL-guide-to-deidentifying-data.pdf}} The resulting de-identified data will be the underlying source for all cleaned and constructed data. -Because identifying information is typically only used during data collection, -to find and confirm the identity of interviewees, +Because identifying information is typically only used during data collection, +to find and confirm the identity of interviewees, de-identification should not affect the usability of the data. In fact, most identifying information can be converted into non-identified variables for analysis purposes -(e.g. GPS coordinates can be translated into distances). -However, if sensitive information is strictly needed for analysis, +(e.g. GPS coordinates can be translated into distances). +However, if sensitive information is strictly needed for analysis, the data must be encrypted while performing the tasks described in this chapter. \subsection{Correction of data entry errors} @@ -204,11 +202,11 @@ \subsection{Correction of data entry errors} You want to make sure the data set has a unique ID variable that can be cross-referenced with other records, such as the Master Data Set\sidenote{\url{https://dimewiki.worldbank.org/wiki/Master\_Data\_Set}} and other rounds of data collection. 
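Before moving on, it is worth running a quick uniqueness check in Stata itself.
The sketch below uses only built-in commands, and \texttt{hhid} is a placeholder for your project's ID variable;
the \texttt{ieduplicates} workflow described next adds a documented correction process on top of checks like these.

\begin{verbatim}
* Quick checks that the ID variable uniquely identifies observations.
* "hhid" is a placeholder ID variable name.
isid hhid                       // exits with an error if hhid is not unique
duplicates report hhid          // counts how many IDs appear more than once
duplicates tag hhid, gen(dup)   // flags duplicated IDs for inspection
list hhid if dup > 0
\end{verbatim}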
-\texttt{ieduplicates} and \texttt{iecompdup}, -two Stata commands included in the \texttt{iefieldkit} +\texttt{ieduplicates} and \texttt{iecompdup}, +two Stata commands included in the \texttt{iefieldkit} package\index{iefieldkit},\sidenote{\url{https://dimewiki.worldbank.org/wiki/iefieldkit}} create an automated workflow to identify, correct and document -occurrences of duplicate entries. +occurrences of duplicate entries. Looking for duplicated entries is usually part of data quality monitoring, as is the only other reason to change the raw data during cleaning: @@ -224,7 +222,7 @@ \subsection{Labeling and annotating the raw data} On average, making corrections to primary data is more time-consuming than when using secondary data. But you should always check for possible issues in any data you are about to use. The last step of data cleaning, however, will most likely still be necessary. -It consists of labeling and annotating the data, so that its users have all the +It consists of labeling and annotating the data, so that its users have all the information needed to interact with it. This is a key step to making the data easy to use, but it can be quite repetitive. The \texttt{iecodebook} command suite, also part of \texttt{iefieldkit}, @@ -234,7 +232,7 @@ \subsection{Labeling and annotating the raw data} We have a few recommendations on how to use this command for data cleaning. First, we suggest keeping the same variable names in the cleaned data set as in the survey instrument, so it's straightforward to link data points for a variable to the question that originated them. Second, don't skip the labeling. -Applying labels makes it easier to understand what the data is showing while exploring the data. +Applying labels makes it easier to understand what the data is showing while exploring the data. This minimizes the risk of small errors making their way through into the analysis stage. Variable and value labels should be accurate and concise.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning\#Applying\_Labels}} Third, recodes should be used to turn codes for ``Don't know'', ``Refused to answer'', and @@ -246,21 +244,21 @@ \subsection{Labeling and annotating the raw data} \subsection{Documenting data cleaning} -Throughout the data cleaning process, you will need inputs from the field, -including enumerator manuals, survey instruments, +Throughout the data cleaning process, you will need inputs from the field, +including enumerator manuals, survey instruments, supervisor notes, and data quality monitoring reports. These materials are essential for data documentation.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Documentation}} \index{Documentation} -They should be stored in the corresponding ``Documentation'' folder for easy access, +They should be stored in the corresponding ``Documentation'' folder for easy access, as you will probably need them during analysis, and they must be made available for publication. Include in the \texttt{Documentation} folder records of any corrections made to the data, including to duplicated entries, as well as communications from the field where theses issues are reported. -Be very careful not to include sensitive information in documentation that is not securely stored, +Be very careful not to include sensitive information in documentation that is not securely stored, or that you intend to release as part of a replication package or data publication. 
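As a concrete illustration of the labeling and recoding steps described above, the sketch below shows the basic pattern in Stata;
the variable names and numeric codes are hypothetical and would come from your own survey instrument.

\begin{verbatim}
* Illustrative labeling and recoding during cleaning.
* Variable names and survey codes are placeholders.
label variable water_source "Primary source of drinking water"
label define yesno 0 "No" 1 "Yes"
label values has_latrine yesno

* Turn non-response codes into extended missing values
replace income = .d if income == -88    // "Don't know"
replace income = .r if income == -99    // "Refused to answer"
\end{verbatim}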
-Another important component of data cleaning documentation are the results of +Another important component of data cleaning documentation are the results of As clean your data set, take the time to explore the variables in it. Use tabulations, histograms and density plots to understand the structure of data, and look for potentially problematic patterns such as outliers, @@ -272,7 +270,7 @@ \subsection{Documenting data cleaning} \subsection{The cleaned data set} -The main output of data cleaning is the cleaned data set. +The main output of data cleaning is the cleaned data set. It should contain the same information as the raw data set, with no changes to data points. It should also be easily traced back to the survey instrument, @@ -281,7 +279,7 @@ \subsection{The cleaned data set} i.e. per survey instrument. Each row in the cleaned data set represents one survey entry or unit of observation.\sidenote{\cite{tidy-data}} If the raw data set is very large, or the survey instrument is very complex, -you may want to break the data cleaning into sub-steps, +you may want to break the data cleaning into sub-steps, and create intermediate cleaned data sets (for example, one per survey module). Breaking cleaned data sets into the smallest unit of observation inside a roster @@ -290,7 +288,7 @@ \subsection{The cleaned data set} To make sure this file doesn't get too big to be handled, use commands such as \texttt{compress} in Stata to make sure the data is always stored in the most efficient format. -Once you have a cleaned, de-identified data set, and documentation to support it, +Once you have a cleaned, de-identified data set, and documentation to support it, you have created the first data output of your project: a publishable data set. The next chapter will get into the details of data publication. @@ -305,7 +303,7 @@ \section{Indicator construction} The second stage in the creation of analysis data is construction. Constructing variables means processing the data points as provided in the raw data to make them suitable for analysis. It is at this stage that the raw data is transformed into analysis data. -This is done by creating derived variables (dummies, indices, and interactions, to name a few), +This is done by creating derived variables (dummies, indices, and interactions, to name a few), as planned during research design, and using the pre-analysis plan as a guide. To understand why construction is necessary, let's take the example of a household survey's consumption module. @@ -313,29 +311,29 @@ \section{Indicator construction} If they did, it will then ask about quantities, units and expenditure for each item. However, it is difficult to run a meaningful regression on the number of cups of milk and handfuls of beans that a household consumed over a week. You need to manipulate them into something that has \textit{economic} meaning, -such as caloric input or food expenditure per adult equivalent. -During this process, the data points will typically be reshaped and aggregated -so that level of the data set goes from the unit of observation (one item in the bundle) in the survey to the unit of analysis (the household).\sidenote{\url{https://dimewiki.worldbank.org/wiki/Unit\_of\_Observation}} +such as caloric input or food expenditure per adult equivalent. 
+During this process, the data points will typically be reshaped and aggregated +so that level of the data set goes from the unit of observation (one item in the bundle) in the survey to the unit of analysis (the household).\sidenote{\url{https://dimewiki.worldbank.org/wiki/Unit\_of\_Observation}} \subsection{Why construction?} % From cleaning -Construction is done separately from data cleaning for two reasons. +Construction is done separately from data cleaning for two reasons. The first one is to clearly differentiate the data originally collected from the result of data processing decisions. -The second is to ensure that variable definition is consistent across data sources. -Unlike cleaning, construction can create many outputs from many inputs. -Let's take the example of a project that has a baseline and an endline survey. -Unless the two instruments are exactly the same, which is preferable but often not the case, the data cleaning for them will require different steps, and therefore will be done separately. +The second is to ensure that variable definition is consistent across data sources. +Unlike cleaning, construction can create many outputs from many inputs. +Let's take the example of a project that has a baseline and an endline survey. +Unless the two instruments are exactly the same, which is preferable but often not the case, the data cleaning for them will require different steps, and therefore will be done separately. However, you still want the constructed variables to be calculated in the same way, so they are comparable. To do this, you will at least two cleaning scripts, and a single one for construction -- we will discuss how to do this in practice in a bit. % From analysis -Ideally, indicator construction should be done right after data cleaning, according to the pre-analysis plan. +Ideally, indicator construction should be done right after data cleaning, according to the pre-analysis plan. In practice, however, following this principle is not always easy. As you analyze the data, different constructed variables will become necessary, as well as subsets and other alterations to the data. -Still, constructing variables in a separate script from the analysis will help you ensure consistency across different outputs. -If every script that creates a table starts by loading a data set, subsetting it and manipulating variables, any edits to construction need to be replicated in all scripts. +Still, constructing variables in a separate script from the analysis will help you ensure consistency across different outputs. +If every script that creates a table starts by loading a data set, subsetting it and manipulating variables, any edits to construction need to be replicated in all scripts. This increases the chances that at least one of them will have a different sample or variable definition. Therefore, even if construction ends up coming before analysis only in the order the code is run, it's important to think of them as different steps. @@ -350,7 +348,7 @@ \subsection{Construction tasks and how to approach them} Of course, whenever possible, having variable names that are both intuitive \textit{and} can be linked to the survey is ideal, but if you need to choose, prioritize functionality. Ordering the data set so that related variables are together and adding notes to each of them as necessary will also make your data set more user-friendly. -The most simple case of new variables to be created are aggregate indicators. 
+The most simple case of new variables to be created are aggregate indicators. For example, you may want to add a household's income from different sources into a single total income variable, or create a dummy for having at least one child in school. Jumping to the step where you actually create this variables seems intuitive, but it can also cause you a lot of problems, as overlooking details may affect your results. @@ -362,20 +360,20 @@ \subsection{Construction tasks and how to approach them} It's possible that your questionnaire asked respondents to report some answers as percentages and others as proportions, or that in one variable 0 means ``no'' and 1 means ``yes'', while in another one the same answers were coded are 1 and 2. We recommend coding yes/no questions as either 1/0 or TRUE/FALSE, so they can be used numerically as frequencies in means and as dummies in regressions. -Check that non-binary categorical variables have the same value-assignment, i.e., +Check that non-binary categorical variables have the same value-assignment, i.e., that labels and levels have the same correspondence across variables that use the same options. Finally, make sure that any numeric variables you are comparing are converted to the same scale or unit of measure. You cannot add one hectare and twos acres into a meaningful number. -During construction, you will also need to address some of the issues you identified in the data during data cleaning. +During construction, you will also need to address some of the issues you identified in the data during data cleaning. The most common of them is the presence of outliers. -How to treat outliers is a research question, but make sure to note what we the decision made by the research team, and how you came to it. +How to treat outliers is a research question, but make sure to note what we the decision made by the research team, and how you came to it. Results can be sensitive to the treatment of outliers, so keeping the original variable in the data set will allow you to test how much it affects the estimates. All these points also apply to imputation of missing values and other distributional patterns. The more complex construction tasks involve changing the structure of the data: -adding new observations or variables by merging data sets, +adding new observations or variables by merging data sets, and changing the unit of observation through collapses or reshapes. -There are always ways for things to go wrong that we never anticipated, but two issues to pay extra attention to are missing values and dropped observations. +There are always ways for things to go wrong that we never anticipated, but two issues to pay extra attention to are missing values and dropped observations. Merging, reshaping and aggregating data sets can change both the total number of observations and the number of observations with missing values. Make sure to read about how each command treats missing observations and, whenever possible, add automated checks in the script that throw an error message if the result is changing. If you are subsetting your data, drop observations explicitly, indicating why you are doing that and how the data set changed. @@ -396,9 +394,9 @@ \subsection{Documenting indicators construction} Because data construction involves translating concrete data points to more abstract measurements, it is important to document exactly how each variable is derived or calculated. 
Adding comments to the code explaining what you are doing and why is a crucial step both to prevent mistakes and to guarantee transparency. To make sure that these comments can be more easily navigated, it is wise to start writing a variable dictionary as soon as you begin making changes to the data. -Carefully record how specific variables have been combined, recoded, and scaled, and refer to those records in the code. +Carefully record how specific variables have been combined, recoded, and scaled, and refer to those records in the code. This can be part of a wider discussion with your team about creating protocols for variable definition, which will guarantee that indicators are defined consistently across projects. -When all your final variables have been created, you can use \texttt{iecodebook}'s \texttt{export} subcommand to list all variables in the data set, +When all your final variables have been created, you can use \texttt{iecodebook}'s \texttt{export} subcommand to list all variables in the data set, and complement it with the variable definitions you wrote during construction to create a concise meta data document. Documentation is an output of construction as relevant as the code and the data. Someone unfamiliar with the project should be able to understand the contents of the analysis data sets, the steps taken to create them, and the decision-making process through your documentation. @@ -413,21 +411,21 @@ \subsection{Constructed data sets} you may have one or multiple constructed data sets, depending on how your analysis is structured. So don't worry if you cannot create a single, ``canonical'' analysis data set. It is common to have many purpose-built analysis datasets. -Think of an agricultural intervention that was randomized across villages and only affected certain plots within each village. +Think of an agricultural intervention that was randomized across villages and only affected certain plots within each village. The research team may want to run household-level regressions on income, test for plot-level productivity gains, and check if village characteristics are balanced. -Having three separate datasets for each of these three pieces of analysis will result in much cleaner do files than if they all started from the same file. +Having three separate datasets for each of these three pieces of analysis will result in much cleaner do files than if they all started from the same file. %------------------------------------------------ \section{Writing data analysis code} % Intro -------------------------------------------------------------- -Data analysis is the stage when research outputs are created. +Data analysis is the stage when research outputs are created. 
\index{data analysis} Many introductions to common code skills and analytical frameworks exist, such as \textit{R for Data Science};\sidenote{\url{https://r4ds.had.co.nz/}} \textit{A Practical Introduction to Stata};\sidenote{\url{https://scholar.harvard.edu/files/mcgovern/files/practical\_introduction\_to\_stata.pdf}} -\textit{Mostly Harmless Econometrics};\sidenote{\url{https://www.researchgate.net/publication/51992844\_Mostly\_Harmless\_Econometrics\_An\_Empiricist's\_Companion}} +\textit{Mostly Harmless Econometrics};\sidenote{\url{https://www.researchgate.net/publication/51992844\_Mostly\_Harmless\_Econometrics\_An\_Empiricist's\_Companion}} and \textit{Causal Inference: The Mixtape}.\sidenote{\url{http://scunning.com/mixtape.html}} This section will not include instructions on how to conduct specific analyses. That is a research question, and requires expertise beyond the scope of this book. @@ -437,11 +435,11 @@ \section{Writing data analysis code} \subsection{Organizing analysis code} The analysis stage usually starts with a process we call exploratory data analysis. -This is when you are trying different things and looking for patterns in your data. +This is when you are trying different things and looking for patterns in your data. It progresses into final analysis when your team starts to decide what are the main results, those that will make it into the research output. The way you deal with code and outputs for exploratory and final analysis is different, and this section will discuss how. -During exploratory data analysis, you will be tempted to write lots of analysis into one big, impressive, start-to-finish script. -It subtly encourages poor practices such as not clearing the workspace and not reloading the constructed data set before each analysis task. +During exploratory data analysis, you will be tempted to write lots of analysis into one big, impressive, start-to-finish script. +It subtly encourages poor practices such as not clearing the workspace and not reloading the constructed data set before each analysis task. It's important to take the time to organize scripts in a clean manner and to avoid mistakes. A well-organized analysis script starts with a completely fresh workspace and explicitly loads data before analyzing it. @@ -453,10 +451,10 @@ \subsection{Organizing analysis code} Analysis files should be as simple as possible, so whoever is reading it can focus on the econometrics. All research decisions should be made very explicit in the code. -This includes clustering, sampling, and control variables, to name a few. +This includes clustering, sampling, and control variables, to name a few. If you have multiple analysis data sets, each of them should have a descriptive name about its sample and unit of observation. As your team comes to a decision about model specification, you can create globals or objects in the master script to use across scripts. -This is a good way to make sure specifications are consistent throughout the analysis. +This is a good way to make sure specifications are consistent throughout the analysis. Using pre-specified globals or objects also makes your code more dynamic, so it is easy to update specifications and results without changing every script. It is completely acceptable to have folders for each task, and compartmentalize each analysis as much as needed. 
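As a minimal sketch of these principles, the fragment below shows an analysis script that loads the constructed data explicitly and takes its specification from globals set in the master script;
the data set, variable names, and globals are placeholders rather than a prescribed setup.

\begin{verbatim}
* Defined once in the master script so every analysis uses the same specification:
global controls "age hh_size education"
global se_spec  "vce(cluster village_id)"

* Each analysis script loads the constructed data explicitly before using it:
use "${final}/hh_analysis.dta", clear
regress income treatment ${controls}, ${se_spec}
\end{verbatim}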
@@ -464,9 +462,9 @@ \subsection{Exporting outputs} To accomplish this, you will need to make sure that you have an effective data management system, including naming, file organization, and version control. Just like you did with each of the analysis datasets, name each of the individual analysis files descriptively. -Code files such as \path{spatial-diff-in-diff.do}, \path{matching-villages.R}, and \path{summary-statistics.py} +Code files such as \path{spatial-diff-in-diff.do}, \path{matching-villages.R}, and \path{summary-statistics.py} are clear indicators of what each file is doing, and allow you to find code quickly. -If you intend to numerically order the code as they appear in a paper or report, +If you intend to numerically order the code as they appear in a paper or report, leave this to near publication time. % Self-promotion ------------------------------------------------ @@ -486,22 +484,22 @@ \subsection{Exporting outputs} but it is well worth reviewing the graphics manual\sidenote{\url{https://www.stata.com/manuals/g.pdf}} For an easier way around it, Gray Kimbrough's \textit{Uncluttered Stata Graphs} code is an excellent default replacement for Stata graphics that is easy to install. \sidenote{\url{https://graykimbrough.github.io/uncluttered-stata-graphs/}} -If you are a R user, the \textit{R Graphics Cookbook}\sidenote{\url{https://r-graphics.org/}} -is a great resource for the most popular visualization package \texttt{ggplot}\sidenote{\url{https://ggplot2.tidyverse.org/}}. -But there are a variety of other visualization packages, -such as \texttt{highcharter}\sidenote{\url{http://jkunst.com/highcharter/}}, -\texttt{r2d3}\sidenote{\url{https://rstudio.github.io/r2d3/}}, -\texttt{leaflet}\sidenote{\url{https://rstudio.github.io/leaflet/}}, +If you are a R user, the \textit{R Graphics Cookbook}\sidenote{\url{https://r-graphics.org/}} +is a great resource for the most popular visualization package \texttt{ggplot}\sidenote{\url{https://ggplot2.tidyverse.org/}}. +But there are a variety of other visualization packages, +such as \texttt{highcharter}\sidenote{\url{http://jkunst.com/highcharter/}}, +\texttt{r2d3}\sidenote{\url{https://rstudio.github.io/r2d3/}}, +\texttt{leaflet}\sidenote{\url{https://rstudio.github.io/leaflet/}}, and \texttt{plotly}\sidenote{\url{https://plot.ly/r/}}, to name a few. We have no intention of creating an exhaustive list, and this one is certainly missing very good references. But at least it is a place to start. \section{Exporting analysis outputs} -Our team has created a few products to automate common outputs and save you +Our team has created a few products to automate common outputs and save you precious research time. The \texttt{ietoolkit} package includes two commands to export nicely formatted tables. -\texttt{iebaltab}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Iebaltab}} creates and exports balance tables to excel or {\LaTeX}. +\texttt{iebaltab}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Iebaltab}} creates and exports balance tables to excel or {\LaTeX}. \texttt{ieddtab}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Ieddtab}} does the same for difference-in-differences regressions. The \textbf{Stata Visual Library}\sidenote{ \url{https://worldbank.github.io/Stata-IE-Visual-Library/}} has examples of graphs created in Stata and curated by us.\sidenote{A similar resource for R is \textit{The R Graph Gallery}. 
\\\url{https://www.r-graph-gallery.com/}} @@ -510,42 +508,42 @@ \section{Exporting analysis outputs} We attribute some of this to the difficulty of writing code to create them. Making a visually compelling graph would already be hard enough if you didn't have to go through many rounds of googling to understand a command. The trickiest part of using plot commands is to get the data in the right format. -This is why the \textbf{Stata Visual Library} includes example data sets to use +This is why the \textbf{Stata Visual Library} includes example data sets to use with each do-file. -It's ok to not export each and every table and graph created during exploratory analysis. +It's ok to not export each and every table and graph created during exploratory analysis. Final analysis scripts, on the other hand, should export final outputs, which are ready to be included to a paper or report. -No manual edits, including formatting, should be necessary after exporting final outputs -- -those that require copying and pasting edited outputs, in particular, are absolutely not advisable. -Manual edits are difficult to replicate, and you will inevitably need to make changes to the outputs. -Automating them will save you time by the end of the process. +No manual edits, including formatting, should be necessary after exporting final outputs -- +those that require copying and pasting edited outputs, in particular, are absolutely not advisable. +Manual edits are difficult to replicate, and you will inevitably need to make changes to the outputs. +Automating them will save you time by the end of the process. However, don't spend too much time formatting tables and graphs until you are ready to publish.\sidenote{For a more detailed discussion on this, including different ways to export tables from Stata, see \url{https://github.com/bbdaniels/stata-tables}} -Polishing final outputs can be a time-consuming process, +Polishing final outputs can be a time-consuming process, and you want to it as few times as possible. We cannot stress this enough: don't ever set a workflow that requires copying and pasting results. Copying results from excel to word is error-prone and inefficient. Copying results from a software console is risk-prone, even more inefficient, and unnecessary. -There are numerous commands to export outputs from both R and Stata to a myriad of formats.\sidenote{Some examples are \href{ http://repec.sowi.unibe.ch/stata/estout/}{\texttt{estout}}, \href{https://www.princeton.edu/~otorres/Outreg2.pdf}{\texttt{outreg2}}, -and \href{https://www.benjaminbdaniels.com/stata-code/outwrite/}{\texttt{outwrite}} in Stata, +There are numerous commands to export outputs from both R and Stata to a myriad of formats.\sidenote{Some examples are \href{ http://repec.sowi.unibe.ch/stata/estout/}{\texttt{estout}}, \href{https://www.princeton.edu/~otorres/Outreg2.pdf}{\texttt{outreg2}}, +and \href{https://www.benjaminbdaniels.com/stata-code/outwrite/}{\texttt{outwrite}} in Stata, and \href{https://cran.r-project.org/web/packages/stargazer/vignettes/stargazer.pdf}{\texttt{stargazer}} and \href{https://ggplot2.tidyverse.org/reference/ggsave.html}{\texttt{ggsave}} in R.} Save outputs in accessible and, whenever possible, lightweight formats. Accessible means that it's easy for other people to open them. 
-In Stata, that would mean always using \texttt{graph export} to save images as \texttt{.jpg}, \texttt{.png}, \texttt{.pdf}, etc., 
+In Stata, that would mean always using \texttt{graph export} to save images as \texttt{.jpg}, \texttt{.png}, \texttt{.pdf}, etc.,
 instead of \texttt{graph save}, which creates a \texttt{.gph} file that can only be opened through a Stata installation.
 Some publications require ``lossless'' TIFF or EPS files, which are created by specifying the desired extension.
 Whichever format you decide to use, remember to always specify the file extension explicitly.
 
 For tables there are fewer options and more considerations to be made.
-Exporting table to \texttt{.tex} should be preferred.
-Excel \texttt{.xlsx} and \texttt{.csv} are also commonly used,
+Exporting tables to \texttt{.tex} should be preferred.
+Excel \texttt{.xlsx} and \texttt{.csv} are also commonly used,
 but require the extra step of copying the tables into the final output.
-The amount of work needed in a copy-paste workflow increases rapidly with the number of tables and figures included in a research output,
+The amount of work needed in a copy-paste workflow increases rapidly with the number of tables and figures included in a research output,
 and so do the chances of having the wrong version of a result in your paper or report.
 
 % Formatting
-If you need to create a table with a very particular format, that is not automated by any command you know, consider writing the it manually
+If you need to create a table with a very particular format that is not automated by any command you know, consider writing it manually
 (Stata's \texttt{file write}, for example, allows you to do that).
 This will allow you to write a cleaner script that focuses on the econometrics, and not on complicated commands to create and append intermediate matrices.
 To avoid cluttering your scripts with formatting and ensure that formatting is consistent across outputs,
@@ -556,7 +554,7 @@ \section{Exporting analysis outputs}
 Make sure labels and notes cover all relevant information, such as sample, unit of observation, unit of measurement and variable definition.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Checklist:\_Reviewing\_Graphs} \\ \url{https://dimewiki.worldbank.org/wiki/Checklist:\_Submit\_Table}}
 If you follow the steps outlined in this chapter, most of the data work involved in the last step of the research process -- publication -- will already be done.
-If you used de-identified data for analysis, publishing the cleaned data set in a trusted repository will allow you to cite your data.
+If you used de-identified data for analysis, publishing the cleaned data set in a trusted repository will allow you to cite your data.
 Some of the documentation produced during cleaning and construction can be published even if your data is too sensitive to be published.
 Your analysis code will be organized in a reproducible way, so all you will need to do to release a replication package is a last round of code review.
 This will allow you to focus on what matters: writing up your results into a compelling story.
diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex
index ffcc7a861..149711319 100644
--- a/chapters/handling-data.tex
+++ b/chapters/handling-data.tex
@@ -2,29 +2,28 @@
 
 \begin{fullwidth}
 
-	Development research does not just \textit{involve} real people -- it also \textit{affects} real people.
+	Development research does not just involve real people -- it also affects real people.
Policy decisions are made every day using the results of briefs and studies, and these can have wide-reaching consequences on the lives of millions. As the range and importance of the policy-relevant questions asked by development researchers grow, so too does the (rightful) scrutiny under which methods and results are placed. It is useful to think of research as a public service, one that requires you to be accountable to both research participants and research consumers. - On the research participant side, - it is essential to respect individual \textbf{privacy} and ensure \textbf{data security}. + it is essential to respect individual privacy and ensure data security. Researchers look deeply into real people's personal lives, financial conditions, and other sensitive subjects. Respecting the respondents' right to privacy, by intelligently assessing and proactively averting risks they might face, is a core tenet of research ethics. On the consumer side, it is important to protect confidence in development research - by following modern practices for \textbf{transparency} and \textbf{reproducibility}. - Across the social sciences, the open science movement has been fueled by discoveries of low-quality research practices, + by following modern practices for transparency and reproducibility. + + Across the social sciences, the open science movement has been fueled by discoveries of low-quality research practices, data and code that are inaccessible to the public, analytical errors in major research papers, and in some cases even outright fraud. While the development research community has not yet experienced any major scandals, it has become clear that there are necessary incremental improvements in the way that code and data are handled as part of research. - - Neither privacy nor transparency is an ``all-or-nothing'' objective: + Neither privacy nor transparency is an all-or-nothing objective: the most important thing is to report the transparency and privacy measures you have taken and always strive to to the best that you are capable of with current technology. In this chapter, we outline a set of practices that help to ensure diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index c6e6121c1..0a2a76d55 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -14,18 +14,17 @@ It's okay to update this data map once the project is underway. The point is that everyone knows -- at any given time -- what the plan is. -To do data work effectively in a team environment, +To do data work effectively in a team environment, you will need to prepare collaborative tools and workflows. Changing software or protocols halfway through a project can be costly and time-consuming, so it's important to plan ahead. -Seemingly small decisions such as sharing services, folder structures, +Seemingly small decisions such as sharing services, folder structures, and filenames can be extremely painful to alter down the line in any project. Similarly, making sure to set up a self-documenting discussion platform -and process for version control; +and process for version control; this makes working together on outputs much easier from the very first discussion. - -This chapter will guide you on preparing a collaborative work environment, -and structuring your data work to be well-organized and clearly documented. 
+This chapter will guide you on preparing a collaborative work environment, +and structuring your data work to be well-organized and clearly documented. \end{fullwidth} diff --git a/chapters/publication.tex b/chapters/publication.tex index 7d5344a93..7175eadbe 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -29,7 +29,7 @@ to spend days re-numbering references (and it can take days) when a small amount of up-front effort can automate the task. In this section we suggest several methods -- -collectively referred to as ``dynamic documents'' -- +collectively referred to as dynamic documents -- for managing the process of collaboration on any technical product. For most research projects, completing a manuscript is not the end of the task. @@ -40,10 +40,9 @@ and better understand the results you have obtained. Holding code and data to the same standards a written work is a new practice for many researchers. - -In this chapter, we first discuss tools and workflows for collaborating on technical writing. -Next, we turn to publishing data, -noting that the data can itself be a significant contribution in addition to analytical results. +In this chapter, we first discuss tools and workflows for collaborating on technical writing. +Next, we turn to publishing data, +noting that the data can itself be a significant contribution in addition to analytical results. Finally, we provide guidelines that will help you to prepare a functioning and informative replication package. In all cases, we note that technology is rapidly evolving and that the specific tools noted here may not remain cutting-edge, @@ -269,7 +268,7 @@ \section{Publishing primary data} in addition to any publication of analysis results. Publishing data can foster collaboration with researchers interested in the same subjects as your team. -Collaboration can enable your team to fully explore variables and +Collaboration can enable your team to fully explore variables and questions that you may not have time to focus on otherwise, even though data was collected on them. There are different options for data publication. @@ -292,7 +291,7 @@ \section{Publishing primary data} There will almost always be a trade-off between accuracy and privacy. For publicly disclosed data, you should favor privacy. -Therefore, before publishing data, +Therefore, before publishing data, you should carefully perform a \textbf{final de-identification}. Its objective is to create a dataset for publication that cannot be manipulated or linked to identify any individual research participant. @@ -307,13 +306,13 @@ \section{Publishing primary data} There are a number of tools developed to help researchers de-identify data and which you should use as appropriate at that stage of data collection. These include \texttt{PII\_detection}\sidenote{ - \url{https://github.com/PovertyAction/PII\_detection}} + \url{https://github.com/PovertyAction/PII\_detection}} from IPA, \texttt{PII-scan}\sidenote{ - \url{https://github.com/J-PAL/PII-Scan}} + \url{https://github.com/J-PAL/PII-Scan}} from JPAL, and \texttt{sdcMicro}\sidenote{ - \url{https://sdcpractice.readthedocs.io/en/latest/sdcMicro.html}} + \url{https://sdcpractice.readthedocs.io/en/latest/sdcMicro.html}} from the World Bank. 
\index{anonymization} The \texttt{sdcMicro} tool, in particular, has a feature @@ -415,19 +414,19 @@ \subsection{Publishing data for replication} If possible, you should publish both a clean version of the data which corresponds exactly to the original database or questionnaire as well as the constructed or derived dataset used for analysis. -You should also release the code +You should also release the code that constructs any derived measures, particularly where definitions may vary, so that others can learn from your work and adapt it as they like. -As in the case of raw primary data, +As in the case of raw primary data, final analysis data sets that will become public for the purpose of replication must also be fully de-identified. In cases where PII data is required for analysis, we recommend embargoing the sensitive variables when publishing the data. You should contact an appropriate data catalog to determine what privacy and licensing options are available. -Access to the embargoed data could be granted for the purposes of study replication, +Access to the embargoed data could be granted for the purposes of study replication, if approved by an IRB. \subsection{Publishing code for replication} diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 32d121a1b..b92b60fe0 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -9,7 +9,7 @@ Without going into too much technical detail, as there are many excellent resources on impact evaluation design, this section presents a brief overview -of the most common ``causal inference'' methods, +of the most common causal inference methods, focusing on implications for data structure and analysis. The intent of this chapter is for you to obtain an understanding of the way in which each method constructs treatment and control groups, @@ -33,11 +33,10 @@ in response to an unexpected event. Intuitive knowledge of your project's chosen approach will make you much more effective at the analytical part of your work. - -This chapter first covers causal inference methods. -Next we discuss how to measure treatment effects and structure data for specific methods, -including: cross-sectional randomized control trials, difference-in-difference designs, -regression discontinuity, instrumental variables, matching, and synthetic controls. +This chapter first covers causal inference methods. +Next we discuss how to measure treatment effects and structure data for specific methods, +including: cross-sectional randomized control trials, difference-in-difference designs, +regression discontinuity, instrumental variables, matching, and synthetic controls. \end{fullwidth} From 505514fea1db8f89c663892956681f37f0013136 Mon Sep 17 00:00:00 2001 From: Maria Date: Tue, 11 Feb 2020 17:39:03 -0500 Subject: [PATCH 651/854] Add conclusion paragraph to intro --- chapters/introduction.tex | 27 +++++++++++++++++++++++++++ 1 file changed, 27 insertions(+) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index 46eacfa20..8f1e55435 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -190,5 +190,32 @@ \section{Writing reproducible code in a collaborative environment} we provide our guidance on this in the DIME Analytics Stata Style Guide Appendix. +The book proceeds as follows: +In Chapter 1, we outline a set of practices that help to ensure +research participants are appropriately protected and +research consumers can be confident in the conclusions reached. 
+Chapter 2 will teach you to structure your data work to be efficient, +collaborative and reproducible. +In Chapter 3, we turn to research design, +focusing specifically on how to measure treatment effects +and structure data for common experimental and quasi-experimental research methods. +Chapter 4 concerns sampling and randomization: +how to implement both simple and complex designs reproducibly, +and how to use power calculations and randomization inference +to critically and quantitatively assess +sampling and randomization designs to make optimal choices when planning studies. +Chapter 5 covers data acquisition. We start with +the legal and institutional frameworks for data ownership and licensing, +dive in depth on collecting high-quality survey data, +and finally discuss secure data handling during transfer, sharing, and storage. +Chapter 6 teaches reproducible and transparent workflows for data processing and analysis, +and provides guidance on de-identification of personally-identified data. +In Chapter 7, we turn to publication. You will learn +how to effectively collaborate on technical writing, +how and why to publish data, +and guidelines for preparing functional and informative replication packages. +We hope that by the end of the book, +you will have learned how to handle data more efficiently, effectively and ethically +at all stages of the research process. \mainmatter From 51c4029f120dcc6c72cd1048e6583d7fbfc07675 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 11 Feb 2020 18:11:06 -0500 Subject: [PATCH 652/854] Suggest introduction changes --- chapters/data-analysis.tex | 448 ++++++++++++++++++------------------- 1 file changed, 224 insertions(+), 224 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 24431f55f..74485fc35 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -1,28 +1,28 @@ %------------------------------------------------ \begin{fullwidth} -Transforming raw data into a substantial contribution to scientific knowledge -requires a mix of subject expertise, programming skills, -and statistical and econometric knowledge. -The process of data analysis is, therefore, -a back-and-forth discussion between people -with differing skill sets. -The research assistant usually ends up being the pivot of this discussion. +Transforming raw data into a substantial contribution to scientific knowledge +requires a mix of subject expertise, programming skills, +and statistical and econometric knowledge. +The process of data analysis is, therefore, +a back-and-forth discussion between people +with differing skill sets. +The research assistant usually ends up being the pivot of this discussion. It is their job to translate the data received from the field into -economically meaningful indicators and to analyze them +economically meaningful indicators and to analyze them while making sure that code and outputs do not become too difficult to follow or get lost over time. This can be a complex process. -When it comes to code, though, analysis is the easy part, -\textit{as long as you have organized your data well}. -Of course, there is plenty of complexity behind it: +When it comes to code, though, analysis is substantially less intricate than data cleaning, +as long as you have already organized your data well. +Of course, there is plenty of complexity behind it: the econometrics, the theory of change, the measurement methods, and so much more. -But none of those are the subject of this book. 
\textit{Instead, this chapter will focus on how to organize your data work so that coding the analysis becomes easy}.
Most of a Research Assistant's time is spent cleaning data and getting it into the right format.
+However, none of those are the subject of this book,
+and most statistical software bundles complex functions into simple expressions.
+This chapter will focus on how to organize your data work so that coding the analysis is simple.
+Most of a research assistant's time is spent cleaning data and getting it into the right format.
 When the practices recommended here are adopted,
-analyzing the data is as simple as using a command that is already implemented in a statistical software.
-
+analyzing the data is as simple as using a command that is already implemented in a statistical software.
 \end{fullwidth}


%------------------------------------------------

\section{Data management}

The goal of data management is to organize the components of data work
so it can be traced back and revised without massive effort.
In our experience, there are four key elements to good data management:
folder structure, task breakdown, master scripts, and version control.
A good folder structure organizes files so that any material can be found when needed.
It reflects a task breakdown into steps with well-defined inputs, tasks, and outputs.
This breakdown is applied to code, data sets, and outputs.
A master script connects folder structure and code.
It is a one-file summary of your whole project.
Finally, version histories and backups enable the team
to edit files without fear of losing information.
Smart use of version control also allows you to track
how each edit affects other files in the project.

\subsection{Folder structure}

There are many ways to organize research data.
Our preferred scheme reflects the task breakdown that will be outlined in this chapter.
\index{data organization}
Our team at DIME Analytics developed the \texttt{iefolder}\sidenote{
	\url{https://dimewiki.worldbank.org/wiki/iefolder}}
package (part of \texttt{ietoolkit}\sidenote{
	\url{https://dimewiki.worldbank.org/wiki/ietoolkit}})
to automate the creation of a folder following this scheme and to standardize folder structures across teams and projects.
Standardizing folders greatly reduces the costs that PIs and RAs face when switching between projects,
because they are organized in the same way.\sidenote{
	\url{https://dimewiki.worldbank.org/wiki/DataWork\_Folder}}
We created the command based on our experience with primary data,
but it can be used for different types of data, and adapted to fit different needs.
No matter what your team's preferences are in terms of folder organization,
the principle of creating one standard remains.

At the first level of the structure created by \texttt{iefolder} are what we call survey round folders.\sidenote{
	\url{https://dimewiki.worldbank.org/wiki/DataWork\_Survey\_Round}}
You can think of a ``round'' as one source of data,
that will be cleaned in the same script.
Inside round folders, there are dedicated folders for
raw (encrypted) data; de-identified data; cleaned data; and final (constructed) data.
There is a folder for raw results, as well as for final outputs.
The folders that hold code are organized in parallel to these,
so that the progression through the whole project can be followed by anyone new to the team.
Additionally, \texttt{iefolder} creates \textbf{master do-files}\sidenote{
	\url{https://dimewiki.worldbank.org/wiki/Master\_Do-files}}
so all project code is reflected in a top-level script.

\subsection{Task breakdown}

We divide the data work process that starts from the raw data
and builds on it to create final analysis outputs into three stages:
data cleaning, variable construction, and data analysis.
Though they are frequently implemented at the same time,
we find that creating separate scripts and data sets prevents mistakes.
It will be easier to understand this division as we discuss what each stage comprises.
What you should know for now is that each of these stages has well-defined inputs and outputs.
This makes it easier to track tasks across scripts,
and avoids duplication of code that could lead to inconsistent results.
For each stage, there should be a code folder and a corresponding data set.
The names of code files, data sets and outputs for each stage should be consistent,
making clear how they relate to one another.
So, for example, a script called \texttt{clean-section-1}
would create a data set called \texttt{cleaned-section-1}.


%------------------------------------------------

\subsection{Master scripts}

Master scripts allow users to execute all the project code from a single file.
They briefly describe what each script does,
and map the files they require and create.
They also connect code and folder structure through macros or objects.
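To make this concrete, below is a minimal sketch of what a master do-file might look like in Stata.
The folder globals, user names, and script names are hypothetical placeholders for illustration,
not part of any DIME template:

\begin{verbatim}
* master.do -- one-file summary of the project's data work (hypothetical names)

* Each team member points the root global to their own copy of the project folder
  if c(username) == "researcher1" global projectfolder "C:/Users/researcher1/Dropbox/project"
  if c(username) == "researcher2" global projectfolder "/Users/researcher2/Dropbox/project"

* Folder globals used by every script in the project
  global raw     "${projectfolder}/data/raw"
  global cleaned "${projectfolder}/data/cleaned"
  global final   "${projectfolder}/data/final"
  global outputs "${projectfolder}/outputs"

* Run each stage in order (script names are illustrative)
  do "${projectfolder}/code/cleaning/clean-baseline.do"
  do "${projectfolder}/code/construction/construct-household.do"
  do "${projectfolder}/code/analysis/summary-statistics.do"
\end{verbatim}

Because every script refers to \texttt{\$\{projectfolder\}} rather than to an absolute path,
a new team member only needs to edit one line to run the entire project.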
In short, a master script is a human-readable map to the tasks,
files and folder structure that comprise a project.
Having a master script eliminates the need for complex instructions to replicate results.
Reading it should be enough for anyone unfamiliar with the project
to understand what the main tasks are, which scripts execute them,
and where different files can be found in the project folder.
That is, it should contain all the information needed to interact with a project's data work.

\subsection{Version control}

Finally, establishing a version control system is an incredibly useful
and important step for documentation, collaboration and conflict-solving.
Version control allows you to effectively track code edits,
including the addition and deletion of files.
This way you can delete code you no longer need,
and still recover it easily if you ever need to get back to previous work.
Everything that can be version-controlled should be.
Both analysis results and data sets will change with the code.
Whenever possible, you should track each of them with the code that created it.
If you are writing code in Git/GitHub,
you can export tables
and metadata saved in \texttt{.txt} or \texttt{.csv} to that directory.
Binary files that compile the tables,
as well as the complete data sets, on the other hand,
should be stored in your team's shared folder.
Whenever data cleaning or data construction code is edited,
use the master script to run all the code for your project.
Git will then highlight the changes to data sets and results that those edits entail.

%------------------------------------------------

\section{De-identification}

It should contain only materials that are received directly from the field.
They will invariably come in a host of file formats and nearly always contain personally-identifying information.\index{personally-identifying information}
These files should be retained in the raw data folder \textit{exactly as they were received}.
Be mindful of where they are stored.
Maintain a backup copy in a secure offsite location.
Every other file is created from the raw data, and therefore can be recreated.
The exception, of course, is the raw data itself, so it should never be edited directly.
The rare and only case when the raw data can be edited directly is
when it is encoded incorrectly and some non-English character is causing rows or columns to break at the wrong place
when the data is imported.
In this scenario, you will have to remove the special character manually,
save the resulting data set \textit{in a new file}
and securely back up \textit{both} the broken and the fixed version of the raw data.

Note that no one who is not listed in the IRB should be able to access confidential data,
not even the company providing file-sharing services.
Check if your organization has guidelines on how to store data securely, as they may offer an institutional solution.
If that is not the case, you will need to encrypt the data, especially before sharing it,
and make sure that only IRB-listed team members have the encryption key.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Encryption}}

Loading encrypted data frequently can be disruptive to the workflow.
To facilitate the handling of the data,
remove any personally identifiable information from the data set.
This will create a de-identified data set, which can be saved in a non-encrypted folder.
De-identification,\sidenote{\url{https://dimewiki.worldbank.org/wiki/De-identification}} at this stage,
means stripping the data set of direct identifiers.\sidenote{\url{
https://www.povertyactionlab.org/sites/default/files/resources/J-PAL-guide-to-deidentifying-data.pdf}}
To be able to do so, you will need to go through your data set and find all the variables that contain identifying information.
Flagging all potentially identifying variables in the questionnaire design stage
simplifies the initial de-identification process.
If you haven't done that, there are a few tools that can help you with it.
JPAL's \texttt{PII scan}, as indicated by its name,
scans variable names and labels for common string patterns associated with identifying information.
The \texttt{iefieldkit} command \texttt{iecodebook} lists all variables in a data set
and exports an Excel sheet where you can easily select which variables to keep or drop.\sidenote{
	\url{https://dimewiki.worldbank.org/wiki/Iecodebook}}
-If PII variables are strictly required for the analysis itself, +If PII variables are strictly required for the analysis itself, it will be necessary to keep at least a subset of the data encrypted through the data analysis process. -If the answer is yes to either of these questions, -all you need to do is write a script to drop the variables that are not required for analysis, -encode or otherwise mask those that are required, +If the answer is yes to either of these questions, +all you need to do is write a script to drop the variables that are not required for analysis, +encode or otherwise mask those that are required, and save a working version of the data. The resulting de-identified data will be the underlying source for all cleaned and constructed data. This is the data set that you will interact with directly during the remaining tasks described in this chapter. -Because identifying information is typically only used during data collection, -to find and confirm the identity of interviewees, +Because identifying information is typically only used during data collection, +to find and confirm the identity of interviewees, de-identification should not affect the usability of the data. \section{Data cleaning} Data cleaning is the second stage in the transformation of data you received from the field into data that you can analyze.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning}} -The cleaning process involves (1) making the data set easily usable and understandable, +The cleaning process involves (1) making the data set easily usable and understandable, and (2) documenting individual data points and patterns that may bias the analysis. The underlying data structure does not change. The cleaned data set should contain only the variables collected in the field. @@ -240,11 +240,11 @@ \subsection{Correcting data entry errors} You want to make sure the data set has a unique ID variable that can be cross-referenced with other records, such as the Master Data Set\sidenote{\url{https://dimewiki.worldbank.org/wiki/Master\_Data\_Set}} and other rounds of data collection. -\texttt{ieduplicates} and \texttt{iecompdup}, -two Stata commands included in the \texttt{iefieldkit} +\texttt{ieduplicates} and \texttt{iecompdup}, +two Stata commands included in the \texttt{iefieldkit} package\index{iefieldkit},\sidenote{\url{https://dimewiki.worldbank.org/wiki/iefieldkit}} create an automated workflow to identify, correct and document -occurrences of duplicate entries. +occurrences of duplicate entries. As discussed in the previous chapter, looking for duplicated entries is usually part of data quality monitoring, @@ -263,9 +263,9 @@ \subsection{Labeling and annotating the raw data} On average, making corrections to primary data is more time-consuming than when using secondary data. But you should always check for possible issues in any data you are about to use. -The last step of data cleaning, however, +The last step of data cleaning, however, will most likely be necessary no matter what type of data is involved. -It consists of labeling and annotating the data, +It consists of labeling and annotating the data, so that its users have all the information needed to interact with it. This is a key step to making the data easy to use, but it can be quite repetitive. 
The \texttt{iecodebook} command suite, also part of \texttt{iefieldkit},
is designed to make some of the most repetitive cleaning tasks more efficient.
\index{iecodebook}
We have a few recommendations on how to use this command,
and how to approach data cleaning in general.
First, we suggest keeping the same variable names in the cleaned data set as in the survey instrument,
so it's straightforward to link data points for a variable to the question that originated them.
Second, don't skip the labeling.
Applying labels makes it easier to understand what the data mean as you explore it,
Open-ended responses stored as strings usually have a high risk of being identifiers,
so they should be dropped at this point.
You can use the encrypted data as an input to a construction script
that categorizes these responses and merges them to the rest of the dataset.
Finally, any additional information collected only for quality monitoring purposes,
such as notes and duration fields, can also be dropped.

\subsection{Documenting data cleaning}

Throughout the data cleaning process, you will need inputs from the field,
including enumerator manuals, survey instruments,
supervisor notes, and data quality monitoring reports.
These materials are essential for data documentation.\sidenote{
	\url{https://dimewiki.worldbank.org/wiki/Data\_Documentation}}
\index{Documentation}
They should be stored in the corresponding ``Documentation'' folder for easy access,
as you will probably need them during analysis, and they must be made available for publication.
Include in the \texttt{Documentation} folder records of any corrections made to the data,
including to duplicated entries,
as well as communications from the field where these issues are reported.
Be very careful not to include sensitive information in documentation that is not securely stored,
or that you intend to release as part of a replication package or data publication.

Another important component of data cleaning documentation is the results of data exploration.

\subsection{The cleaned data set}

The main output of data cleaning is the cleaned data set.
It should contain the same information as the raw data set,
with no changes to data points.
It should also be easily traced back to the survey instrument,
i.e. per survey instrument.
Each row in the cleaned data set represents one survey entry or unit of observation.\sidenote{\cite{tidy-data}}
If the raw data set is very large, or the survey instrument is very complex,
you may want to break the data cleaning into sub-steps,
and create intermediate cleaned data sets (for example, one per survey module).
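For instance, the cleaning sub-step for one module might end with a couple of final checks
before saving its intermediate output.
The ID variable, data label, and file path in this sketch are hypothetical:

\begin{verbatim}
* Final checks before saving one intermediate cleaned module (names hypothetical)
  isid hh_id                // exactly one row per household in this module
  label data "Household consumption module, cleaned (hypothetical example)"
  save "${cleaned}/cleaned-module-consumption.dta", replace
\end{verbatim}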
When dealing with complex surveys with multiple nested groups, @@ -340,7 +340,7 @@ \subsection{The cleaned data set} To make sure the cleaned data set file doesn't get too big to be handled, use commands such as \texttt{compress} in Stata to make sure the data is always stored in the most efficient format. -Once you have a cleaned, de-identified data set, and documentation to support it, +Once you have a cleaned, de-identified data set, and documentation to support it, you have created the first data output of your project: a publishable data set. The next chapter will get into the details of data publication. @@ -355,52 +355,52 @@ \section{Constructing final indicators} The third stage in the creation of analysis data is construction. Constructing variables means processing the data points as provided in the raw data to make them suitable for analysis. It is at this stage that the raw data is transformed into analysis data. -This is done by creating derived variables (dummies, indices, and interactions, to name a few), -as planned during research design\index{Research design}, +This is done by creating derived variables (dummies, indices, and interactions, to name a few), +as planned during research design\index{Research design}, and using the pre-analysis plan as a guide.\index{Pre-analysis plan} To understand why construction is necessary, let's take the example of a household survey's consumption module. -For each item in a context-specific bundle, +For each item in a context-specific bundle, this module will ask whether the household consumed any of it over a certain period of time. If they did, it will then ask about quantities, units and expenditure for each item. -However, it is difficult to run a meaningful regression +However, it is difficult to run a meaningful regression on the number of cups of milk and handfuls of beans that a household consumed over a week. You need to manipulate them into something that has \textit{economic} meaning, -such as caloric input or food expenditure per adult equivalent. -During this process, the data points will typically be reshaped and aggregated -so that level of the data set goes from the unit of observation +such as caloric input or food expenditure per adult equivalent. +During this process, the data points will typically be reshaped and aggregated +so that level of the data set goes from the unit of observation (one item in the bundle) in the survey to the unit of analysis (the household).\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Unit\_of\_Observation}} + \url{https://dimewiki.worldbank.org/wiki/Unit\_of\_Observation}} \subsection{Why construction?} % From cleaning -Construction is done separately from data cleaning for two reasons. -The first one is to clearly differentiate the data originally collected +Construction is done separately from data cleaning for two reasons. +The first one is to clearly differentiate the data originally collected from the result of data processing decisions. -The second is to ensure that variable definition is consistent across data sources. -Unlike cleaning, construction can create many outputs from many inputs. -Let's take the example of a project that has a baseline and an endline survey. -Unless the two instruments are exactly the same, -which is preferable but often not the case, -the data cleaning for them will require different steps, -and therefore will be done separately. +The second is to ensure that variable definition is consistent across data sources. 
Unlike cleaning, construction can create many outputs from many inputs.
Let's take the example of a project that has a baseline and an endline survey.
Unless the two instruments are exactly the same,
which is preferable but often not the case,
the data cleaning for them will require different steps,
and therefore will be done separately.
However, you still want the constructed variables to be calculated in the same way, so they are comparable.
To do this, you will need at least two cleaning scripts,
and a single one for construction -- we will discuss how to do this in practice in a bit.

% From analysis
Ideally, indicator construction should be done right after data cleaning,
according to the pre-analysis plan.\index{Pre-analysis plan}
In practice, however, following this principle is not always easy.
As you analyze the data, different constructed variables will become necessary,
as well as subsets and other alterations to the data.
Still, constructing variables in a separate script from the analysis
will help you ensure consistency across different outputs.
If every script that creates a table starts by loading a data set,
subsetting it, and manipulating variables,
any edits to construction need to be replicated in all scripts.
This increases the chances that at least one of them will have a different sample or variable definition.
Therefore, even if construction ends up coming before analysis only in the order the code is run,
it's important to think of them as different steps.

\subsection{Construction tasks and how to approach them}

The first thing that comes to mind when we talk about variable construction is, of course, creating new variables.
Do this by adding new variables to the data set instead of overwriting the original information,
and assign functional names to them.
During cleaning, you want to keep all variables consistent with the survey instrument.
But constructed variables were not present in the survey to start with,
so making their names consistent with the survey form is not as crucial.
Of course, whenever possible, having variable names that are both intuitive
\textit{and} can be linked to the survey is ideal,
but if you need to choose, prioritize functionality.
Ordering the data set so that related variables are together,
and adding notes to each of them as necessary will also make your data set more user-friendly.

The simplest new variables to be created are aggregate indicators.
For example, you may want to add a household's income from different sources into a single total income variable,
or create a dummy for having at least one child in school.
Jumping to the step where you actually create these variables seems intuitive,
but it can also cause you a lot of problems,
as overlooking details may affect your results.
It is important to check and double-check the value-assignments of questions,
as well as their scales, before constructing new variables based on them.


Make sure there is consistency across constructed variables.
It's possible that your questionnaire asked respondents to report
some answers as percentages and others as proportions,
or that in one variable 0 means ``no'' and 1 means ``yes'',
while in another one the same answers were coded as 1 and 2.
We recommend coding yes/no questions as either 1/0 or TRUE/FALSE,
so they can be used numerically as frequencies in means and as dummies in regressions.
Check that non-binary categorical variables have the same value-assignment, i.e.,
that labels and levels have the same correspondence across variables that use the same options.
Finally, make sure that any numeric variables you are comparing are converted to the same scale or unit of measure.
You cannot add one hectare and two acres into a meaningful number.

During construction, you will also need to address some of the issues
you identified in the data set as you were cleaning it.
The most common of them is the presence of outliers.
How to treat outliers is a research question,
but make sure to note what decision was made by the research team,
and how you came to it.
Results can be sensitive to the treatment of outliers,
so keeping the original variable in the data set will allow you to test how much it affects the estimates.
All these points also apply to imputation of missing values and other distributional patterns.

The more complex construction tasks involve changing the structure of the data:
adding new observations or variables by merging data sets,
and changing the unit of observation through collapses or reshapes.
There are always ways for things to go wrong that we never anticipated,
but two issues to pay extra attention to are missing values and dropped observations.
Merging, reshaping and aggregating data sets can change both the total number of observations
and the number of observations with missing values.
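One way to catch these problems early is to hard-code your expectations as checks that stop the script when they fail.
The sketch below shows what this could look like in Stata;
the file paths, variable names, and expected sample size are hypothetical:

\begin{verbatim}
* Construction checks (file, variable names and sample size are hypothetical)
  use "${cleaned}/cleaned-household.dta", clear

* The household ID should uniquely identify observations
  isid hh_id

* Merging in village characteristics should match every household
  merge m:1 village_id using "${cleaned}/cleaned-village.dta"
  assert _merge != 1        // every household must find its village
  drop if _merge == 2       // drop unmatched villages, explicitly and visibly
  drop _merge

* Hard-code the expected sample size so silent drops raise an error
  assert _N == 1200
\end{verbatim}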
-Make sure to read about how each command treats missing observations and, +Make sure to read about how each command treats missing observations and, whenever possible, add automated checks in the script that throw an error message if the result is changing. -If you are subsetting your data, -drop observations explicitly, +If you are subsetting your data, +drop observations explicitly, indicating why you are doing that and how the data set changed. Finally, primary panel data involves additional timing complexities. @@ -473,60 +473,60 @@ \subsection{Construction tasks and how to approach them} Then the first thing you should do is create a panel data set -- \texttt{iecodebook}'s \texttt{append} subcommand will help you reconcile and append survey rounds. After that, adapt the construction code so it can be used on the panel data set. -Apart from preventing inconsistencies, +Apart from preventing inconsistencies, this process will also save you time and give you an opportunity to review your original code. \subsection{Documenting indicators construction} -Because data construction involves translating concrete data points to more abstract measurements, +Because data construction involves translating concrete data points to more abstract measurements, it is important to document exactly how each variable is derived or calculated. Adding comments to the code explaining what you are doing and why is a crucial step both to prevent mistakes and to guarantee transparency. -To make sure that these comments can be more easily navigated, +To make sure that these comments can be more easily navigated, it is wise to start writing a variable dictionary as soon as you begin making changes to the data. Carefully record how specific variables have been combined, recoded, and scaled, -and refer to those records in the code. -This can be part of a wider discussion with your team about creating protocols for variable definition, +and refer to those records in the code. +This can be part of a wider discussion with your team about creating protocols for variable definition, which will guarantee that indicators are defined consistently across projects. -When all your final variables have been created, -you can use \texttt{iecodebook}'s \texttt{export} subcommand to list all variables in the data set, +When all your final variables have been created, +you can use \texttt{iecodebook}'s \texttt{export} subcommand to list all variables in the data set, and complement it with the variable definitions you wrote during construction to create a concise meta data document. Documentation is an output of construction as relevant as the code and the data. -Someone unfamiliar with the project should be able to understand the contents of the analysis data sets, -the steps taken to create them, +Someone unfamiliar with the project should be able to understand the contents of the analysis data sets, +the steps taken to create them, and the decision-making process through your documentation. The construction documentation will complement the reports and notes created during data cleaning. Together, they will form a detailed account of the data processing. \subsection{Constructed data sets} -The other set of construction outputs, as expected, +The other set of construction outputs, as expected, consists of the data sets that will be used for analysis. A constructed data set is built to answer an analysis question. 
-Since different pieces of analysis may require different samples, +Since different pieces of analysis may require different samples, or even different units of observation, -you may have one or multiple constructed data sets, +you may have one or multiple constructed data sets, depending on how your analysis is structured. So don't worry if you cannot create a single, ``canonical'' analysis data set. It is common to have many purpose-built analysis datasets. -Think of an agricultural intervention that was randomized across villages -and only affected certain plots within each village. -The research team may want to run household-level regressions on income, -test for plot-level productivity gains, +Think of an agricultural intervention that was randomized across villages +and only affected certain plots within each village. +The research team may want to run household-level regressions on income, +test for plot-level productivity gains, and check if village characteristics are balanced. -Having three separate datasets for each of these three pieces of analysis -will result in much cleaner do files than if they all started from the same file. +Having three separate datasets for each of these three pieces of analysis +will result in much cleaner do files than if they all started from the same file. %------------------------------------------------ \section{Writing data analysis code} % Intro -------------------------------------------------------------- -Data analysis is the stage when research outputs are created. +Data analysis is the stage when research outputs are created. \index{data analysis} Many introductions to common code skills and analytical frameworks exist, such as \textit{R for Data Science};\sidenote{\url{https://r4ds.had.co.nz/}} \textit{A Practical Introduction to Stata};\sidenote{\url{https://scholar.harvard.edu/files/mcgovern/files/practical\_introduction\_to\_stata.pdf}} -\textit{Mostly Harmless Econometrics};\sidenote{\url{https://www.researchgate.net/publication/51992844\_Mostly\_Harmless\_Econometrics\_An\_Empiricist's\_Companion}} +\textit{Mostly Harmless Econometrics};\sidenote{\url{https://www.researchgate.net/publication/51992844\_Mostly\_Harmless\_Econometrics\_An\_Empiricist's\_Companion}} and \textit{Causal Inference: The Mixtape}.\sidenote{\url{http://scunning.com/mixtape.html}} This section will not include instructions on how to conduct specific analyses. That is a research question, and requires expertise beyond the scope of this book. @@ -536,49 +536,49 @@ \section{Writing data analysis code} \subsection{Organizing analysis code} The analysis stage usually starts with a process we call exploratory data analysis. -This is when you are trying different things and looking for patterns in your data. -It progresses into final analysis when your team starts to decide what are the main results, +This is when you are trying different things and looking for patterns in your data. +It progresses into final analysis when your team starts to decide what are the main results, those that will make it into the research output. The way you deal with code and outputs for exploratory and final analysis is different. -During exploratory data analysis, -you will be tempted to write lots of analysis into one big, impressive, start-to-finish script. -It subtly encourages poor practices such as not clearing the workspace and not reloading the constructed data set before each analysis task. 
To avoid mistakes, it's important to take the time
to organize the code that you want to use again in a clean manner.

A well-organized analysis script starts with a completely fresh workspace
and explicitly loads data before analyzing it.
This setup encourages data manipulation to be done earlier in the workflow
(that is, during construction).
It also prevents you from accidentally writing pieces of analysis code that depend on one another
and require manual instructions for all necessary chunks of code to be run in the right order.
Each script should run completely independently of all other code,
except for the master script.
You can go as far as coding every output in a separate script.
There is nothing wrong with code files being short and simple.
In fact, analysis scripts should be as simple as possible,
so whoever is reading them can focus on the econometrics, not the coding.
All research decisions should be very explicit in the code.
This includes clustering, sampling, and control variables, to name a few.
If you have multiple analysis data sets,
each of them should have a descriptive name about its sample and unit of observation.

As your team comes to a decision about model specification,
you can create globals or objects in the master script to use across scripts.
This is a good way to make sure specifications are consistent throughout the analysis.
Using pre-specified globals or objects also makes your code more dynamic,
so it is easy to update specifications and results without changing every script.

It is completely acceptable to have folders for each task,
and compartmentalize each analysis as much as needed.
To accomplish this, you will need to make sure that you have an effective data management system,
including naming, file organization, and version control.
Just like you did with each of the analysis datasets,
name each of the individual analysis files descriptively.
Code files such as \path{spatial-diff-in-diff.do},
\path{matching-villages.R}, and \path{summary-statistics.py}
are clear indicators of what each file is doing, and allow you to find code quickly.
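As a sketch of what this organization looks like in practice,
a short analysis script might contain nothing more than the lines below.
Every file path, variable, and global in it is a hypothetical placeholder;
the specification globals would be defined once in the master script:

\begin{verbatim}
* analysis/regression-main.do -- one analysis task per script (names hypothetical)

* Start from a fresh workspace and load the constructed data explicitly
  use "${final}/constructed-household.dta", clear

* Specification choices are set once, in the master script, for example:
*   global controls  "age education hh_size"
*   global se_method "vce(cluster village_id)"
  regress income treatment ${controls}, ${se_method}
\end{verbatim}

Because the controls and the clustering choice live in the master script,
updating the specification updates every output at once.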
-If you intend to numerically order the code as they appear in a paper or report, +If you intend to numerically order the code as they appear in a paper or report, leave this to near publication time. \subsection{Visualizing data} @@ -597,24 +597,24 @@ \subsection{Visualizing data} Graphics tools like Stata are highly customizable. There is a fair amount of learning curve associated with extremely-fine-grained adjustment, but it is well worth reviewing the graphics manual\sidenote{\url{https://www.stata.com/manuals/g.pdf}} -For an easier way around it, Gray Kimbrough's \textit{Uncluttered Stata Graphs} +For an easier way around it, Gray Kimbrough's \textit{Uncluttered Stata Graphs} code is an excellent default replacement for Stata graphics that is easy to install. \sidenote{\url{https://graykimbrough.github.io/uncluttered-stata-graphs/}} -If you are a R user, the \textit{R Graphics Cookbook}\sidenote{\url{https://r-graphics.org/}} +If you are a R user, the \textit{R Graphics Cookbook}\sidenote{\url{https://r-graphics.org/}} is a great resource for the its popular visualization package, \texttt{ggplot}\sidenote{ - \url{https://ggplot2.tidyverse.org/}}. -But there are a variety of other visualization packages, -such as \texttt{highcharter}\sidenote{\url{http://jkunst.com/highcharter/}}, -\texttt{r2d3}\sidenote{\url{https://rstudio.github.io/r2d3/}}, -\texttt{leaflet}\sidenote{\url{https://rstudio.github.io/leaflet/}}, + \url{https://ggplot2.tidyverse.org/}}. +But there are a variety of other visualization packages, +such as \texttt{highcharter}\sidenote{\url{http://jkunst.com/highcharter/}}, +\texttt{r2d3}\sidenote{\url{https://rstudio.github.io/r2d3/}}, +\texttt{leaflet}\sidenote{\url{https://rstudio.github.io/leaflet/}}, and \texttt{plotly}\sidenote{\url{https://plot.ly/r/}}, to name a few. We have no intention of creating an exhaustive list, and this one is certainly missing very good references. But at least it is a place to start. -We attribute some of the difficulty of creating good data visualization +We attribute some of the difficulty of creating good data visualization to writing code to create them. -Making a visually compelling graph would already be hard enough if +Making a visually compelling graph would already be hard enough if you didn't have to go through many rounds of googling to understand a command. The trickiest part of using plot commands is to get the data in the right format. This is why we create the \textbf{Stata Visual Library}\sidenote{ @@ -622,86 +622,86 @@ \subsection{Visualizing data} has examples of graphs created in Stata and curated by us.\sidenote{ A similar resource for R is \textit{The R Graph Gallery}. \\\url{https://www.r-graph-gallery.com/}} The Stata Visual Library includes example data sets to use with each do-file, -so you get a good sense of what your data should look like +so you get a good sense of what your data should look like before you can start writing code to create a visualization. \section{Exporting analysis outputs} -Our team has created a few products to automate common outputs and save you +Our team has created a few products to automate common outputs and save you precious research time. The \texttt{ietoolkit} package includes two commands to export nicely formatted tables. -\texttt{iebaltab}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Iebaltab}} -creates and exports balance tables to excel or {\LaTeX}. 
-\texttt{ieddtab}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Ieddtab}} +\texttt{iebaltab}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Iebaltab}} +creates and exports balance tables to excel or {\LaTeX}. +\texttt{ieddtab}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Ieddtab}} does the same for difference-in-differences regressions. It also includes a command, \texttt{iegraph}\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Iegraph}}, to export pre-formatted impact evaluation results graphs -It's ok to not export each and every table and graph created during exploratory analysis. -Final analysis scripts, on the other hand, should export final outputs, +It's ok to not export each and every table and graph created during exploratory analysis. +Final analysis scripts, on the other hand, should export final outputs, which are ready to be included to a paper or report. -No manual edits, including formatting, should be necessary after exporting final outputs -- -those that require copying and pasting edited outputs, -in particular, are absolutely not advisable. -Manual edits are difficult to replicate, -and you will inevitably need to make changes to the outputs. -Automating them will save you time by the end of the process. +No manual edits, including formatting, should be necessary after exporting final outputs -- +those that require copying and pasting edited outputs, +in particular, are absolutely not advisable. +Manual edits are difficult to replicate, +and you will inevitably need to make changes to the outputs. +Automating them will save you time by the end of the process. However, don't spend too much time formatting tables and graphs until you are ready to publish.\sidenote{ For a more detailed discussion on this, including different ways to export tables from Stata, see \url{https://github.com/bbdaniels/stata-tables}} -Polishing final outputs can be a time-consuming process, +Polishing final outputs can be a time-consuming process, and you want to it as few times as possible. -We cannot stress this enough: +We cannot stress this enough: don't ever set a workflow that requires copying and pasting results. Copying results from excel to word is error-prone and inefficient. -Copying results from a software console is risk-prone, +Copying results from a software console is risk-prone, even more inefficient, and unnecessary. There are numerous commands to export outputs from both R and Stata to a myriad of formats.\sidenote{ - Some examples are \href{ http://repec.sowi.unibe.ch/stata/estout/}{\texttt{estout}}, \href{https://www.princeton.edu/~otorres/Outreg2.pdf}{\texttt{outreg2}}, -and \href{https://www.benjaminbdaniels.com/stata-code/outwrite/}{\texttt{outwrite}} in Stata, + Some examples are \href{ http://repec.sowi.unibe.ch/stata/estout/}{\texttt{estout}}, \href{https://www.princeton.edu/~otorres/Outreg2.pdf}{\texttt{outreg2}}, +and \href{https://www.benjaminbdaniels.com/stata-code/outwrite/}{\texttt{outwrite}} in Stata, and \href{https://cran.r-project.org/web/packages/stargazer/vignettes/stargazer.pdf}{\texttt{stargazer}} and \href{https://ggplot2.tidyverse.org/reference/ggsave.html}{\texttt{ggsave}} in R.} Save outputs in accessible and, whenever possible, lightweight formats. Accessible means that it's easy for other people to open them. 
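As a hedged illustration of the export commands listed in the sidenote above, a regression table can be written straight to a \texttt{.tex} file with \texttt{esttab} (from the user-written \texttt{estout} package); the output path and variable names below are hypothetical:

\begin{verbatim}
* ssc install estout, replace   // once, if the package is not yet installed
eststo clear
eststo: regress income treatment
eststo: regress income treatment age hh_size
esttab using "${outputs}/income-regressions.tex", ///
    se label replace booktabs                     ///
    title("Treatment effects on household income")
\end{verbatim}

Because the file is regenerated every time the script runs, there is nothing to copy and paste.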
-In Stata, that would mean always using \texttt{graph export} to save images as
-\texttt{.jpg}, \texttt{.png}, \texttt{.pdf}, etc.,
-instead of \texttt{graph save},
+In Stata, that would mean always using \texttt{graph export} to save images as
+\texttt{.jpg}, \texttt{.png}, \texttt{.pdf}, etc.,
+instead of \texttt{graph save},
 which creates a \texttt{.gph} file that can only be opened through a Stata installation.
 Some publications require ``lossless'' TIFF or EPS files, which are created by specifying the desired extension.
 Whichever format you decide to use, remember to always specify the file extension explicitly.
 For tables there are fewer options and more considerations to be made.
-Exporting table to \texttt{.tex} should be preferred.
-Excel \texttt{.xlsx} and \texttt{.csv} are also commonly used,
+Exporting tables to \texttt{.tex} should be preferred.
+Excel \texttt{.xlsx} and \texttt{.csv} are also commonly used,
 but require the extra step of copying the tables into the final output.
-The amount of work needed in a copy-paste workflow increases rapidly with the number of tables and figures included in a research output,
+The amount of work needed in a copy-paste workflow increases rapidly with the number of tables and figures included in a research output,
 and so do the chances of having the wrong version of a result in your paper or report.
-If you need to create a table with a very particular format,
-that is not automated by any command you know, consider writing the it manually
+If you need to create a table with a very particular format
+that is not automated by any command you know, consider writing it manually
 (Stata's \texttt{filewrite}, for example, allows you to do that).
-This will allow you to write a cleaner script that focuses on the econometrics,
+This will allow you to write a cleaner script that focuses on the econometrics,
 and not on complicated commands to create and append intermediate matrices.
 To avoid cluttering your scripts with formatting
 and ensure that formatting is consistent across outputs,
 define formatting options in an R object or a Stata global and call them when needed.
 Keep in mind that final outputs should be self-standing.
 This means it should be easy to read and understand them with only the information they contain.
-Make sure labels and notes cover all relevant information, such as sample,
+Make sure labels and notes cover all relevant information, such as sample,
 unit of observation, unit of measurement, and variable definition.\sidenote{
 \url{https://dimewiki.worldbank.org/wiki/Checklist:\_Reviewing\_Graphs} \\
 \url{https://dimewiki.worldbank.org/wiki/Checklist:\_Submit\_Table}}

-If you follow the steps outlined in this chapter,
-most of the data work involved in the last step of the research process
+If you follow the steps outlined in this chapter,
+most of the data work involved in the last step of the research process
 -- publication -- will already be done.
-If you used de-identified data for analysis,
-publishing the cleaned data set in a trusted repository will allow you to cite your data.
+If you used de-identified data for analysis,
+publishing the cleaned data set in a trusted repository will allow you to cite your data.
 Some of the documentation produced during cleaning and construction
 can be published even if your data is too sensitive to be published.
 Your analysis code will be organized in a reproducible way,
 so all you will need to do to release a replication package is a last round of code review.
-This will allow you to focus on what matters: +This will allow you to focus on what matters: writing up your results into a compelling story. %------------------------------------------------ From 881d7052cff8a98696467ad8561bd44cb174f76b Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 11 Feb 2020 18:23:35 -0500 Subject: [PATCH 653/854] Clean up sections a bit and change titles --- chapters/data-analysis.tex | 83 ++++++++++++++++++-------------------- manuscript.tex | 2 +- 2 files changed, 41 insertions(+), 44 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 74485fc35..fb447b484 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -28,7 +28,7 @@ %------------------------------------------------ -\section{Data management} +\section{Managing data effectively} The goal of data management is to organize the components of data work so it can traced back and revised without massive effort. @@ -44,7 +44,7 @@ \section{Data management} Smart use of version control also allows you to track how each edit affects other files in the project. -\subsection{Folder structure} +\subsection{Organizing your folder structure} There are many ways to organize research data. Our preferred scheme reflects the task breakdown that will be outlined in this chapter. @@ -77,7 +77,7 @@ \subsection{Folder structure} \url{https://dimewiki.worldbank.org/wiki/Master\_Do-files}} so all project code is reflected in a top-level script. -\subsection{Task breakdown} +\subsection{Breaking down tasks} We divide the data work process that starts from the raw data and builds on it to create final analysis outputs into three stages: @@ -101,7 +101,7 @@ \subsection{Task breakdown} Code review is a common quality assurance practice among data scientists. It helps to keep the level of the outputs high, and is also a great way to learn and improve your code. -\subsection{Master scripts} +\subsection{Writing master scripts} Master scripts allow users to execute all the project code from a single file. They briefly describe what each code does, @@ -115,7 +115,7 @@ \subsection{Master scripts} and where different files can be found in the project folder. That is, it should contain all the information needed to interact with a project's data work. -\subsection{Version control} +\subsection{Implementing version control} Finally, establishing a version control system is an incredibly useful and important step for documentation, collaboration and conflict-solving. @@ -138,7 +138,7 @@ \subsection{Version control} %------------------------------------------------ -\section{De-identification} +\section{De-identifying research data} The starting point for all tasks described in this chapter is the raw data. It should contain only materials that are received directly from the field. @@ -214,7 +214,7 @@ \section{De-identification} de-identification should not affect the usability of the data. -\section{Data cleaning} +\section{Cleaning data for analysis} Data cleaning is the second stage in the transformation of data you received from the field into data that you can analyze.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning}} The cleaning process involves (1) making the data set easily usable and understandable, @@ -259,7 +259,36 @@ \subsection{Correcting data entry errors} and you should keep a careful record of how they were identified, and how the correct value was obtained. 
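To make the idea of a careful record concrete, here is a minimal sketch of how documented corrections might be applied in code rather than by editing the raw data file; the IDs, variable names, file paths, and issue numbers are all hypothetical:

\begin{verbatim}
* Corrections are applied by script, never by editing the raw data directly
use "${data}/survey-deidentified.dta", clear

* Corrections log, issue 12: enumerator confirmed the plot size for this
* household was recorded in acres rather than hectares
replace plot_size_ha = plot_size_ha * 0.4047 if hh_id == 1024

* Corrections log, issue 15: respondent re-contacted to confirm birth year
replace birth_year = 1987 if hh_id == 2210 & birth_year == 1887

save "${data}/survey-corrected.dta", replace
\end{verbatim}

Each line points back to the entry in the corrections log that justifies it, so the fix is both documented and reproducible.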
-\subsection{Labeling and annotating the raw data} +\subsection{Labeling, annotating, and finalizing clean data} + +The main output of data cleaning is the cleaned data set. +It should contain the same information as the raw data set, +with no changes to data points. +It should also be easily traced back to the survey instrument, +and be accompanied by a dictionary or codebook. +Typically, one cleaned data set will be created for each data source, +i.e. per survey instrument. +Each row in the cleaned data set represents one survey entry or unit of observation.\sidenote{\cite{tidy-data}} +If the raw data set is very large, or the survey instrument is very complex, +you may want to break the data cleaning into sub-steps, +and create intermediate cleaned data sets +(for example, one per survey module). +When dealing with complex surveys with multiple nested groups, +is is also useful to have each cleaned data set at the smallest unit of observation inside a roster. +This will make the cleaning faster and the data easier to handle during construction. +But having a single cleaned data set will help you with sharing and publishing the data. + +To make sure the cleaned data set file doesn't get too big to be handled, +use commands such as \texttt{compress} in Stata to make sure the data +is always stored in the most efficient format. +Once you have a cleaned, de-identified data set, and documentation to support it, +you have created the first data output of your project: +a publishable data set. +The next chapter will get into the details of data publication. +For now, all you need to know is that your team should consider submitting the data set for publication at this point, +even if it will remain embargoed for some time. +This will help you organize your files and create a back up of the data, +and some donors require that the data be filed as an intermediate step of the project. On average, making corrections to primary data is more time-consuming than when using secondary data. But you should always check for possible issues in any data you are about to use. @@ -319,36 +348,6 @@ \subsection{Documenting data cleaning} then use it as a basis to discuss with your team how to address potential issues during data construction. This material will also be valuable during exploratory data analysis. -\subsection{The cleaned data set} - -The main output of data cleaning is the cleaned data set. -It should contain the same information as the raw data set, -with no changes to data points. -It should also be easily traced back to the survey instrument, -and be accompanied by a dictionary or codebook. -Typically, one cleaned data set will be created for each data source, -i.e. per survey instrument. -Each row in the cleaned data set represents one survey entry or unit of observation.\sidenote{\cite{tidy-data}} -If the raw data set is very large, or the survey instrument is very complex, -you may want to break the data cleaning into sub-steps, -and create intermediate cleaned data sets -(for example, one per survey module). -When dealing with complex surveys with multiple nested groups, -is is also useful to have each cleaned data set at the smallest unit of observation inside a roster. -This will make the cleaning faster and the data easier to handle during construction. -But having a single cleaned data set will help you with sharing and publishing the data. 
-To make sure the cleaned data set file doesn't get too big to be handled, -use commands such as \texttt{compress} in Stata to make sure the data -is always stored in the most efficient format. -Once you have a cleaned, de-identified data set, and documentation to support it, -you have created the first data output of your project: -a publishable data set. -The next chapter will get into the details of data publication. -For now, all you need to know is that your team should consider submitting the data set for publication at this point, -even if it will remain embargoed for some time. -This will help you organize your files and create a back up of the data, -and some donors require that the data be filed as an intermediate step of the project. - \section{Constructing final indicators} % What is construction ------------------------------------- @@ -372,8 +371,6 @@ \section{Constructing final indicators} (one item in the bundle) in the survey to the unit of analysis (the household).\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Unit\_of\_Observation}} -\subsection{Why construction?} - % From cleaning Construction is done separately from data cleaning for two reasons. The first one is to clearly differentiate the data originally collected @@ -405,7 +402,7 @@ \subsection{Why construction?} Therefore, even if construction ends up coming before analysis only in the order the code is run, it's important to think of them as different steps. -\subsection{Construction tasks and how to approach them} +\subsection{Constructing analytical variables} The first thing that comes to mind when we talk about variable construction is, of course, creating new variables. Do this by adding new variables to the data set instead of overwriting the original information, @@ -476,7 +473,7 @@ \subsection{Construction tasks and how to approach them} Apart from preventing inconsistencies, this process will also save you time and give you an opportunity to review your original code. -\subsection{Documenting indicators construction} +\subsection{Documenting variable construction} Because data construction involves translating concrete data points to more abstract measurements, it is important to document exactly how each variable is derived or calculated. @@ -625,7 +622,7 @@ \subsection{Visualizing data} so you get a good sense of what your data should look like before you can start writing code to create a visualization. -\section{Exporting analysis outputs} +\subsection{Exporting analysis outputs} Our team has created a few products to automate common outputs and save you precious research time. 
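As a rough sketch of how one of these commands might be used (the variable names and output path are hypothetical, and option names should be checked against the current \texttt{iebaltab} helpfile):

\begin{verbatim}
* Install the package once from SSC, then call its commands in analysis scripts
ssc install ietoolkit, replace

* Export a balance table for selected covariates directly to LaTeX
iebaltab income hh_size n_plots, grpvar(treatment) ///
    savetex("${outputs}/balance.tex") replace
\end{verbatim}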
diff --git a/manuscript.tex b/manuscript.tex index 496ca31b3..ca0d6aa4e 100644 --- a/manuscript.tex +++ b/manuscript.tex @@ -77,7 +77,7 @@ \chapter{Chapter 5: Collecting primary data} % CHAPTER 6 %---------------------------------------------------------------------------------------- -\chapter{Chapter 6: Analyzing survey data} +\chapter{Chapter 6: Analyzing research data} \label{ch:6} \input{chapters/data-analysis.tex} From 86e10a738410f6d7ae007c9519c0e847a5e857c4 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 11 Feb 2020 18:50:06 -0500 Subject: [PATCH 654/854] Edits and clarifications --- chapters/data-analysis.tex | 146 ++++++++++++++++++------------------- 1 file changed, 72 insertions(+), 74 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index fb447b484..c75613cd5 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -31,7 +31,7 @@ \section{Managing data effectively} The goal of data management is to organize the components of data work -so it can traced back and revised without massive effort. +so the complete process can traced, understood, and revised without massive effort. In our experience, there are four key elements to good data management: folder structure, task breakdown, master scripts, and version control. A good folder structure organizes files so that any material can be found when needed. @@ -41,7 +41,7 @@ \section{Managing data effectively} It is a one-file summary of your whole project. Finally, version histories and backups enable the team to edit files without fear of losing information. -Smart use of version control also allows you to track +Smart use of version control allows you to track how each edit affects other files in the project. \subsection{Organizing your folder structure} @@ -51,37 +51,38 @@ \subsection{Organizing your folder structure} \index{data organization} Our team at DIME Analytics developed the \texttt{iefolder}\sidenote{ \url{https://dimewiki.worldbank.org/wiki/iefolder}} -package (part of \texttt{ietoolkit}\sidenote{ +command (part of \texttt{ietoolkit}\sidenote{ \url{https://dimewiki.worldbank.org/wiki/ietoolkit}}) to automatize the creation of a folder following this scheme and to standardize folder structures across teams and projects. Standardizing folders greatly reduces the costs that PIs and RAs face when switching between projects, -because they are organized in the same way.\sidenote{ +because they are all organized in exactly the same way +and use the same filepaths, shortcuts, and macro references.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/DataWork\_Folder}} -We created the command based on our experience with primary data, +We created \texttt{iefolder} based on our experience with primary data, but it can be used for different types of data, and adapted to fit different needs. No matter what are your team's preference in terms of folder organization, -the principle of creating one standard remains. +the principle of creating a single unified standard remains. -At the first level of the structure created by \texttt{iefolder} are what we call survey round folders.\sidenote{ +At the top level of the structure created by \texttt{iefolder} are what we call survey round folders.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/DataWork\_Survey\_Round}} -You can think of a ``round'' as one source of data, -that will be cleaned in the same script. 
-Inside round folders, there are dedicated folders for +You can think of a ``round'' as a single source of data, +which will all be cleaned using a single script. +Inside each round folder, there are dedicated folders for: raw (encrypted) data; de-identified data; cleaned data; and final (constructed) data. There is a folder for raw results, as well as for final outputs. The folders that hold code are organized in parallel to these, so that the progression through the whole project can be followed by anyone new to the team. Additionally, \texttt{iefolder} creates \textbf{master do-files}\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Master\_Do-files}} -so all project code is reflected in a top-level script. +so the structure of all project code is reflected in a top-level script. \subsection{Breaking down tasks} We divide the data work process that starts from the raw data -and builds on it to create final analysis outputs into three stages: -data cleaning, variable construction, and data analysis. +and builds on it to create final analysis outputs into four stages: +de-identification, data cleaning, variable construction, and data analysis. Though they are frequently implemented at the same time, we find that creating separate scripts and data sets prevents mistakes. It will be easier to understand this division as we discuss what each stage comprises. @@ -91,15 +92,15 @@ \subsection{Breaking down tasks} For each stage, there should be a code folder and a corresponding data set. The names of codes, data sets and outputs for each stage should be consistent, making clear how they relate to one another. -So, for example, a script called \texttt{clean-section-1} would create -a data set called \texttt{cleaned-section-1}. +So, for example, a script called \texttt{section-1-cleaning} would create +a data set called \texttt{section-1-clean}. The division of a project in stages also helps the review workflow inside your team. The code, data and outputs of each of these stages should go through at least one round of code review. During the code review process, team members should read and run each other's codes. Doing this at the end of each stage helps prevent the amount of work to be reviewed to become too overwhelming. Code review is a common quality assurance practice among data scientists. -It helps to keep the level of the outputs high, and is also a great way to learn and improve your code. +It helps to keep the quality of the outputs high, and is also a great way to learn and improve your own code. \subsection{Writing master scripts} @@ -107,8 +108,8 @@ \subsection{Writing master scripts} They briefly describe what each code does, and map the files they require and create. They also connect code and folder structure through macros or objects. -In short, a master script is a human-readable map to the tasks, -files and folder structure that comprise a project. +In short, a master script is a human-readable map of the tasks, +files, and folder structure that comprise a project. Having a master script eliminates the need for complex instructions to replicate results. 
Reading it should be enough for anyone unfamiliar with the project to understand what are the main tasks, which scripts execute them, @@ -117,7 +118,7 @@ \subsection{Writing master scripts} \subsection{Implementing version control} -Finally, establishing a version control system is an incredibly useful +Establishing a version control system is an incredibly useful and important step for documentation, collaboration and conflict-solving. Version control allows you to effectively track code edits, including the addition and deletion of files. @@ -126,7 +127,7 @@ \subsection{Implementing version control} Everything that can be version-controlled should be. Both analysis results and data sets will change with the code. Whenever possible, you should track have each of them with the code that created it. -If you are writing code in Git/GitHub, +If you are writing code in Git or GitHub, you can output plain text files such as \texttt{.tex} tables and metadata saved in \texttt{.txt} or \texttt{.csv} to that directory. Binary files that compile the tables, @@ -223,8 +224,8 @@ \section{Cleaning data for analysis} The cleaned data set should contain only the variables collected in the field. No modifications to data points are made at this stage, except for corrections of mistaken entries. -Cleaning is probably the most time consuming of the stages discussed in this chapter. -This is the time when you obtain an extensive understanding of the contents and structure of the data that was collected. +Cleaning is probably the most time-consuming of the stages discussed in this chapter. +This is the time when you obtain an extensive understanding of the contents and structure of the data that was collected. Explore your data set using tabulations, summaries, and descriptive plots. You should use this time to understand the types of responses collected, both within each survey question and across respondents. Knowing your data set well will make it possible to do analysis. @@ -266,8 +267,8 @@ \subsection{Labeling, annotating, and finalizing clean data} with no changes to data points. It should also be easily traced back to the survey instrument, and be accompanied by a dictionary or codebook. -Typically, one cleaned data set will be created for each data source, -i.e. per survey instrument. +Typically, one cleaned data set will be created for each data source +or survey instrument. Each row in the cleaned data set represents one survey entry or unit of observation.\sidenote{\cite{tidy-data}} If the raw data set is very large, or the survey instrument is very complex, you may want to break the data cleaning into sub-steps, @@ -281,7 +282,7 @@ \subsection{Labeling, annotating, and finalizing clean data} To make sure the cleaned data set file doesn't get too big to be handled, use commands such as \texttt{compress} in Stata to make sure the data is always stored in the most efficient format. -Once you have a cleaned, de-identified data set, and documentation to support it, +Once you have a cleaned, de-identified data set and the documentation to support it, you have created the first data output of your project: a publishable data set. The next chapter will get into the details of data publication. 
@@ -329,7 +330,7 @@ \subsection{Documenting data cleaning} These materials are essential for data documentation.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Data\_Documentation}} \index{Documentation} -They should be stored in the corresponding ``Documentation'' folder for easy access, +They should be stored in the corresponding \texttt{Documentation} folder for easy access, as you will probably need them during analysis, and they must be made available for publication. Include in the \texttt{Documentation} folder records of any @@ -371,6 +372,21 @@ \section{Constructing final indicators} (one item in the bundle) in the survey to the unit of analysis (the household).\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Unit\_of\_Observation}} +A constructed data set is built to answer an analysis question. +Since different pieces of analysis may require different samples, +or even different units of observation, +you may have one or multiple constructed data sets, +depending on how your analysis is structured. +So don't worry if you cannot create a single, ``canonical'' analysis data set. +It is common to have many purpose-built analysis datasets. +Think of an agricultural intervention that was randomized across villages +and only affected certain plots within each village. +The research team may want to run household-level regressions on income, +test for plot-level productivity gains, +and check if village characteristics are balanced. +Having three separate datasets for each of these three pieces of analysis +will result in much cleaner do files than if they all started from the same file. + % From cleaning Construction is done separately from data cleaning for two reasons. The first one is to clearly differentiate the data originally collected @@ -383,7 +399,7 @@ \section{Constructing final indicators} the data cleaning for them will require different steps, and therefore will be done separately. However, you still want the constructed variables to be calculated in the same way, so they are comparable. -To do this, you will at least two cleaning scripts, +To do this, you will require at least two cleaning scripts, and a single one for construction -- we will discuss how to do this in practice in a bit. @@ -429,14 +445,16 @@ \subsection{Constructing analytical variables} Make sure there is consistency across constructed variables. It's possible that your questionnaire asked respondents to report some answers as percentages and others as proportions, -or that in one variable 0 means ``no'' and 1 means ``yes'', -while in another one the same answers were coded are 1 and 2. -We recommend coding yes/no questions as either 1/0 or TRUE/FALSE, +or that in one variable \texttt{0} means ``no'' and \texttt{1} means ``yes'', +while in another one the same answers were coded are \texttt{1} and \texttt{2}. +We recommend coding yes/no questions as either \texttt{1} and \texttt{0} or \texttt{TRUE} and \texttt{FALSE}, so they can be used numerically as frequencies in means and as dummies in regressions. +(Note that this implies that categorical variables like \texttt{gender} +should be re-expressed as binary variables like \texttt{female}.) Check that non-binary categorical variables have the same value-assignment, i.e., that labels and levels have the same correspondence across variables that use the same options. Finally, make sure that any numeric variables you are comparing are converted to the same scale or unit of measure. -You cannot add one hectare and twos acres into a meaningful number. 
+You cannot add one hectare and two acres into a meaningful number. During construction, you will also need to address some of the issues you identified in the data set as you were cleaning it. @@ -467,10 +485,11 @@ \subsection{Constructing analytical variables} Having a well-established definition for each constructed variable helps prevent that mistake, but the best way to guarantee it won't happen is to create the indicators for all rounds in the same script. Say you constructed variables after baseline, and are now receiving midline data. -Then the first thing you should do is create a panel data set --- \texttt{iecodebook}'s \texttt{append} subcommand will help you reconcile and append survey rounds. -After that, adapt the construction code so it can be used on the panel data set. -Apart from preventing inconsistencies, +Then the first thing you should do is create a cleaned panel data set, +ignoring the previous constructed version of the baseline data. +The \texttt{iecodebook append} subcommand will help you reconcile and append the cleaned survey rounds. +After that, adapt a single variable construction script so it can be used on the panel data set as a whole. +In addition to preventing inconsistencies, this process will also save you time and give you an opportunity to review your original code. \subsection{Documenting variable construction} @@ -485,8 +504,8 @@ \subsection{Documenting variable construction} This can be part of a wider discussion with your team about creating protocols for variable definition, which will guarantee that indicators are defined consistently across projects. When all your final variables have been created, -you can use \texttt{iecodebook}'s \texttt{export} subcommand to list all variables in the data set, -and complement it with the variable definitions you wrote during construction to create a concise meta data document. +you can use the \texttt{iecodebook export} subcommand to list all variables in the data set, +and complement it with the variable definitions you wrote during construction to create a concise metadata document. Documentation is an output of construction as relevant as the code and the data. Someone unfamiliar with the project should be able to understand the contents of the analysis data sets, the steps taken to create them, @@ -494,25 +513,6 @@ \subsection{Documenting variable construction} The construction documentation will complement the reports and notes created during data cleaning. Together, they will form a detailed account of the data processing. -\subsection{Constructed data sets} - -The other set of construction outputs, as expected, -consists of the data sets that will be used for analysis. -A constructed data set is built to answer an analysis question. -Since different pieces of analysis may require different samples, -or even different units of observation, -you may have one or multiple constructed data sets, -depending on how your analysis is structured. -So don't worry if you cannot create a single, ``canonical'' analysis data set. -It is common to have many purpose-built analysis datasets. -Think of an agricultural intervention that was randomized across villages -and only affected certain plots within each village. -The research team may want to run household-level regressions on income, -test for plot-level productivity gains, -and check if village characteristics are balanced. 
-Having three separate datasets for each of these three pieces of analysis -will result in much cleaner do files than if they all started from the same file. - %------------------------------------------------ \section{Writing data analysis code} @@ -528,7 +528,7 @@ \section{Writing data analysis code} This section will not include instructions on how to conduct specific analyses. That is a research question, and requires expertise beyond the scope of this book. Instead, we will outline the structure of writing analysis code, -assuming you have completed the process of data cleaning and construction. +assuming you have completed the process of data cleaning and variable construction. \subsection{Organizing analysis code} @@ -544,19 +544,20 @@ \subsection{Organizing analysis code} to organize the code that you want to use again in a clean manner. A well-organized analysis script starts with a completely fresh workspace -and explicitly loads data before analyzing it. +and explicitly loads data before analyzing it, for each output it creates. This setup encourages data manipulation to be done earlier in the workflow (that is, during construction). It also and prevents you from accidentally writing pieces of analysis code that depend on one another -and require manual instructions for all necessary chuncks of code to be run in the right order. -Each script should run completely independently of all other code, +and require manual instructions for all necessary chunks of code to be run in the right order. +Each chunk of analysis code should run completely independently of all other code, except for the master script. -You can go as far as coding every output in a separate script. +You could go as far as coding every output in a separate script (although you usually won't). There is nothing wrong with code files being short and simple. In fact, analysis scripts should be as simple as possible, so whoever is reading them can focus on the econometrics, not the coding. -All research decisions should be very explicit in the code. +All research questions and statistical decisions should be very explicit in the code, +and should be very easy to detect from the way the code is written. This includes clustering, sampling, and control variables, to name a few. If you have multiple analysis data sets, each of them should have a descriptive name about its sample and unit of observation. @@ -605,19 +606,16 @@ \subsection{Visualizing data} \texttt{r2d3}\sidenote{\url{https://rstudio.github.io/r2d3/}}, \texttt{leaflet}\sidenote{\url{https://rstudio.github.io/leaflet/}}, and \texttt{plotly}\sidenote{\url{https://plot.ly/r/}}, to name a few. -We have no intention of creating an exhaustive list, and this one is certainly missing very good references. -But at least it is a place to start. - +We have no intention of creating an exhaustive list, and this one is certainly missing very good references; but it is a good place to start. We attribute some of the difficulty of creating good data visualization to writing code to create them. Making a visually compelling graph would already be hard enough if you didn't have to go through many rounds of googling to understand a command. The trickiest part of using plot commands is to get the data in the right format. 
-This is why we create the \textbf{Stata Visual Library}\sidenote{ +This is why we created the \textbf{Stata Visual Library}\sidenote{ \url{https://worldbank.github.io/Stata-IE-Visual-Library/}}, -has examples of graphs created in Stata and curated by us.\sidenote{ - A similar resource for R is \textit{The R Graph Gallery}. \\\url{https://www.r-graph-gallery.com/}} +which has examples of graphs created in Stata and curated by us.\sidenote{A similar resource for R is \textit{The R Graph Gallery}. \\\url{https://www.r-graph-gallery.com/}} The Stata Visual Library includes example data sets to use with each do-file, so you get a good sense of what your data should look like before you can start writing code to create a visualization. @@ -633,9 +631,9 @@ \subsection{Exporting analysis outputs} does the same for difference-in-differences regressions. It also includes a command, \texttt{iegraph}\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Iegraph}}, -to export pre-formatted impact evaluation results graphs +to export pre-formatted impact evaluation results graphs. -It's ok to not export each and every table and graph created during exploratory analysis. +It's okay to not export each and every table and graph created during exploratory analysis. Final analysis scripts, on the other hand, should export final outputs, which are ready to be included to a paper or report. No manual edits, including formatting, should be necessary after exporting final outputs -- @@ -650,10 +648,10 @@ \subsection{Exporting analysis outputs} and you want to it as few times as possible. We cannot stress this enough: -don't ever set a workflow that requires copying and pasting results. -Copying results from excel to word is error-prone and inefficient. +don't ever set up a workflow that requires copying and pasting results. +Copying results from Excel to Word is error-prone and inefficient. Copying results from a software console is risk-prone, -even more inefficient, and unnecessary. +even more inefficient, and totally unnecessary. There are numerous commands to export outputs from both R and Stata to a myriad of formats.\sidenote{ Some examples are \href{ http://repec.sowi.unibe.ch/stata/estout/}{\texttt{estout}}, \href{https://www.princeton.edu/~otorres/Outreg2.pdf}{\texttt{outreg2}}, and \href{https://www.benjaminbdaniels.com/stata-code/outwrite/}{\texttt{outwrite}} in Stata, @@ -665,7 +663,7 @@ \subsection{Exporting analysis outputs} \texttt{.jpg}, \texttt{.png}, \texttt{.pdf}, etc., instead of \texttt{graph save}, which creates a \texttt{.gph} file that can only be opened through a Stata installation. -Some publications require ``lossless'' TIFF of EPS files, which are created by specifying the desired extension. +Some publications require ``lossless'' TIFF or EPS files, which are created by specifying the desired extension. Whichever format you decide to use, remember to always specify the file extension explicitly. For tables there are less options and more consideration to be made. Exporting table to \texttt{.tex} should be preferred. 
From e6322306a87e9bc6ae2b57165877ef796d76af18 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 11 Feb 2020 18:51:38 -0500 Subject: [PATCH 655/854] Accept suggestion MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- chapters/research-design.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index b92b60fe0..0a58ea534 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -8,7 +8,7 @@ and how the design affects data work. Without going into too much technical detail, as there are many excellent resources on impact evaluation design, -this section presents a brief overview +this chapter presents a brief overview of the most common causal inference methods, focusing on implications for data structure and analysis. The intent of this chapter is for you to obtain an understanding of From 22ab8f485d55dc97dd1ec6a1b7577c1ae8ec47d6 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 11 Feb 2020 18:53:30 -0500 Subject: [PATCH 656/854] De-weirdify Chp7 --- chapters/publication.tex | 27 +++++++++------------------ 1 file changed, 9 insertions(+), 18 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 7175eadbe..f9b088e84 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -7,18 +7,6 @@ These represent an intellectual contribution in their own right, because they enable others to learn from your process and better understand the results you have obtained. -Holding code and data to the same standards a written work -is a new practice for many researchers. -In this chapter, we provide guidelines that will help you -prepare a functioning and informative replication package. -Ideally, if you have organized your analytical work -according to the general principles outlined throughout this book, -then preparing to release materials will not require -substantial reorganization of the work you have already done. -Hence, this step represents the conclusion of the system -of transparent, reproducible, and credible research we introduced -from the very first chapter of this book. - Typically, various contributors collaborate on both code and writing, manuscripts go through many iterations and revisions, and the final package for publication includes not just a manuscript @@ -32,14 +20,17 @@ collectively referred to as dynamic documents -- for managing the process of collaboration on any technical product. -For most research projects, completing a manuscript is not the end of the task. -Academic journals increasingly require submission of a replication package, -which contains the code and materials needed to create the results. -These represent an intellectual contribution in their own right, -because they enable others to learn from your process -and better understand the results you have obtained. Holding code and data to the same standards a written work is a new practice for many researchers. +In this chapter, we provide guidelines that will help you +prepare a functioning and informative replication package. +Ideally, if you have organized your analytical work +according to the general principles outlined throughout this book, +then preparing to release materials will not require +substantial reorganization of the work you have already done. 
+Hence, this step represents the conclusion of the system +of transparent, reproducible, and credible research we introduced +from the very first chapter of this book. In this chapter, we first discuss tools and workflows for collaborating on technical writing. Next, we turn to publishing data, noting that the data can itself be a significant contribution in addition to analytical results. From 10abdabf90ecdae78a82b41d0d0289282716a4cc Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 11 Feb 2020 18:57:15 -0500 Subject: [PATCH 657/854] Later chapters a guide --- chapters/handling-data.tex | 1 + 1 file changed, 1 insertion(+) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 149711319..00be9111a 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -29,6 +29,7 @@ In this chapter, we outline a set of practices that help to ensure research participants are appropriately protected and research consumers can be confident in the conclusions reached. + Later chapters will provide more hands-on guides to implementing those practices. \end{fullwidth} From f074ea2333b8f9b863b473debbae11cab4ba36c5 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 12 Feb 2020 09:53:06 -0500 Subject: [PATCH 658/854] Accept suggestion Co-Authored-By: Luiza Andrade --- chapters/data-collection.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index f36acc681..6232d52d7 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -5,7 +5,7 @@ However, credible development research often depends, first and foremost, on the quality of the raw data. This is because, when you are collecting the data yourself, or it is provided only to you through a unique partnership, -there is no way for others to validate that it actually reflects the field reality +there is no way for others to validate that it accurately reflects the reality and that the indicators you have based your analysis on are meaningful. This chapter details the necessary components for a high-quality data acquisition process, no matter whether you are recieving large amounts of unique data from partners From 3cc4d1167b077c8cd459f650abb34f59ca3c9a8c Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 12 Feb 2020 09:53:28 -0500 Subject: [PATCH 659/854] Accept suggestion Co-Authored-By: Luiza Andrade --- chapters/publication.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index f9b088e84..33759c594 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -31,7 +31,7 @@ Hence, this step represents the conclusion of the system of transparent, reproducible, and credible research we introduced from the very first chapter of this book. -In this chapter, we first discuss tools and workflows for collaborating on technical writing. +We start the chapter with a discussion about tools and workflows for collaborating on technical writing. Next, we turn to publishing data, noting that the data can itself be a significant contribution in addition to analytical results. Finally, we provide guidelines that will help you to prepare a functioning and informative replication package. 
From 7b72e6812f93d6a3649bdd122c8dc0c531ef64da Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 12 Feb 2020 09:54:02 -0500 Subject: [PATCH 660/854] Accept suggestion Co-Authored-By: Luiza Andrade --- chapters/research-design.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 0a58ea534..a7852dbaf 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -5,7 +5,7 @@ that will be used to answer a specific research question. You don't need to be an expert in research design to do effective data work, but it is essential that you understand the design of the study you are working on, -and how the design affects data work. +and how it affects the data work. Without going into too much technical detail, as there are many excellent resources on impact evaluation design, this chapter presents a brief overview From a9d8df4237be76cc42b70550bd1746f73a68abaa Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Wed, 12 Feb 2020 12:26:40 -0500 Subject: [PATCH 661/854] [ch6] referring to previous chapter, small changes to broken raw data --- chapters/data-analysis.tex | 41 +++++++++++++++++++------------------- 1 file changed, 21 insertions(+), 20 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index c75613cd5..cd379a195 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -141,26 +141,27 @@ \subsection{Implementing version control} \section{De-identifying research data} -The starting point for all tasks described in this chapter is the raw data. -It should contain only materials that are received directly from the field. -They will invariably come in a host of file formats and nearly always contain personally-identifying information.\index{personally-identifying information} -These files should be retained in the raw data folder \textit{exactly as they were received}. -Be mindful of where they are stored. -Maintain a backup copy in a secure offsite location. -Every other file is created from the raw data, and therefore can be recreated. -The exception, of course, is the raw data itself, so it should never be edited directly. -The rare and only case when the raw data can be edited directly is when it is encoded incorrectly -and some non-English character is causing rows or columns to break at the wrong place -when the data is imported. -In this scenario, you will have to remove the special character manually, save the resulting data set \textit{in a new file} and securely back up \textit{both} the broken and the fixed version of the raw data. - -Note that no one who is not listed in the IRB should be able to access confidential data, -not even the company providing file-sharing services. -Check if your organization has guidelines on how to store data securely, as they may offer an institutional solution. -If that is not the case, you will need to encrypt the data, especially before -sharing it, and make sure that only IRB-listed team members have the -encryption key.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Encryption}}. -Secure storage of the raw data means access to it will be restricted even inside the research team.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Security}} +The starting point for all tasks described in this chapter is the raw data +which should contain only materials that are received directly from the field. 
+The raw data will invariably come in a host of file formats and these files +should be retained in the raw data folder \textit{exactly as they were +received}. Be mindful of how and where they are stored as they can not be +re-created and nearly always contain confidential data such as +personally-identifying information\index{personally-identifying information}. +As described in the previous chapter, confidential data must always be +encrypted\sidenote{\url{https://dimewiki.worldbank.org/wiki/Encryption}} and be +properly backed up since every other data file you will use is created from the +raw data. The only data sets that can not be re-created are the raw data +themselves. + +The raw data sets should never be edited directly. This is true even in the +rare case when the raw data cannot be opened due to incorrect encoding and +non-English character is causing rows or columns to break at the wrong place +when the data is imported. In this scenario, you should create a copy of +the raw data where you manually remove the special characters and securely back +up \textit{both} the broken and the fixed copy of the raw data. You will only +keep working from fixed copy, but you keep both copies in case you later +realize that the manual fix was done incorrectly. Loading encrypted data frequently can be disruptive to the workflow. To facilitate the handling of the data, remove any personally identifiable information from the data set. From 216a6c59b97f46e6d520f4016a1a82e56d25bb7c Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Wed, 12 Feb 2020 12:27:26 -0500 Subject: [PATCH 662/854] [ch6] data file instead of data set here --- chapters/data-analysis.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index cd379a195..4083e6822 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -154,7 +154,7 @@ \section{De-identifying research data} raw data. The only data sets that can not be re-created are the raw data themselves. -The raw data sets should never be edited directly. This is true even in the +The raw data files should never be edited directly. This is true even in the rare case when the raw data cannot be opened due to incorrect encoding and non-English character is causing rows or columns to break at the wrong place when the data is imported. In this scenario, you should create a copy of From 9ad9323382acd592ca7d5da02759302eafd19f63 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Wed, 12 Feb 2020 12:29:52 -0500 Subject: [PATCH 663/854] [ch6] indicate it is an example of this, and missing "the" --- chapters/data-analysis.tex | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 4083e6822..5f2ca6398 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -155,13 +155,13 @@ \section{De-identifying research data} themselves. The raw data files should never be edited directly. This is true even in the -rare case when the raw data cannot be opened due to incorrect encoding and -non-English character is causing rows or columns to break at the wrong place -when the data is imported. In this scenario, you should create a copy of -the raw data where you manually remove the special characters and securely back -up \textit{both} the broken and the fixed copy of the raw data. 
You will only -keep working from fixed copy, but you keep both copies in case you later -realize that the manual fix was done incorrectly. +rare case when the raw data cannot be opened due to, for example, incorrect +encoding where non-English character is causing rows or columns to break at the +wrong place when the data is imported. In this scenario, you should create a +copy of the raw data where you manually remove the special characters and +securely back up \textit{both} the broken and the fixed copy of the raw data. +You will only keep working from the fixed copy, but you keep both copies in +case you later realize that the manual fix was done incorrectly. Loading encrypted data frequently can be disruptive to the workflow. To facilitate the handling of the data, remove any personally identifiable information from the data set. From c5ce3a70f7b91b85f1b62255930c7a42b23283e7 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Wed, 12 Feb 2020 12:45:21 -0500 Subject: [PATCH 664/854] [ch6] fix \cite{} not \sidenote{\cite{}} #312 --- chapters/data-analysis.tex | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 5f2ca6398..b347dff1e 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -270,7 +270,8 @@ \subsection{Labeling, annotating, and finalizing clean data} and be accompanied by a dictionary or codebook. Typically, one cleaned data set will be created for each data source or survey instrument. -Each row in the cleaned data set represents one survey entry or unit of observation.\sidenote{\cite{tidy-data}} +Each row in the cleaned data set represents one survey entry or unit of +observation.\cite{tidy-data} If the raw data set is very large, or the survey instrument is very complex, you may want to break the data cleaning into sub-steps, and create intermediate cleaned data sets From d6c456cb6529f159efbb025ba6f5726dfea33fd9 Mon Sep 17 00:00:00 2001 From: Luiza Andrade Date: Wed, 12 Feb 2020 13:17:13 -0500 Subject: [PATCH 665/854] Update chapters/data-analysis.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- chapters/data-analysis.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index b347dff1e..1d23ffd00 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -143,7 +143,7 @@ \section{De-identifying research data} The starting point for all tasks described in this chapter is the raw data which should contain only materials that are received directly from the field. -The raw data will invariably come in a host of file formats and these files +The raw data will invariably come in a variety of file formats and these files should be retained in the raw data folder \textit{exactly as they were received}. 
Be mindful of how and where they are stored as they can not be
 re-created and nearly always contain confidential data such as

From 917a55f05e8cc932eb86cf5d5c9c25af85c75920 Mon Sep 17 00:00:00 2001
From: Luiza Andrade
Date: Wed, 12 Feb 2020 14:24:35 -0500
Subject: [PATCH 666/854] Update chapters/data-analysis.tex
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Co-Authored-By: Kristoffer Bjärkefur

---
 chapters/data-analysis.tex | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex
index 1d23ffd00..60bec2422 100644
--- a/chapters/data-analysis.tex
+++ b/chapters/data-analysis.tex
@@ -172,7 +172,7 @@ \section{De-identifying research data}
 find all the variables that contain identifying information.
 Flagging all potentially identifying variables in the questionnaire design stage
 simplifies the initial de-identification process.
-If you haven't done that, that are a few tools that can help you with it.
+If you did not do that or you received the raw data from someone else, there are a few tools that can help you with it.
 JPAL's \texttt{PII scan}, as indicated by its name,
 scans variable names and labels for common string patterns associated with identifying information.\sidenote{
 \url{https://github.com/J-PAL/PII-Scan}}

From 96e2709df3eea881c39d04b5d723aeac8e87714b Mon Sep 17 00:00:00 2001
From: Benjamin Daniels
Date: Wed, 12 Feb 2020 14:50:25 -0500
Subject: [PATCH 667/854] Roadmap for Ch2

---
 chapters/planning-data-work.tex | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex
index 0a2a76d55..0774564b8 100644
--- a/chapters/planning-data-work.tex
+++ b/chapters/planning-data-work.tex
@@ -25,6 +25,11 @@
 this makes working together on outputs much easier from the very first discussion.
 This chapter will guide you on preparing a collaborative work environment,
 and structuring your data work to be well-organized and clearly documented.
+It outlines how to set up your working environment
+and prepare to collaborate on technical tasks with others,
+as well as how to document tasks and decisions.
+It then discusses how to keep code, data, and outputs organized so that others
+will be able to locate and work with materials easily.
 \end{fullwidth}


From f7e572ac3894a89eec2eef340868a33919c422e0 Mon Sep 17 00:00:00 2001
From: kbjarkefur
Date: Wed, 12 Feb 2020 18:21:39 -0500
Subject: [PATCH 668/854] [ch6] two questions for PII vars

---
 chapters/data-analysis.tex | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex
index 60bec2422..8be028b40 100644
--- a/chapters/data-analysis.tex
+++ b/chapters/data-analysis.tex
@@ -186,28 +186,30 @@ \section{De-identifying research data}
 Once you have a list of variables that contain PII,
-assess them against the analysis plan and ask:
-will this variable be needed for analysis?
+assess them against the analysis plan and first ask yourself for each variable:
+\textit{will this variable be needed for the analysis?}
 If not, the variable should be dropped.
 Don't be afraid to drop too many variables the first time,
 as you can always go back and remove variables from the list of variables to be dropped,
 but you can not go back in time and drop a PII variable that was leaked because it was incorrectly kept.
Examples include respondent names, enumerator names, interview date, respondent phone number. -If the variable is needed for analysis, ask: -can I encode or otherwise construct a variable to use for the analysis that masks the PII, -and drop the original variable? +For each PII variable that is needed in the analysis, ask yourself: +\textit{can I encode or otherwise construct a variable that masks the PII, and +then drop this variable?} This is typically the case for most identifying information. Examples include geocoordinates (after constructing measures of distance or area, drop the specific location), and names for social network analysis (can be encoded to secret and unique IDs). -If PII variables are strictly required for the analysis itself, -it will be necessary to keep at least a subset of the data encrypted through the data analysis process. -If the answer is yes to either of these questions, +If the answer to either of two questions above is yes, all you need to do is write a script to drop the variables that are not required for analysis, encode or otherwise mask those that are required, and save a working version of the data. +If PII variables are strictly required for the analysis itself and can not be +masked or encoded, +it will be necessary to keep at least a subset of the data encrypted through +the data analysis process. The resulting de-identified data will be the underlying source for all cleaned and constructed data. This is the data set that you will interact with directly during the remaining tasks described in this chapter. From 17a163f974edf50c4966068a41f37fadbb75e460 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 13 Feb 2020 18:24:02 -0500 Subject: [PATCH 669/854] [stata app] typos --- appendix/stata-guide.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index bcd0586ac..ae6a0997c 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -24,7 +24,7 @@ This appendix begins with a short section containing instructions on how to access and use the code examples shared in this book. The second section contains the DIME Analytics style guide for Stata code. -we believe these resources can help anyone write more understandable code, +We believe these resources can help anyone write more understandable code, no matter how proficient they are in writing Stata code. Widely accepted and used style guides are common in most programming languages, and we think that using such a style guide greatly improves the quality @@ -325,7 +325,7 @@ \subsection{Writing file paths} \textbf{Absolute} means that all file paths start at the root folder of the computer, often \texttt{C:/} on a PC or \texttt{/Users/} on a Mac. -This makes ensures that you always get the correct file in the correct folder. +This ensures that you always get the correct file in the correct folder. \textbf{Do not use \texttt{cd} unless there is a function that \textit{requires} it.} When using \texttt{cd}, it is easy to overwrite a file in another project folder. Many Stata functions use \texttt{cd} and therefore the current directory may change without warning. 
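The patch above touches the guidance on absolute, dynamic file paths and on avoiding cd. As a minimal sketch of that rule, and not one of the book's own code examples, the root path and file names below are made up, and only the projectFolder global (assumed to be defined once in a master do-file) would change across machines:

    * Set the project root once, in the master do-file (illustrative path)
    global projectFolder "C:/Users/yourname/Documents/myProject"

    * Other do-files can then read and write with absolute, dynamic paths,
    * without ever changing the working directory with cd
    use  "${projectFolder}/DataWork/raw-survey.dta" , clear
    save "${projectFolder}/DataWork/raw-survey-backup.dta" , replace

Because every path is built from the one global, the same do-file runs unchanged on any collaborator's machine.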
From f3dfb2f43f710d5ddb3991d6208037b67859062d Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 13 Feb 2020 18:24:46 -0500 Subject: [PATCH 670/854] [stata app] no indentation after code example --- appendix/stata-guide.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index ae6a0997c..4bc0ae576 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -247,7 +247,7 @@ \subsection{Using whitespace} \codeexample{stata-whitespace-columns.do}{./code/stata-whitespace-columns.do} -Indentation is another type of whitespace that makes code more readable. +\noindent Indentation is another type of whitespace that makes code more readable. Any segment of code that is repeated in a loop or conditional on an \texttt{if}-statement should have indentation of 4 spaces relative to both the loop or conditional statement as well as the closing curly brace. @@ -271,7 +271,7 @@ \subsection{Writing conditional expressions} \codeexample{stata-conditional-expressions1.do}{./code/stata-conditional-expressions1.do} -Use \texttt{if-else} statements when applicable +\noindent Use \texttt{if-else} statements when applicable even if you can express the same thing with two separate \texttt{if} statements. When using \texttt{if-else} statements you are communicating to anyone reading your code that the two cases are mutually exclusive, which makes your code more readable. From 265b165eabf8460bc8c089d82e0c8ff1e7b7e3ea Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 13 Feb 2020 18:25:19 -0500 Subject: [PATCH 671/854] [stata app] absolute and dynamic next to where they are explained --- appendix/stata-guide.tex | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index 4bc0ae576..d58bf8530 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -311,8 +311,7 @@ \subsection{Using macros} \subsection{Writing file paths} -All file paths should be absolute and dynamic, -should always be enclosed in double quotes, +All file paths should always be enclosed in double quotes, and should \textbf{always use forward slashes} for folder hierarchies (\texttt{/}). File names should be written in lower case with dashes (\texttt{my-file.dta}). Mac and Linux computers cannot read file paths with backslashes, @@ -323,7 +322,8 @@ \subsection{Writing file paths} if another file with the same name is created (even if there is a default file type). -\textbf{Absolute} means that all file paths start at the root folder of the computer, +File paths should also be absolute and dynamic. \textbf{Absolute} means that all +file paths start at the root folder of the computer, often \texttt{C:/} on a PC or \texttt{/Users/} on a Mac. This ensures that you always get the correct file in the correct folder. \textbf{Do not use \texttt{cd} unless there is a function that \textit{requires} it.} From 573548519d12388cd11942059eb53cf6db2a73a2 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 13 Feb 2020 18:38:43 -0500 Subject: [PATCH 672/854] [stata app] 3 is several rather than many --- appendix/stata-guide.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index d58bf8530..3747dc813 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -40,7 +40,7 @@ \section{Using the code examples in this book} -You can access the raw code used in examples in this book in many ways. 
+You can access the raw code used in examples in this book in several ways. We use GitHub to version control everything in this book, the code included. To see the code on GitHub, go to: \url{https://github.com/worldbank/d4di/tree/master/code}. If you are familiar with GitHub you can fork the repository and clone your fork. From c07f3ae10d284ea35c43956359e2efa04581c046 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 13 Feb 2020 18:40:16 -0500 Subject: [PATCH 673/854] [stata app] i think "alerady" should be in this --- appendix/stata-guide.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index 3747dc813..cc48a9567 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -46,7 +46,7 @@ \section{Using the code examples in this book} If you are familiar with GitHub you can fork the repository and clone your fork. We only use Stata's built-in datasets in our code examples, so you do not need to download any data from anywhere. -If you have Stata installed on your computer, then you will have the data files used in the code. +If you have Stata installed on your computer, then you will already have the data files used in the code. A less technical way to access the code is to click the individual file in the URL above, then click the button that says \textbf{Raw}. You will then get to a page that looks like the one at: From 922e786c04bba5fb67ae14edf0373ad8c7049a3a Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 13 Feb 2020 18:43:29 -0500 Subject: [PATCH 674/854] [stata app] side note url to SSC --- appendix/stata-guide.tex | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index cc48a9567..163f60802 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -76,7 +76,8 @@ \subsection{Understanding Stata code} and you will not be able to read their help files until you have installed the commands. Two examples of these in our code are \texttt{randtreat} or \texttt{ieboilstart}. The most common place to distribute user-written commands for Stata -is the Boston College Statistical Software Components (SSC) archive. +is the Boston College Statistical Software Components (SSC) archive.\sidenote{ +\url{https://ideas.repec.org/s/boc/bocode.html}} In our code examples, we only use either Stata's built-in commands or commands available from the SSC archive. So, if your installation of Stata does not recognize a command in our code, for example From a3d76b03c66af55d856f99ca1260b883ef5f0f72 Mon Sep 17 00:00:00 2001 From: Luiza Andrade Date: Thu, 13 Feb 2020 20:22:32 -0500 Subject: [PATCH 675/854] Update chapters/data-analysis.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- chapters/data-analysis.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 8be028b40..10c9e5c30 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -142,7 +142,7 @@ \subsection{Implementing version control} \section{De-identifying research data} The starting point for all tasks described in this chapter is the raw data -which should contain only materials that are received directly from the field. +which should contain only information that are received directly from the field. 
The raw data will invariably come in a variety of file formats and these files should be retained in the raw data folder \textit{exactly as they were received}. Be mindful of how and where they are stored as they can not be From c9c9a0896843499aca1b8fbd5d76ec8e5c93092c Mon Sep 17 00:00:00 2001 From: Luiza Date: Thu, 13 Feb 2020 21:37:03 -0500 Subject: [PATCH 676/854] [ch6] sentence clarification --- chapters/data-analysis.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 8be028b40..c42dee017 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -214,7 +214,7 @@ \section{De-identifying research data} The resulting de-identified data will be the underlying source for all cleaned and constructed data. This is the data set that you will interact with directly during the remaining tasks described in this chapter. Because identifying information is typically only used during data collection, -to find and confirm the identity of interviewees, +when teams need to find and confirm the identity of interviewees, de-identification should not affect the usability of the data. From f608e678f9d576f601373b6413cc3daf0eb889fd Mon Sep 17 00:00:00 2001 From: Luiza Andrade Date: Thu, 13 Feb 2020 21:41:22 -0500 Subject: [PATCH 677/854] Update chapters/data-analysis.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- chapters/data-analysis.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 8215e2560..e655e381d 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -144,7 +144,7 @@ \section{De-identifying research data} The starting point for all tasks described in this chapter is the raw data which should contain only information that are received directly from the field. The raw data will invariably come in a variety of file formats and these files -should be retained in the raw data folder \textit{exactly as they were +should be saved in the raw data folder \textit{exactly as they were received}. Be mindful of how and where they are stored as they can not be re-created and nearly always contain confidential data such as personally-identifying information\index{personally-identifying information}. From 097ae3e6767f952f7bdd1e663c5017f2ec306b54 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 14 Feb 2020 15:17:22 -0500 Subject: [PATCH 678/854] Update appendix/stata-guide.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- appendix/stata-guide.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index 163f60802..a6a42e108 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -428,7 +428,7 @@ \subsection{Miscellaneous notes} \texttt{ - z ///} -\texttt{ + a*(b-c)} +\texttt{ + a * (b - c)} \noindent Make sure your code doesn't print very much to the results window as this is slow. This can be accomplished by using \texttt{run file.do} rather than \texttt{do file.do}. 
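To make the spacing convention settled in the patch above concrete, a hedged sketch of the same long expression as it might appear in a do-file follows; the variables x, y, z, a, b, and c are placeholders and would need to exist in the data:

    * + and - start each continuation line, while * and / stay inline
    generate sumvar = x      ///
        + y                  ///
        - z                  ///
        + a * (b - c)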
From 8b6a182ef20eca1e7548dfe237c473b4cefe75f4 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 14 Feb 2020 15:17:59 -0500 Subject: [PATCH 679/854] Update appendix/stata-guide.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- appendix/stata-guide.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index a6a42e108..8467f6c1b 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -9,7 +9,7 @@ their first years after graduating. Recent Masters' program graduates that have joined our team tended to have very good knowledge in the theory of our -trade, but also to require a lot of training in its practical skills. +trade, but have required a lot of training in its practical skills. To us, this is like graduating in architecture having learned how to sketch, describe, and discuss the concepts and requirements of a new building very well, From ff53b9938d580fed51733f8c1b1ad9743a6fab4a Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 14 Feb 2020 15:18:13 -0500 Subject: [PATCH 680/854] Update appendix/stata-guide.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- appendix/stata-guide.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index 8467f6c1b..7487e91e3 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -295,7 +295,7 @@ \subsection{Using macros} All globals should be referenced using both the the dollar sign and curly brackets around their name (\texttt{\$\{\}}); otherwise, they can cause readability issues when the endpoint of the macro name is unclear. -You should use descriptive names for all macros (up to 32 characters; prefer fewer). +You should use descriptive names for all macros (up to 32 characters; preferably fewer). Simple prefixes are useful and encouraged such as \texttt{thisParam}, \texttt{allParams}, \texttt{theLastParam}, \texttt{allParams}, or \texttt{nParams}. There are several naming conventions you can use for macros with long or multi-word names. From b31415985cc805d2968f0307546286cbf7739aca Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 14 Feb 2020 15:18:26 -0500 Subject: [PATCH 681/854] Update appendix/stata-guide.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- appendix/stata-guide.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index 7487e91e3..39a7c04dd 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -313,7 +313,7 @@ \subsection{Using macros} \subsection{Writing file paths} All file paths should always be enclosed in double quotes, -and should \textbf{always use forward slashes} for folder hierarchies (\texttt{/}). +and should always use forward slashes for folder hierarchies (\texttt{/}). File names should be written in lower case with dashes (\texttt{my-file.dta}). Mac and Linux computers cannot read file paths with backslashes, and backslashes cannot be removed with find-and-replace. 
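The file path rules edited in the hunk above (double quotes, forward slashes, lower-case file names with dashes, and an explicit file extension) could be sketched in the GOOD/BAD style used by the book's .do examples. The folder names are illustrative, and the myProject global is assumed to be set in a master do-file, as in the book's other examples:

    GOOD:
        use "${myProject}/raw-data/household-survey.dta" , clear

    BAD:
        use ${myProject}\rawdata\Household_Survey , clear

The second version breaks on Mac and Linux because of the backslashes, omits the double quotes and the file extension, and does not follow the lower-case-with-dashes naming rule.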
From 9f83ca2461baaa7489f562122cbdffd192ce545d Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 14 Feb 2020 15:18:57 -0500 Subject: [PATCH 682/854] Update appendix/stata-guide.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- appendix/stata-guide.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index 39a7c04dd..d53968d4f 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -351,7 +351,7 @@ \subsection{Line breaks} A common line breaking length is around 80 characters. Stata and other code editors provide a visible ``guide line''. Around that length, start a new line using \texttt{///}. -You can and should write comments after \texttt{///} just as with \texttt{//}. +You can write comments after \texttt{///} just as with \texttt{//}, and that is usually a good thing. (The \texttt{\#delimit} command is only acceptable for advanced function programming and is officially discouraged in analytical code.\cite{cox2005styleguide} Never, for any reason, use \texttt{/* */} to wrap a line.) From d911ad2352ed8c2c25e7a5d8f176da53e1a3a5bd Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 14 Feb 2020 15:19:08 -0500 Subject: [PATCH 683/854] Update appendix/stata-guide.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- appendix/stata-guide.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index d53968d4f..f5e5673f1 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -352,7 +352,7 @@ \subsection{Line breaks} Stata and other code editors provide a visible ``guide line''. Around that length, start a new line using \texttt{///}. You can write comments after \texttt{///} just as with \texttt{//}, and that is usually a good thing. -(The \texttt{\#delimit} command is only acceptable for advanced function programming +(The \texttt{\#delimit} command should only be used for advanced function programming and is officially discouraged in analytical code.\cite{cox2005styleguide} Never, for any reason, use \texttt{/* */} to wrap a line.) Using \texttt{///} breaks the line in the code editor, From f49079886d6be8adb3acfd6c440bbd433277bf09 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Fri, 14 Feb 2020 15:19:19 -0500 Subject: [PATCH 684/854] Update appendix/stata-guide.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- appendix/stata-guide.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index f5e5673f1..821faac8e 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -349,7 +349,7 @@ \subsection{Line breaks} Long lines of code are difficult to read if you have to scroll left and right to see the full line of code. When your line of code is wider than text on a regular paper, you should introduce a line break. A common line breaking length is around 80 characters. -Stata and other code editors provide a visible ``guide line''. +Stata's do-file editor and other code editors provide a visible ``guide line''. Around that length, start a new line using \texttt{///}. You can write comments after \texttt{///} just as with \texttt{//}, and that is usually a good thing. 
(The \texttt{\#delimit} command should only be used for advanced function programming From 78d33972ec9713f6f25bb1f7f1f37a3db32aa49f Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 17 Feb 2020 16:11:26 -0500 Subject: [PATCH 685/854] Intro cleanup --- appendix/stata-guide.tex | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index 821faac8e..a4cbe76fa 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -7,18 +7,18 @@ spend a disproportionately small amount of time teaching their students coding skills in relation to the share of their professional time they will spend writing code their first years after graduating. -Recent Masters' program graduates that have joined our team -tended to have very good knowledge in the theory of our -trade, but have required a lot of training in its practical skills. -To us, this is like graduating in architecture having learned +Recent masters-level graduates that have joined our team +tended to have very good theoretical knowledge, +but have required a lot of training in practical skills. +To us, this is like an architecture graduate having learned how to sketch, describe, and discuss -the concepts and requirements of a new building very well, -but without having the technical skill-set -to actually contribute to a blueprint following professional standards -that can be used and understood by other professionals during construction. -The reasons for this are probably a topic for another book, +the concepts and requirements of a new building very well -- +but without having the technical skills +to contribute to a blueprint following professional standards +that can be used and understood by other professionals. +The reasons for this are a topic for another book, but in today's data-driven world, -people working in quantitative economics research must be proficient programmers, +people working in quantitative development research must be proficient collaborative programmers, and that includes more than being able to compute the correct numbers. This appendix begins with a short section containing instructions @@ -29,8 +29,8 @@ Widely accepted and used style guides are common in most programming languages, and we think that using such a style guide greatly improves the quality of research projects coded in Stata. -We hope that this guide can help to increase the emphasis -given to using, improving, sharing and standardizing code style among the Stata community. +We hope that this guide can help increase the emphasis +given to using, improving, sharing, and standardizing code style among the Stata community. Style guides are the most important tool in how you, like an architect, can draw a blueprint that can be understood and used by everyone in your trade. 
From 9c11f9d922b6ec4d02f70ac3c7439e9b3f14ccef Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 17 Feb 2020 16:26:04 -0500 Subject: [PATCH 686/854] Style guide explanation cleanup --- appendix/stata-guide.tex | 34 +++++++++++++++++++++++----------- 1 file changed, 23 insertions(+), 11 deletions(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index a4cbe76fa..fa5b3b8f7 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -60,17 +60,20 @@ \section{Using the code examples in this book} \subsection{Understanding Stata code} -Regardless of being new to Stata or having used it for decades, you will always run into commands that -you have not seen before or whose purpose you do not remember. -Every time that happens, you should always look that command up in the help file. -For some reason, we often encounter the conception that help files are only for beginners. +Whether you are new to Stata or have used it for decades, +you will always run into commands that +you have not seen before or whose function you do not remember. +Every time that happens, +you should always look up the help file for that command. +We often encounter the conception that help files are only for beginners. We could not disagree with that conception more, as the only way to get better at Stata is to constantly read help files. So if there is a command that you do not understand in any of our code examples, for example \texttt{isid}, then write \texttt{help isid}, and the help file for the command \texttt{isid} will open. - -We cannot emphasize enough how important we think it is that you get into the habit of reading help files. +We cannot emphasize enough how important it is +that you get into the habit of reading help files. +Most of us have a help file window open at all times. Sometimes, you will encounter code that employs user-written commands, and you will not be able to read their help files until you have installed the commands. @@ -97,7 +100,8 @@ \subsection{Understanding Stata code} people's work that has been made publicly available, and once you get used to installing commands like this it will not be confusing at all. All code with user-written commands, furthermore, is best written when it installs such commands -at the beginning of the master do-file, so that the user does not have to search for packages manually. +at the beginning of the master do-file, +so that the user does not have to search for packages manually. \subsection{Why we use a Stata style guide} @@ -105,13 +109,21 @@ \subsection{Why we use a Stata style guide} Sometimes they are official guides that are universally agreed upon, such as PEP8 for Python.\sidenote{\url{https://www.python.org/dev/peps/pep-0008/}} More commonly, there are well-recognized but non-official style guides like the JavaScript Standard Style\sidenote{\url{https://standardjs.com/\#the-rules}} for -JavaScript or Hadley Wickham's\sidenote{\url{http://adv-r.had.co.nz/Style.html}} style guide for R. +JavaScript or Hadley Wickham's style guide for R.\sidenote{\url{http://adv-r.had.co.nz/Style.html}} +Google, for example, maintains style guides for all languages +that are used in its projects.\sidenote{ + \url{https://github.com/google/styleguide}} Aesthetics is an important part of style guides, but not the main point. -The existence of style guides improves the quality of the code in that language that is produced by all programmers in the community. 
-It is through a style guide that unexperienced programmers can learn from more experienced programmers +The important function is to allow programmers who are likely to work together +to share conventions and understandings of what the code is doing. +Style guides therefore help improve the quality of the code +in that language that is produced by all programmers in a community. +It is through a shared style that newer programmers can learn from more experienced programmers how certain coding practices are more or less error-prone. -Broadly-accepted style guides make it easier to borrow solutions from each other and from examples online without causing bugs that might only be found too late. +Broadly-accepted style conventions make it easier to borrow solutions +from each other and from examples online +without causing bugs that might only be found too late. Similarly, globally standardized style guides make it easier to solve each others' problems and to collaborate or move from project to project, and from team to team. From cc4e04b6e6d9af8e091aaf42c30a9564c50cf0ae Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 17 Feb 2020 16:29:03 -0500 Subject: [PATCH 687/854] Remove extra comments stuff --- appendix/stata-guide.tex | 17 ----------------- 1 file changed, 17 deletions(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index fa5b3b8f7..519049629 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -163,23 +163,6 @@ \subsection{Commenting code} unless the advanced method is a widely accepted one. There are three types of comments in Stata and they have different purposes: -\begin{enumerate} - \item \texttt{/*} - - \texttt{COMMENT} - - \texttt{*/} - - is used to insert narrative, multi-line comments at the beginning of files or sections. - \item \texttt{* COMMENT} - - \texttt{* POSSIBLY MORE COMMENT} - - indicates a change in task or a code sub-section and should be multi-line only if necessary. - \item \texttt{// COMMENT} - - is used for inline clarification after a single line of code. -\end{enumerate} \codeexample{stata-comments.do}{./code/stata-comments.do} From 6b5abc8f9083246763ff9c6863f55a2345960315 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 17 Feb 2020 16:59:36 -0500 Subject: [PATCH 688/854] Correct quotes for code --- appendix/stata-guide.tex | 22 ++++++++++++++++++---- 1 file changed, 18 insertions(+), 4 deletions(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index 519049629..927c595a1 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -171,6 +171,8 @@ \subsection{Abbreviating commands} Stata commands can often be abbreviated in the code. You can tell if a command can be abbreviated if the help file indicates an abbreviation by underlining part of the name in the syntax section at the top. Only built-in commands can be abbreviated; user-written commands cannot. +(Many commands additionally allow abbreviations of options: +these are always acceptable at the shortest allowed abbreviation.) Although Stata allows some commands to be abbreviated to one or two characters, this can be confusing -- two-letter abbreviations can rarely be ``pronounced'' in an obvious way that connects them to the functionality of the full command. @@ -178,7 +180,7 @@ \subsection{Abbreviating commands} with the exception of \texttt{tw} for \texttt{twoway} and \texttt{di} for \texttt{display}, and abbreviations should only be used when widely a accepted abbreviation exists. 
We do not abbreviate \texttt{local}, \texttt{global}, \texttt{save}, \texttt{merge}, \texttt{append}, or \texttt{sort}.
-Here is our non-exhaustive list of widely accepted abbreviations of common Stata commands.
+The following is a list of accepted abbreviations of common Stata commands:
 
 \begin{center}
 \begin{tabular}{ c | l }
@@ -211,6 +213,14 @@ \subsection{Abbreviating variables}
 \texttt{ieboilstart} executes the command \texttt{set varabbrev off} by default,
 and will therefore break any code using variable abbreviations.
 
+Using wildcards and lists in Stata for variable lists
+(\texttt{*}, \texttt{?}, and \texttt{-}) is also discouraged,
+because the functionality of the code may change
+if the dataset is changed or even simply reordered.
+If you intend explicitly to capture all variables of a certain type,
+prefer \texttt{unab} or \texttt{lookfor} to build that list in a local macro,
+which can then be checked to have the right variables in the right order.
+
 \subsection{Writing loops}
 In Stata examples and other code languages, it is common for the name of the local generated by \texttt{foreach} or \texttt{forvalues}
@@ -220,9 +230,11 @@ \subsection{Writing loops}
 for looping through \textbf{iterations} with \texttt{i};
 and for looping across matrices with \texttt{i}, \texttt{j}.
 Other typical index names are \texttt{obs} or \texttt{var} when looping over observations or variables, respectively.
-But since Stata does not have arrays, such abstract syntax should not be used in Stata code otherwise.
+But since Stata does not have arrays,
+such abstract syntax should not be used in Stata code otherwise.
 Instead, index names should describe what the code is looping over --
 for example household members, crops, or medicines.
+Even counters should be explicitly named.
 This makes code much more readable, particularly in nested loops.
 
 \codeexample{stata-loops.do}{./code/stata-loops.do}
 
@@ -261,8 +273,10 @@ \subsection{Writing conditional expressions}
 All conditional (true/false) expressions should be within at least one set of parentheses.
 The negation of logical expressions should use bang (\texttt{!}) and not tilde (\texttt{\~}).
-Always use explicit truth checks (\texttt{if `value'==1}) rather than implicits (\texttt{if `value'}).
-Always use the \texttt{missing(`var')} function instead of arguments like (\texttt{if `var'<=.}),
+Always use explicit truth checks (\texttt{if \`{}value\textquotesingle==1})
+rather than implicits (\texttt{if \`{}value\textquotesingle}).
Always use the \texttt{missing(=\`{}var\textquotesingle)} function instead of arguments like (\texttt{if \`{}var\textquotesingle<=.}), -and always consider whether missing values will affect the evaluation conditional expressions. +and always consider whether missing values will affect the evaluation of conditional expressions. \codeexample{stata-conditional-expressions1.do}{./code/stata-conditional-expressions1.do} From 90818a9707935903b5e2d531a7d252d078fe62df Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 17 Feb 2020 17:02:11 -0500 Subject: [PATCH 690/854] Typo --- appendix/stata-guide.tex | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index 7deea3a5d..619e5dc27 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -275,9 +275,9 @@ \subsection{Writing conditional expressions} The negation of logical expressions should use bang (\texttt{!}) and not tilde (\texttt{\~}). Always use explicit truth checks (\texttt{if \`{}value\textquotesingle==1}) rather than implicits (\texttt{if \`{}value\textquotesingle}). -Always use the \texttt{missing(=\`{}var\textquotesingle)} function -instead of arguments like (\texttt{if \`{}var\textquotesingle<=.}), -and always consider whether missing values will affect the evaluation of conditional expressions. +Always use the \texttt{missing(\`{}var\textquotesingle)} function +instead of arguments like (\texttt{if \`{}var\textquotesingle<=.}). +Always consider whether missing values will affect the evaluation of conditional expressions. \codeexample{stata-conditional-expressions1.do}{./code/stata-conditional-expressions1.do} From 035f7914341ce7efa2411d48c13e58f104e51a07 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 17 Feb 2020 17:15:31 -0500 Subject: [PATCH 691/854] Fix spacing --- code/stata-conditional-expressions2.do | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/code/stata-conditional-expressions2.do b/code/stata-conditional-expressions2.do index 8d6476b40..bd3f6c8f7 100644 --- a/code/stata-conditional-expressions2.do +++ b/code/stata-conditional-expressions2.do @@ -1,17 +1,17 @@ GOOD: if (`sampleSize' <= 100) { - * do something + * do something } else { - * do something else + * do something else } BAD: if (`sampleSize' <= 100) { - * do something + * do something } if (`sampleSize' > 100) { - * do something else + * do something else } From e1d9badafb128ee3f05791cd42ca336996d08f36 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 17 Feb 2020 17:23:35 -0500 Subject: [PATCH 692/854] Simpler globals --- appendix/stata-guide.tex | 10 ++++++---- code/stata-macros.do | 11 ++--------- 2 files changed, 8 insertions(+), 13 deletions(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index 619e5dc27..1306d903a 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -305,17 +305,19 @@ \subsection{Using macros} otherwise, they can cause readability issues when the endpoint of the macro name is unclear. You should use descriptive names for all macros (up to 32 characters; preferably fewer). -Simple prefixes are useful and encouraged such as \texttt{thisParam}, \texttt{allParams}, -\texttt{theLastParam}, \texttt{allParams}, or \texttt{nParams}. There are several naming conventions you can use for macros with long or multi-word names. Which one you use is not as important as whether you and your team are consistent in how you name then. 
You can use all lower case (\texttt{mymacro}), underscores (\texttt{my\_macro}), or ``camel case'' (\texttt{myMacro}), as long as you are consistent. -Nested locals are also possible for a variety of reasons when looping. -Finally, if you need a macro to hold a literal macro name, +Simple prefixes are useful and encouraged such as \texttt{this_estimate} or \texttt{current_var}, +or (using texttt{camelCase}) \texttt{lastValue}, \texttt{allValues}, or \texttt{nValues}. +Nested locals (\texttt{if \`{}\`{}value\textquotesingle\textquotesingle}) +are also possible for a variety of reasons when looping, and should be indicated in comments. +If you need a macro to hold a literal macro name, it can be done using the backslash escape character; this causes the stored macro to be evaluated at the usage of the macro rather than at its creation. +This function should be used sparingly and commented extensively. \codeexample{stata-macros.do}{./code/stata-macros.do} diff --git a/code/stata-macros.do b/code/stata-macros.do index e421e75ae..c59047467 100644 --- a/code/stata-macros.do +++ b/code/stata-macros.do @@ -1,15 +1,8 @@ GOOD: global myGlobal = "A string global" - local myLocal1 = length("${myGlobal}") - local myLocal2 = "\${myGlobal}" - - display "${myGlobal}" - global myGlobal = "A different string" - - forvalues i = 1/2 { - display "`myLocal`i''" - } + local myLocal1 = length("${myGlobal}") + local myLocal2 = "${myGlobal}" BAD: From 627521b730fade830717a937dfb8bd1efe16f84e Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 17 Feb 2020 17:29:16 -0500 Subject: [PATCH 693/854] Update line breaks --- appendix/stata-guide.tex | 23 ++++++++++++++--------- code/stata-linebreak.do | 12 ++++++------ 2 files changed, 20 insertions(+), 15 deletions(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index 1306d903a..9fea0678c 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -327,18 +327,20 @@ \subsection{Writing file paths} and should always use forward slashes for folder hierarchies (\texttt{/}). File names should be written in lower case with dashes (\texttt{my-file.dta}). Mac and Linux computers cannot read file paths with backslashes, -and backslashes cannot be removed with find-and-replace. -File paths should also always include the file extension +and backslashes cannot easily be removed with find-and-replace +because the character has other functional uses in code. +File paths should always include the file extension (\texttt{.dta}, \texttt{.do}, \texttt{.csv}, etc.). Omitting the extension causes ambiguity if another file with the same name is created (even if there is a default file type). -File paths should also be absolute and dynamic. \textbf{Absolute} means that all +File paths should also be absolute and dynamic. +\textbf{Absolute} means that all file paths start at the root folder of the computer, often \texttt{C:/} on a PC or \texttt{/Users/} on a Mac. This ensures that you always get the correct file in the correct folder. -\textbf{Do not use \texttt{cd} unless there is a function that \textit{requires} it.} +Do not use \texttt{cd} unless there is a function that requires it. When using \texttt{cd}, it is easy to overwrite a file in another project folder. Many Stata functions use \texttt{cd} and therefore the current directory may change without warning. 
Relative file paths are common in many other programming languages, @@ -360,17 +362,20 @@ \subsection{Line breaks} Long lines of code are difficult to read if you have to scroll left and right to see the full line of code. When your line of code is wider than text on a regular paper, you should introduce a line break. A common line breaking length is around 80 characters. -Stata's do-file editor and other code editors provide a visible ``guide line''. +Stata's do-file editor and other code editors provide a visible guide line. Around that length, start a new line using \texttt{///}. -You can write comments after \texttt{///} just as with \texttt{//}, and that is usually a good thing. -(The \texttt{\#delimit} command should only be used for advanced function programming -and is officially discouraged in analytical code.\cite{cox2005styleguide} -Never, for any reason, use \texttt{/* */} to wrap a line.) Using \texttt{///} breaks the line in the code editor, while telling Stata that the same line of code continues on the next line. The \texttt{///} breaks do not need to be horizontally aligned in code, although you may prefer to if they have comments that read better aligned, since indentations should reflect that the command continues to a new line. +Break lines where it makes functional sense. +You can write comments after \texttt{///} just as with \texttt{//}, and that is usually a good thing. +(The \texttt{\#delimit} command should only be used for advanced function programming +and is officially discouraged in analytical code.\cite{cox2005styleguide} +Never, for any reason, use \texttt{/* */} to wrap a line: +it is distracting and difficult to follow compared to the use +of those characters to write regular comments. Line breaks and indentations may be used to highlight the placement of the \textbf{option comma} or other functional syntax in Stata commands. diff --git a/code/stata-linebreak.do b/code/stata-linebreak.do index a44886af8..7974974b7 100644 --- a/code/stata-linebreak.do +++ b/code/stata-linebreak.do @@ -1,10 +1,10 @@ GOOD: - graph hbar invil /// Proportion in village - if (priv == 1) /// Private facilities only - , over(statename, sort(1) descending) /// Order states by values - blabel(bar, format(%9.0f)) /// Label the bars - ylab(0 "0%" 25 "25%" 50 "50%" 75 "75%" 100 "100%") /// - ytit("Share of private primary care visits made in own village") + graph hbar invil /// Proportion in village + if (priv == 1) /// Private facilities only + , over(statename, sort(1) descending) /// Order states by values + blabel(bar, format(%9.0f)) /// Label the bars + ylab(0 "0%" 25 "25%" 50 "50%" 75 "75%" 100 "100%") /// + ytit("Share of private primary care visits made in own village") BAD: #delimit ; From 3e4607edf8133b2fe06115f1564461ce11afa2b8 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 17 Feb 2020 17:36:16 -0500 Subject: [PATCH 694/854] Stuff --- appendix/stata-guide.tex | 18 +++++++++++------- 1 file changed, 11 insertions(+), 7 deletions(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index 9fea0678c..21de6f1a7 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -309,7 +309,7 @@ \subsection{Using macros} Which one you use is not as important as whether you and your team are consistent in how you name then. You can use all lower case (\texttt{mymacro}), underscores (\texttt{my\_macro}), or ``camel case'' (\texttt{myMacro}), as long as you are consistent. 
-Simple prefixes are useful and encouraged such as \texttt{this_estimate} or \texttt{current_var}, +Simple prefixes are useful and encouraged such as \texttt{this\_estimate} or \texttt{current\_var}, or (using texttt{camelCase}) \texttt{lastValue}, \texttt{allValues}, or \texttt{nValues}. Nested locals (\texttt{if \`{}\`{}value\textquotesingle\textquotesingle}) are also possible for a variety of reasons when looping, and should be indicated in comments. @@ -383,16 +383,20 @@ \subsection{Line breaks} \subsection{Using boilerplate code} -Boilerplate code is a few lines of code that always comes at the top of the code file, -and its purpose is to harmonize settings across users running the same code to the greatest degree possible. There is no way in Stata to guarantee that any two installations of Stata +\textbf{Boilerplate} code is a few lines of code that always come at the top of the code file, +and its purpose is to harmonize settings across users +running the same code to the greatest degree possible. +There is no way in Stata to guarantee that any two installations will always run code in exactly the same way. In the vast majority of cases it does, but not always, -and boilerplate code can mitigate that risk (although not eliminate it). -We have developed a command that runs many commonly used boilerplate settings +and boilerplate code can mitigate that risk. +We have developed the \texttt{ieboilstart} command +to implement many commonly-used boilerplate settings that are optimized given your installation of Stata. It requires two lines of code to execute the \texttt{version} -setting that avoids differences in results due to different versions of Stata. -Among other things, it turns the \texttt{more} flag off so code never hangs; +setting, which avoids differences in results due to different versions of Stata. +Among other things, it turns the \texttt{more} flag off +so code never hangs while waiting to display more output; it turns \texttt{varabbrev} off so abbrevated variable names are rejected; and it maximizes the allowed memory usage and matrix size so that code is not rejected on other machines for violating system limits. From acef7a25fd8de35d4b12819127571064ec23e770 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Mon, 17 Feb 2020 17:47:40 -0500 Subject: [PATCH 695/854] Finish revising --- appendix/stata-guide.tex | 26 +++++++++++++------------- code/stata-before-saving.do | 3 +-- 2 files changed, 14 insertions(+), 15 deletions(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index 21de6f1a7..184b96ae0 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -433,27 +433,27 @@ \subsection{Saving data} \subsection{Miscellaneous notes} -Use wildcards in variable names (\texttt{xx\_?\_*\_xx}) sparingly, -as they may change results if the dataset changes. Write multiple graphs as \texttt{tw (xx)(xx)(xx)}, not \texttt{tw xx||xx||xx}. -Put spaces around each binary operator except \texttt{\^}. + +\bigskip\noindent In simple expressions, put spaces around each binary operator except \texttt{\^}. Therefore write \texttt{gen z = x + y} and \texttt{x\^}\texttt{2}. -When order of operations applies, use spacing and parentheses: -\texttt{hours + (minutes/60) + (seconds/3600)}, not \texttt{hours + minutes / 60 + seconds / 3600}. 
-For long expressions, the operator starts the new line, so: -\texttt{gen sumvar = x ///} +\bigskip\noindent When order of operations applies, you may adjust spacing and parentheses: write +\texttt{hours + (minutes/60) + (seconds/3600)}, not \texttt{hours + minutes / 60 + seconds / 3600}. +For long expressions, \texttt{+} and \texttt{-} operators should start the new line, +but \texttt{*} and \texttt{/} should be used inline. For example: -\texttt{ + y ///} +\texttt{gen newvar = x ///} -\texttt{ - z ///} +\texttt{ - (y/2) ///} \texttt{ + a * (b - c)} -\noindent Make sure your code doesn't print very much to the results window as this is slow. +\bigskip\noindent Make sure your code doesn't print very much to the results window as this is slow. This can be accomplished by using \texttt{run file.do} rather than \texttt{do file.do}. -Run outputs like \texttt{reg} using the \texttt{qui} prefix. -Never use interactive commands like \texttt{sum} or \texttt{tab} in dofiles, -unless they are combined with \texttt{qui} for the purpose of getting \texttt{r()}-statistics. +Therefore, it is faster to run outputs from commands like \texttt{reg} using the \texttt{qui} prefix. +Interactive commands like \texttt{sum} or \texttt{tab} should be used sparingly in dofiles, +unless they are for the purpose of getting \texttt{r()}-statistics. +In that case, consider using the \texttt{qui} prefix to prevent printing output. \mainmatter diff --git a/code/stata-before-saving.do b/code/stata-before-saving.do index f770187dd..49a268bac 100644 --- a/code/stata-before-saving.do +++ b/code/stata-before-saving.do @@ -17,6 +17,5 @@ * Save data - save "${myProject}/myDataFile.dta" , replace // The folder global is set in master do-file + save "${myProject}/myDataFile.dta" , replace // The folder global is set in master do-file saveold "${myProject}/myDataFile-13.dta" , replace v(13) // For others - use "${myProject}/myDataFile.dta" , clear // It is useful to be able to recall the data quickly From a8198d46af8858953898995e3cb53c32d67b5ed6 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 18 Feb 2020 09:15:51 -0500 Subject: [PATCH 696/854] No corollary --- appendix/stata-guide.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index 184b96ae0..8212ea74d 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -158,7 +158,7 @@ \subsection{Commenting code} It will also take you a much longer time to edit code you wrote in the past if you did not comment it well. So, comment a lot: do not only write \textit{what} your code is doing but also \textit{why} you wrote it like the way you did. -As a corollary, try to write simpler code that needs less explanation, +In general, try to write simpler code that needs less explanation, even if you could use an elegant and complex method in less space, unless the advanced method is a widely accepted one. From 262585957f223ee49d6e72565f8f859bd59316b6 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Tue, 18 Feb 2020 09:33:07 -0500 Subject: [PATCH 697/854] [ch6] remove extra output sentence --- chapters/data-analysis.tex | 1 - 1 file changed, 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 6a09a95e1..438ca1666 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -49,7 +49,6 @@ \section{Managing data effectively} folder structure, task breakdown, master scripts, and version control. 
A good folder structure organizes files so that any material can be found when needed. It reflects a task breakdown into steps with well-defined inputs, tasks, and outputs. -This breakdown is applied to code, data sets, and outputs. A master script connects folder structure and code. It is a one-file summary of your whole project. Finally, version histories and backups enable the team From 28cc3f469ffc3ad1c2e7d38e9f08c88185970dc7 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Tue, 18 Feb 2020 09:42:34 -0500 Subject: [PATCH 698/854] [ch6] version control code and outputs --- chapters/data-analysis.tex | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 438ca1666..c7bdf7844 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -136,9 +136,8 @@ \subsection{Implementing version control} including the addition and deletion of files. This way you can delete code you no longer need, and still recover it easily if you ever need to get back previous work. -Everything that can be version-controlled should be. -Both analysis results and data sets will change with the code. -Whenever possible, you should track have each of them with the code that created it. +The focus in version control is often code, but changes to analysis output should, when possible, be version controlled together with the edit to the code that caused the change. +This way you know what edit in the code led to what edit in the analysis. If you are writing code in Git or GitHub, you can output plain text files such as \texttt{.tex} tables and metadata saved in \texttt{.txt} or \texttt{.csv} to that directory. From 18cc092f80c6fb7f5c10e505c2393c229aeee15b Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Kristoffer=20Bj=C3=A4rkefur?= Date: Tue, 18 Feb 2020 09:47:24 -0500 Subject: [PATCH 699/854] [ch6] language Co-Authored-By: Luiza Andrade --- chapters/data-analysis.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index c7bdf7844..5b83369ea 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -136,7 +136,7 @@ \subsection{Implementing version control} including the addition and deletion of files. This way you can delete code you no longer need, and still recover it easily if you ever need to get back previous work. -The focus in version control is often code, but changes to analysis output should, when possible, be version controlled together with the edit to the code that caused the change. +The focus in version control is often code, but changes to analysis outputs should, when possible, be version controlled together with the code edits. This way you know what edit in the code led to what edit in the analysis. 
If you are writing code in Git or GitHub, you can output plain text files such as \texttt{.tex} tables From 8fd69b0e4628053f834819929c5f25aaaebf4f10 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Kristoffer=20Bj=C3=A4rkefur?= Date: Tue, 18 Feb 2020 09:50:05 -0500 Subject: [PATCH 700/854] [ch6] language Co-Authored-By: Luiza Andrade --- chapters/data-analysis.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 5b83369ea..8e25203e0 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -137,7 +137,7 @@ \subsection{Implementing version control} This way you can delete code you no longer need, and still recover it easily if you ever need to get back previous work. The focus in version control is often code, but changes to analysis outputs should, when possible, be version controlled together with the code edits. -This way you know what edit in the code led to what edit in the analysis. +This way you know which edits in the code led to which changes in the outputs. If you are writing code in Git or GitHub, you can output plain text files such as \texttt{.tex} tables and metadata saved in \texttt{.txt} or \texttt{.csv} to that directory. From b4a69a9469543303ae213d36a116d0a5ced571e0 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Tue, 18 Feb 2020 10:03:20 -0500 Subject: [PATCH 701/854] [stata] agreed on whitespace rec --- code/stata-whitespace-columns.do | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/code/stata-whitespace-columns.do b/code/stata-whitespace-columns.do index adcb8eec3..6f799f1c1 100644 --- a/code/stata-whitespace-columns.do +++ b/code/stata-whitespace-columns.do @@ -1,17 +1,17 @@ ACCEPTABLE: * Create dummy for being employed - generate employed = 1 - replace employed = 0 if (_merge == 2) - lab var employed "Person exists in employment data" - lab def yesno 1 "Yes" 0 "No" - lab val employed yesno + gen employed = 1 + replace employed = 0 if (_merge == 2) + lab var employed "Person exists in employment data" + lab def yesno 1 "Yes" 0 "No" + lab val employed yesno BETTER: * Create dummy for being employed - generate employed = 1 - replace employed = 0 if (_merge == 2) - lab var employed "Person exists in employment data" - lab def yesno 1 "Yes" 0 "No" - lab val employed yesno + gen employed = 1 + replace employed = 0 if (_merge == 2) + lab var employed "Person exists in employment data" + lab def yesno 1 "Yes" 0 "No" + lab val employed yesno From 3a7e295f25c7b3666dab39f1a97c33e21fa7fb82 Mon Sep 17 00:00:00 2001 From: Luiza Andrade Date: Tue, 18 Feb 2020 10:06:12 -0500 Subject: [PATCH 702/854] Update appendix/stata-guide.tex --- appendix/stata-guide.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index 8212ea74d..4c3d9a56e 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -310,7 +310,7 @@ \subsection{Using macros} You can use all lower case (\texttt{mymacro}), underscores (\texttt{my\_macro}), or ``camel case'' (\texttt{myMacro}), as long as you are consistent. Simple prefixes are useful and encouraged such as \texttt{this\_estimate} or \texttt{current\_var}, -or (using texttt{camelCase}) \texttt{lastValue}, \texttt{allValues}, or \texttt{nValues}. +or (using \texttt{camelCase}) \texttt{lastValue}, \texttt{allValues}, or \texttt{nValues}. 
Nested locals (\texttt{if \`{}\`{}value\textquotesingle\textquotesingle}) are also possible for a variety of reasons when looping, and should be indicated in comments. If you need a macro to hold a literal macro name, From 4e234888371fb846e0f3315cc1fdcceaa88aa7e2 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Tue, 18 Feb 2020 10:13:50 -0500 Subject: [PATCH 703/854] [stata] agreed on macros examples --- code/stata-macros.do | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/code/stata-macros.do b/code/stata-macros.do index c59047467..b6889c98c 100644 --- a/code/stata-macros.do +++ b/code/stata-macros.do @@ -1,10 +1,9 @@ GOOD: - global myGlobal = "A string global" - local myLocal1 = length("${myGlobal}") - local myLocal2 = "${myGlobal}" + global myGlobal = "A string global" + display "${myGlobal}" BAD: - global myglobal "A string global" - local my_Local = length($myGlobal) + global my_Global = "A string global" // Do not mix naming styles + display "$myGlobal" // Always use ${} for globals From b54dd4dbbe3296da377ab42061318b11b27731e0 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 18 Feb 2020 15:04:43 -0500 Subject: [PATCH 704/854] Capitalize "Style Guide" in contents --- manuscript.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/manuscript.tex b/manuscript.tex index 63d892fb2..38e7c82bb 100644 --- a/manuscript.tex +++ b/manuscript.tex @@ -103,7 +103,7 @@ \chapter{Bringing it all together} % APPENDIX : Stata Style Guide %---------------------------------------------------------------------------------------- -\chapter{Appendix: The DIME Analytics Stata style guide} +\chapter{Appendix: The DIME Analytics Stata Style Guide} \label{ap:1} \input{appendix/stata-guide.tex} From 99616b6db8c5c1ce6388664c0040b3be4bfc6a22 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 18 Feb 2020 15:11:26 -0500 Subject: [PATCH 705/854] Introduction: combine redundant sections --- chapters/introduction.tex | 164 ++++++++++++++++++-------------------- 1 file changed, 76 insertions(+), 88 deletions(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index 8f1e55435..e6702a261 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -1,34 +1,33 @@ \begin{fullwidth} Welcome to \textit{Data for Development Impact}. -This book is intended to teach all users of development data -how to handle data effectively, efficiently, and ethically. -An empirical revolution has changed the face of research economics rapidly over the last decade. +This book is intended to teach all users of development data +how to handle data effectively, efficiently, and ethically. +An empirical revolution has changed the face of research economics rapidly over the last decade. %had to remove cite {\cite{angrist2017economic}} because of full page width -Today, especially in the development subfield, working with raw data -- -whether collected through surveys or acquired from ``big'' data sources like sensors, satellites, or call data records -- -is a key skill for researchers and their staff. -At the same time, the scope and scale of empirical research projects is expanding: -more people are working on the same data over longer timeframes. +Today, especially in the development subfield, working with raw data -- +whether collected through surveys or acquired from ``big'' data sources like sensors, satellites, or call data records -- +is a key skill for researchers and their staff. 
+At the same time, the scope and scale of empirical research projects is expanding: +more people are working on the same data over longer timeframes. As the ambition of development researchers grows, so too has the complexity of the data -on which they rely to make policy-relevant research conclusions. -Yet there are few guides to the conventions, standards, and best practices +on which they rely to make policy-relevant research conclusions. +Yet there are few guides to the conventions, standards, and best practices that are fast becoming a necessity for empirical research. -This book aims to fill that gap, providing guidance on how to handle data efficiently, transparently and collaboratively. +This book aims to fill that gap. -This book is targeted to everyone who interacts with development data: -graduate students, research assistants, policymakers, and empirical researchers. +This book is targeted to everyone who interacts with development data: +graduate students, research assistants, policymakers, and empirical researchers. It covers data workflows at all stages of the research process: design, data acquisition, and analysis. -This book is not sector-specific; it will not teach you econometrics, or how to design an impact evaluation. -There are many excellent existing resources on those topics. -Instead, this book will teach you how to think about all aspects of your research from a data perspective, -how to structure research projects to maximize data quality, -and how to institute transparent and reproducible workflows. +This book is not sector-specific; it will not teach you econometrics, or how to design an impact evaluation. +There are many excellent existing resources on those topics. +Instead, this book will teach you how to think about all aspects of your research from a data perspective, +how to structure research projects to maximize data quality, +and how to institute transparent and reproducible workflows. The central premise of this book is that data work is a ``social process'', in which many people need to have the same idea about what is to be done, and when and where and by whom, so that they can collaborate effectively on large, long-term research projects. -It aims to be a highly practical resource: we provide code snippets, links to checklists and other practical tools, -and references to primary resources that allow the reader to immediately put recommended processes into practice. - +It aims to be a highly practical resource: we provide code snippets, links to checklists and other practical tools, +and references to primary resources that allow the reader to immediately put recommended processes into practice. \end{fullwidth} @@ -43,51 +42,32 @@ \section{Doing credible research at scale} at the World Bank's \textbf{Development Economics (DEC) Vice Presidency}.\sidenote{ \url{https://www.worldbank.org/en/about/unit/unit-dec}} -DIME generates high-quality and operationally relevant data and research -to transform development policy, help reduce extreme poverty, and secure shared prosperity. -It develops customized data and evidence ecosystems to produce actionable information +DIME generates high-quality and operationally relevant data and research +to transform development policy, help reduce extreme poverty, and secure shared prosperity. +It develops customized data and evidence ecosystems to produce actionable information and recommend specific policy pathways to maximize impact. 
-DIME conducts research in 60 countries with 200 agencies, leveraging a -US\$180 million research budget to shape the design and implementation of -US\$18 billion in development finance. -DIME also provides advisory services to 30 multilateral and bilateral development agencies. -Finally, DIME invests in public goods (such as this book) to improve the quality and reproducibility of development research around the world. - -DIME Analytics was created to take advantage of the concentration and scale of research at DIME to develop and test solutions, -to ensure high quality of data collection and research across the DIME portfolio, -and to make public training and tools available to the larger community of development researchers. -\textit{Data for Development Impact} compiles the ideas, best practices and software tools Analytics -has developed while supporting DIME's global impact evaluation portfolio. +DIME conducts research in 60 countries with 200 agencies, leveraging a +US\$180 million research budget to shape the design and implementation of +US\$18 billion in development finance. +DIME also provides advisory services to 30 multilateral and bilateral development agencies. +Finally, DIME invests in public goods (such as this book) to improve the quality and reproducibility of development research around the world. + +DIME Analytics was created to take advantage of the concentration and scale of research at DIME to develop and test solutions, +to ensure high quality data collection and research across the DIME portfolio, +and to make training and tools publicly available to the larger community of development researchers. +\textit{Data for Development Impact} compiles the ideas, best practices and software tools Analytics +has developed while supporting DIME's global impact evaluation portfolio. + The \textbf{DIME Wiki} is one of our flagship products, a free online collection of our resources and best practices.\sidenote{ \url{http://dimewiki.worldbank.org/}} - -This book complements the DIME Wiki by providing a structured narrative of the data workflow for a typical research project. +This book complements the DIME Wiki by providing a structured narrative of the data workflow for a typical research project. We will not give a lot of highly specific details in this text, -but we will point you to where they can be found.\sidenote{Like this: +but we will point you to where they can be found.\sidenote{Like this: \url{https://dimewiki.worldbank.org/wiki/Primary_Data_Collection}} Each chapter focuses on one task, providing a primarily narrative account of: what you will be doing; where in the workflow this task falls; when it should be done; and how to implement it according to best practices. - - -\section{Outline of this book} -The book covers each stage of an empirical research project, from design to publication. -We start with ethical principles to guide empirical research, -focusing on research transparency and the right to privacy. -The second chapter discusses the importance of planning data work at the outset of the research project - -long before any data is acquired - and provide suggestions for collaborative workflows and tools. -Next, we turn to common research designs for -\textbf{causal inference}{\sidenote{causal inference: identifying the change in outcome -\textit{caused} by a particular intervention}}, and consider their implications for data structure. 
-The fourth chapter covers how to implement sampling and randomization to ensure research credibility, -and includes details on power calculation and randomization inference. -The fifth chapter provides guidance on high quality primary data collection, particularly for projects that use surveys. -The sixth chapter turns to data processing, -focusing on how to organize data work so that it is easy to code the desired analysis. -In the final chapter, we discuss publishing collaborative research- -both the research paper and the code and materials needed to recreate the results. - We will use broad terminology throughout this book to refer to research team members: \textbf{principal investigators (PIs)} who are responsible for the overall design and stewardship of the study; @@ -117,32 +97,31 @@ \section{Adopting reproducible workflows} \section{Writing reproducible code in a collaborative environment} -Throughout the book, we refer to the importance of good coding practices. +Throughout the book, we refer to the importance of good coding practices. These are the foundation of reproducible and credible data work, and a core part of the new data science of development research. Code today is no longer a means to an end (such as a research paper), rather it is part of the output itself: a means for communicating how something was done, in a world where the credibility and transparency of data cleaning and analysis is increasingly important. -As this is fundamental to the remainder of the book's content, +As this is fundamental to the remainder of the book's content, we provide here a brief introduction to ``good'' code and standardized practices. - ``Good'' code has two elements: \begin{itemize} -\item it is correct (doesn't produce any errors along the way) -\item it is useful and comprehensible to someone who hasn't seen it before (or even yourself a few weeks, months or years later) +\item It is correct (doesn't produce any errors along the way) +\item It is useful and comprehensible to someone who hasn't seen it before (or even yourself a few weeks, months or years later) \end{itemize} -Many researchers have been trained to code correctly. +Many researchers have been trained to code correctly. However, when your code runs on your computer and you get the correct results, you are only half-done writing \textit{good} code. Good code is easy to read and replicate, making it easier to spot mistakes. -Good code reduces noise due to sampling, randomization, and cleaning errors. +Good code reduces sampling, randomization, and cleaning errors. Good code can easily be reviewed by others before it's published and replicated afterwards. Process standardization means that there is little ambiguity about how something ought to be done, and therefore the tools to do it can be set in advance. Standard processes for code help other people to ready your code.\sidenote{ -\url{https://dimewiki.worldbank.org/wiki/Stata_Coding_Practices}} +\url{https://dimewiki.worldbank.org/wiki/Stata_Coding_Practices}} Code should be well-documented, contain extensive comments, and be readable in the sense that others can: (1) quickly understand what a portion of code is supposed to be doing; (2) evaluate whether or not it does that thing correctly; and @@ -163,59 +142,68 @@ \section{Writing reproducible code in a collaborative environment} it should not require arcane reverse-engineering to figure out what a code chunk is trying to do. 
\textbf{Style}, finally, is the way that the non-functional elements of your code convey its purpose. -Elements like spacing, indentation, and naming (or lack thereof) can make your code much more +Elements like spacing, indentation, and naming (or lack thereof) can make your code much more (or much less) accessible to someone who is reading it for the first time and needs to understand it quickly and correctly. For some implementation portions where precise code is particularly important, we will provide minimal code examples either in the book or on the DIME Wiki. -All code guidance is software-agnostic, but code examples are provided in Stata. +All code guidance is software-agnostic, but code examples are provided in Stata. In the book, code examples will be presented like the following: \codeexample{code.do}{./code/code.do} -We ensure that each code block runs independently, is well-formatted, +We ensure that each code block runs independently, is well-formatted, and uses built-in functions as much as possible. We will point to user-written functions when they provide important tools. In particular, we point to two suites of Stata commands developed by DIME Analytics, -\texttt{ietoolkit}\sidenote{\url{https://dimewiki.worldbank.org/wiki/ietoolkit}} and -\texttt{iefieldkit},\sidenote{\url{https://dimewiki.worldbank.org/wiki/iefieldkit}}, +\texttt{ietoolkit}\sidenote{\url{https://dimewiki.worldbank.org/wiki/ietoolkit}} and +\texttt{iefieldkit},\sidenote{\url{https://dimewiki.worldbank.org/wiki/iefieldkit}} which standardize our core data collection, management, and analysis workflows. -We will not explain Stata commands unless the command is rarely used -or the feature we are using is outside common use case of that command. We will comment the code generously (as you should), -but you should reference Stata help-files \texttt{h [command]} +but you should reference Stata help-files by writing \texttt{help [command]} whenever you do not understand the command that is being used. We hope that these snippets will provide a foundation for your code style. Providing some standardization to Stata code style is also a goal of this team; -we provide our guidance on this in the DIME Analytics Stata Style Guide Appendix. +we provide our guidance on this in the DIME Analytics Stata Style Guide in the Appendix. +\section{Outline of this book} -The book proceeds as follows: +This book covers each stage of an empirical research project, from design to publication. In Chapter 1, we outline a set of practices that help to ensure research participants are appropriately protected and research consumers can be confident in the conclusions reached. +We start with ethical principles to guide empirical research, +focusing on research transparency and the right to privacy. Chapter 2 will teach you to structure your data work to be efficient, collaborative and reproducible. +It discusses the importance of planning data work at the outset of the research project -- +long before any data is acquired -- and provides suggestions for collaborative workflows and tools. In Chapter 3, we turn to research design, -focusing specifically on how to measure treatment effects +focusing specifically on how to measure treatment effects and structure data for common experimental and quasi-experimental research methods. 
-Chapter 4 concerns sampling and randomization: -how to implement both simple and complex designs reproducibly, +We present outlines of common research designs for +causal inference, and consider their implications for data structure. +Chapter 4 concerns sampling and randomization: +how to implement both simple and complex designs reproducibly, and how to use power calculations and randomization inference -to critically and quantitatively assess +to critically and quantitatively assess sampling and randomization designs to make optimal choices when planning studies. + Chapter 5 covers data acquisition. We start with the legal and institutional frameworks for data ownership and licensing, dive in depth on collecting high-quality survey data, -and finally discuss secure data handling during transfer, sharing, and storage. +and finally discuss secure data handling during transfer, sharing, and storage. +It provides guidance on high-quality data collection +and handling for development projects. Chapter 6 teaches reproducible and transparent workflows for data processing and analysis, -and provides guidance on de-identification of personally-identified data. -In Chapter 7, we turn to publication. You will learn -how to effectively collaborate on technical writing, -how and why to publish data, -and guidelines for preparing functional and informative replication packages. -We hope that by the end of the book, -you will have learned how to handle data more efficiently, effectively and ethically -at all stages of the research process. +and provides guidance on de-identification of personally-identified data, +focusing on how to organize data work so that it is easy to code the desired analysis. +In Chapter 7, we turn to publication. You will learn +how to effectively collaborate on technical writing, +how and why to publish data, +and guidelines for preparing functional and informative replication packages. +We hope that by the end of the book, +you will have learned how to handle data more efficiently, effectively and ethically +at all stages of the research process. \mainmatter From 5c070632716322018874179d3e217fb11e02592b Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 18 Feb 2020 15:15:22 -0500 Subject: [PATCH 706/854] Chapter 1 small fixes --- chapters/handling-data.tex | 49 ++++++++++++++++++-------------------- 1 file changed, 23 insertions(+), 26 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 00be9111a..b4a247902 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -25,7 +25,7 @@ in the way that code and data are handled as part of research. Neither privacy nor transparency is an all-or-nothing objective: the most important thing is to report the transparency and privacy measures you have taken - and always strive to to the best that you are capable of with current technology. + and always strive to do the best that you are capable of with current technology. In this chapter, we outline a set of practices that help to ensure research participants are appropriately protected and research consumers can be confident in the conclusions reached. 
@@ -79,12 +79,9 @@ \subsection{Research reproducibility} based on the valuable work you have already done.\sidenote{ \url{https://blogs.worldbank.org/opendata/making-analytics-reusable}} Services that log your research process are valuable resources here -- -GitHub is one of many that can do so.\sidenote{ - \url{https://github.com}} - \index{GitHub} Such services can show things like modifications made in response to referee comments, by having tagged version histories at each major revision. -They also allow you to use issue trackers and abandoned work branches +They also allow you to use issue trackers to document the research paths and questions you may have tried to answer as a resource to others who have similar questions. @@ -117,7 +114,7 @@ \subsection{Research transparency} Expecting process transparency is also an incentive for researchers to make better decisions, be skeptical and thorough about their assumptions, and, as we hope to convince you, make the process easier for themselves, -because it requires methodical organization that is labor-saving and efficient over the complete course of a project. +because it requires methodical organization that is labor-saving over the complete course of a project. Tools like \textbf{pre-registration}\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Pre-Registration}}, @@ -133,8 +130,8 @@ \subsection{Research transparency} This is meant to combat the ``file-drawer problem'',\cite{simonsohn2014p} and ensure that researchers are transparent in the additional sense that all the results obtained from registered studies are actually published. -In no way should this be viewed as binding the hands of the researcher:\cite{olken2015promises} -anything outside the original plan is just as interesting and valuable +In no way should this be viewed as binding the hands of the researcher.\cite{olken2015promises} +Anything outside the original plan is just as interesting and valuable as it would have been if the the plan was never published; but having pre-committed to any particular inquiry makes its results immune to a wide range of criticisms of specification searching or multiple testing. @@ -158,8 +155,6 @@ \subsection{Research transparency} not a one-time requirement or retrospective task. New decisions are always being made as the plan begins contact with reality, and there is nothing wrong with sensible adaptation so long as it is recorded and disclosed. -(Email is \textit{not} a note-taking service, because communications are rarely well-ordered, -can be easily deleted, and are not available for future team members.) There are various software solutions for building documentation over time. The \textbf{Open Science Framework}\sidenote{\url{https://osf.io/}} provides one such solution,\index{Open Science Framework} @@ -175,6 +170,8 @@ \subsection{Research transparency} and the exact shape of this process can be molded to the team's needs, but it should be agreed upon prior to project launch. This way, you can start building a project's documentation as soon as you start making decisions. +Email, however, is \textit{not} a note-taking service, because communications are rarely well-ordered, +can be easily deleted, and are not available for future team members. 
\subsection{Research credibility} @@ -189,10 +186,10 @@ \subsection{Research credibility} all experimental and observational studies should be pre-registered simply to create a record of the fact that the study was undertaken.\sidenote{\url{http://datacolada.org/12}} This is increasingly required by publishers and can be done very quickly -using the \textbf{AEA} database\sidenote{\url{https://www.socialscienceregistry.org/}}, -the \textbf{3ie} database\sidenote{\url{http://ridie.3ieimpact.org/}}, -the \textbf{eGAP} database\sidenote{\url{http://egap.org/content/registration/}}, -or the \textbf{OSF} registry\sidenote{\url{https://osf.io/registries}} as appropriate. +using the \textbf{AEA} database,\sidenote{\url{https://www.socialscienceregistry.org/}} +the \textbf{3ie} database,\sidenote{\url{http://ridie.3ieimpact.org/}} +the \textbf{eGAP} database,\sidenote{\url{http://egap.org/content/registration/}} +or the \textbf{OSF} registry,\sidenote{\url{https://osf.io/registries}} as appropriate. \index{pre-registration} Common research standards from journals and funders feature both ex ante @@ -202,7 +199,7 @@ \subsection{Research credibility} and their quality meet some minimum standard. Ex post policies require that authors make certain materials available to the public, but their quality is not a direct condition for publication. -Still, others have suggested ``guidance'' policies that would offer checklists +Still others have suggested ``guidance'' policies that would offer checklists for which practices to adopt, such as reporting on whether and how various practices were implemented.\cite{nosek2015promoting} @@ -233,10 +230,10 @@ \section{Ensuring privacy and security in research data} Anytime you are collecting primary data in a development research project, you are almost certainly handling data that include \textbf{personally-identifying - information (PII)}\index{personally-identifying information}\index{primary data}\sidenote{ + information (PII)}.\index{personally-identifying information}\index{primary data}\sidenote{ \textbf{Personally-identifying information:} any piece or set of information that can be used to identify an individual research subject. - \url{https://dimewiki.worldbank.org/wiki/De-identification\#Personally\_Identifiable\_Information}}. + \url{https://dimewiki.worldbank.org/wiki/De-identification\#Personally\_Identifiable\_Information}} PII data contains information that can, without any transformation, be used to identify individual people, households, villages, or firms that were part of data collection. \index{data collection} @@ -274,7 +271,7 @@ \section{Ensuring privacy and security in research data} from recently advanced data rights and regulations, these considerations are critically important. Check with your organization if you have any legal questions; -in general, you are responsible for avoiding any action that +in general, you are responsible for any action that knowingly or recklessly ignores these considerations. \subsection{Obtaining ethical approval and consent} @@ -334,10 +331,10 @@ \subsection{Transmitting and storing data securely} need to be protected by strong and unique passwords. There are several services that create and store these passwords for you, and some provide utilities for sharing passwords with others -inside that secure environment if multiple users share accounts. +inside that secure environment. 
However, password-protection alone is not sufficient, because if the underlying data is obtained through a leak the information itself remains usable. -Data sets that confidential information +Data sets that include confidential information \textit{must} therefore be \textbf{encrypted}\sidenote{ \textbf{Encryption:} Methods which ensure that files are unreadable even if laptops are stolen, databases are hacked, or any other type of unauthorized access is obtained. \url{https://dimewiki.worldbank.org/wiki/encryption}} @@ -346,7 +343,7 @@ \subsection{Transmitting and storing data securely} To protect information in transit to field staff, some key steps are: (a) ensure that all devices that store confidential data have hard drive encryption and password-protection; -(b) never send confidential data over e-mail, WhatsApp, etc. +(b) never send confidential data over email, WhatsApp, or other chat services. without encrypting the information first; and (c) train all field staff on the adequate privacy standards applicable to their work. @@ -357,7 +354,7 @@ \subsection{Transmitting and storing data securely} although this usually needs to be actively enabled and administered.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Encryption\#Encryption\_at\_Rest}} When files are properly encrypted, -The information they contain will be completely unreadable and unusable +the information they contain will be completely unreadable and unusable even if they were to be intercepted my a malicious ``intruder'' or accidentally made public. When the proper data security precautions are taken, @@ -384,7 +381,7 @@ \subsection{Transmitting and storing data securely} Data security is important not only for identifying, but also sensitive information, especially when a worst-case scenario could potentially lead to re-identifying subjects. Extremely sensitive information may be required to be held in a ``cold'' machine -which does not have Internet access -- this is most often the case with +which does not have internet access -- this is most often the case with government records such as granular tax information. Each of these tools and requirements will vary in level of security and ease of use, and sticking to a standard practice will make your life much easier, @@ -424,7 +421,7 @@ \subsection{De-identifying and anonymizing information} Therefore, once data is securely collected and stored, the first thing you will generally do is \textbf{de-identify} it, -that is, to remove direct identifiers of the individuals in the dataset.\sidenote{ +that is, remove direct identifiers of the individuals in the dataset.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/De-identification}} \index{de-identification} Note, however, that it is in practice impossible to \textbf{anonymize} data. @@ -435,11 +432,11 @@ \subsection{De-identifying and anonymizing information} For this reason, we recommend de-identification in two stages. The \textbf{initial de-identification} process strips the data of direct identifiers to create a working de-identified dataset that -can be \textit{within the research team} without the need for encryption. +can be shared \textit{within the research team} without the need for encryption. 
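A minimal sketch of such an initial de-identification step follows; every file path and variable name is a placeholder, since the identifiers to drop depend on the specific instrument.

    * Create the team's working copy by removing direct identifiers
    * (all names and paths below are hypothetical)
    use "data/raw/endline-confidential.dta", clear      // encrypted original
    drop respondent_name phone_number gps_latitude gps_longitude
    save "data/intermediate/endline-deidentified.dta", replace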
The \textbf{final de-identification} process involves making a decision about the trade-off between risk of disclosure and utility of the data before publicly releasing a dataset.\sidenote{ \url{https://sdcpractice.readthedocs.io/en/latest/SDC\_intro.html\#need-for-sdc}} We will provide more detail about the process and tools available -for initial and final de-identification in chapters 6 and 7, respectively. +for initial and final de-identification in Chapters 6 and 7, respectively. From a39045809799d6eed48b5266d727b5a08fa308d2 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 18 Feb 2020 15:23:55 -0500 Subject: [PATCH 707/854] Chapter 2 fixes (inc #329) --- chapters/planning-data-work.tex | 62 +++++++++++++++++---------------- 1 file changed, 32 insertions(+), 30 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 0774564b8..ae4beb317 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -6,7 +6,7 @@ and the collaboration platforms and processes for your team. In order to be prepared to work on the data you receive with a group, you need to structure your workflow in advance. -This means knowing which data sets and output you need at the end of the process, +This means knowing which data sets and outputs you need at the end of the process, how they will stay organized, what types of data you'll handle, and whether the data will require special handling due to size or privacy considerations. Identifying these details will help you map out the data needs for your project, @@ -74,8 +74,9 @@ \subsection{Setting up your computer} \index{password protection} All machines should have \textbf{hard disk encryption} enabled. \index{encryption} -Disk encryption is built-in on most modern operating systems; -the service is currently called BitLocker on Windows or FileVault on MacOS. +Disk encryption is available on many modern operating systems; +you should determine whether your computer implements this +or whether you need to ensure encryption at the individual file level. Disk encryption prevents your files from ever being accessed without first entering the system password. This is different from file-level encryption, @@ -119,7 +120,8 @@ \subsection{Setting up your computer} some kind of \textbf{file sharing} software. \index{file sharing} The exact services you use will depend on your tasks, -but in general, there are different approaches to file sharing, and the three discussed here are the most common. +but in general, there are several approaches to file sharing, +and the three discussed here are the most common. \textbf{File syncing} is the most familiar method, and is implemented by software like Dropbox and OneDrive. \index{file syncing} @@ -145,7 +147,7 @@ \subsection{Setting up your computer} and you should review the types of data work that you will be doing, and plan which types of files will live in which types of sharing services. -It is important to note that they are, in general, not interoperable: +It is important to note that they are, in general, not interoperable, meaning you should not have version-controlled files inside a syncing service, or vice versa, without setting up complex workarounds, and you cannot shift files between them without losing historical information. @@ -186,7 +188,7 @@ \subsection{Documenting decisions and tasks} Other tools which currently offer similar features (but are not explicitly Kanban-based) are GitHub Issues and Dropbox Paper. 
Any specific list of software will quickly be outdated; -we mention these two as an example of one that is technically-organized and one that is chronologial. +we mention these two as an example of one that is technically-organized and one that is chronological. Choosing the right tool for the right needs is essential to being satisfied with the workflow. What is important is that your team chooses its systems and stick to those choices, so that decisions, discussions, and tasks are easily reviewable long after they are completed. @@ -243,8 +245,8 @@ \subsection{Choosing software} \url{https://dimewiki.worldbank.org/wiki/ieboilstart}}) Next, think about how and where you write and execute code. -This book focuses mainly on primary survey data, -so we are going to broadly assume that you are using ``small'' datasets +This book is intended to be agnostic to the size or origin of your data, +but we are going to broadly assume that you are using desktop-sized datasets in one of the two most popular desktop-based packages: R or Stata. (If you are using another language, like Python, or working with big data projects on a server installation, @@ -272,7 +274,7 @@ \subsection{Choosing software} Stata is currently the most commonly used statistical software, and the built-in do-file editor the most common editor for programming Stata. We focus on Stata-specific tools and instructions in this book. -Hence, we will use the terms `script' and `do-file' +Hence, we will use the terms ``script'' and ``do-file'' interchangeably to refer to Stata code throughout. This is only in part due to its popularity. Stata is primarily a scripting language for statistics and data, @@ -281,7 +283,7 @@ \subsection{Choosing software} We believe that this must change somewhat: in particular, we think that practitioners of Stata must begin to think about their code and programming workflows -just as methodologically as they think about their research workflows. +just as methodologically as they think about their research workflows, and that people who adopt this approach will be dramatically more capable in their analytical ability. This means that they will be more productive when managing teams, @@ -293,7 +295,7 @@ \subsection{Choosing software} that we use in our work, which provides some new standards for coding so that code styles can be harmonized across teams for easier understanding and reuse of code. -Stata also has relatively few resources of this type available, +Stata has relatively few resources of this type available, and the ones that we have created and shared here we hope will be an asset to all its users. @@ -375,11 +377,12 @@ \subsection{Organizing files and folder structures} that move the data through this progression, and for the files that manage final analytical work. The command also has some flexibility for the addition of -folders for non-primary data sources, although this is less well developed. +folders for other types of data sources, although this is less well developed +as the needs for larger data sets tend to be very specific. The \texttt{ietoolkit} package also includes the \texttt{iegitaddmd} command, which can place \texttt{README.md} placeholder files in your folders so that -your folder structure can be shared using Git. Since these placeholder files are in -\textbf{Markdown} they also provide an easy way +your folder structure can be shared using Git. 
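To make the preceding recommendations concrete, here is a hedged sketch of the top of a master do-file: \texttt{ieboilstart} (from \texttt{ietoolkit}) standardizes version and other session settings, and a single root global keeps every file path consistent across machines. The version number, paths, and script names are placeholders.

    * Master do-file header (illustrative only; requires -ietoolkit-)
    ieboilstart, version(13.1)     // standardize Stata settings
    `r(version)'                   // apply the returned version statement

    * One root path per user; all other locations build from it
    global myProject "C:/Users/yourname/Documents/myProject"
    global dataWork  "${myProject}/DataWork"

    * Run the project's scripts in order
    do "${dataWork}/code/cleaning.do"
    do "${dataWork}/code/analysis.do"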
Since these placeholder files are +written in a plaintext language called \textbf{Markdown}, they also provide an easy way to document the contents of every folder in the structure. \index{Markdown} @@ -398,7 +401,7 @@ \subsection{Organizing files and folder structures} and will almost always create undesired functionality if combined.) Nearly all code files and raw outputs (not datasets) are best managed this way. This is because code files are always \textbf{plaintext} files, -and non-technical files are usually \textbf{binary} files.\index{plaintext}\index{binary files} +and non-code-compatible files are usually \textbf{binary} files.\index{plaintext}\index{binary files} It's also becoming more and more common for written outputs such as reports, presentations and documentation to be written using plaintext tools such as {\LaTeX} and dynamic documents.\index{{\LaTeX}}\index{dynamic documents} @@ -410,7 +413,7 @@ \subsection{Organizing files and folder structures} Setting up the \texttt{DataWork} folder in a version-controlled directory also enables you to use Git and GitHub for version control on your code files. -A \textbf{version control system} is required to manage changes to any technical file. +A \textbf{version control system} is required to manage changes to any code-compatible file. A good version control system tracks who edited each file and when, and additionally provides a protocol for ensuring that conflicting versions are avoided. This is important, for example, for your team @@ -434,9 +437,9 @@ \subsection{Organizing files and folder structures} Once the \texttt{DataWork} folder's directory structure is set up, you should adopt a file naming convention. You will generally be working with two types of files: -``technical'' files, which are those that are accessed by code processes, -and ``non-technical'' files, which will not be accessed by code processes. -The former takes precedent: an Excel file is a technical file +``code-compatible'' files, which are those that are accessed by code processes, +and ``non-code-compatible'' files, which will not be accessed by code processes. +The former takes precedence: an Excel file is a code-compatible file even if it is a field log, because at some point it will be used by code. We will not give much emphasis to files that are not linked to code here; but you should make sure to name them in an orderly fashion that works for your team. @@ -454,8 +457,8 @@ \subsection{Organizing files and folder structures} a related do file would have a name like \texttt{sampling-endline.do}. Adding timestamps to binary files as in the example above can be useful, as it is not straightforward to track changes using version control software. -However, for plaintext files tracked using Git, timestamps are an unnecessary distraction. +However, for plaintext files version-controlled using Git, timestamps are an unnecessary distraction. -Similarly, technical files should never include capital letters, +Similarly, code-compatible files should never include capital letters, as strings and file paths are case-sensitive in some software. Finally, one organizational practice that takes some getting used to is the fact that the best names from a coding perspective @@ -495,8 +498,7 @@ \subsection{Documenting and organizing code} Otherwise, you should include it in the header. Finally, use the header to track the inputs and outputs of the script. When you are trying to track down which code creates which data set, this will be very helpful.
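As a hedged example of the header and naming conventions described above (all names are placeholders), a script can record its purpose, inputs, and outputs at the top:

    /*******************************************************************
       cleaning-endline.do
       Purpose : Clean the endline household survey
       Inputs  : ${dataWork}/raw/endline-deidentified.dta
       Outputs : ${dataWork}/intermediate/endline-clean.dta
    *******************************************************************/
    use "${dataWork}/raw/endline-deidentified.dta", clear
    * ... cleaning steps go here ...
    save "${dataWork}/intermediate/endline-clean.dta", replace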
-While there are other ways to document decisions related to creating code -(GitHub offers a lot of different documentation options, for example), +While there are other ways to document decisions related to creating code, the information that is relevant to understand the code should always be written in the code file. In the script, alongside the code, are two types of comments that should be included. @@ -504,7 +506,7 @@ \subsection{Documenting and organizing code} This might be easy to understand from the code itself if you know the language well enough and the code is clear, but often it is still a great deal of work to reverse-engineer the code's intent. -Writing the task in plain English (or whichever language you communicate with your team on) +Writing the task in plain English (or whichever language you communicate with your team in) will make it easier for everyone to read and understand the code's purpose -- and also for you to think about your code as you write it. The second type of comment explains why the code is performing a task in a particular way. @@ -530,7 +532,7 @@ \subsection{Documenting and organizing code} In Stata, you can use comments to create section headers, though they're just there to make the reading easier and don't have functionality. You should also add an index in the code header by copying and pasting section titles. -You can then add and navigate through them using the \texttt{find} command. +You can then add and navigate through them using the \texttt{find} functionality. Since Stata code is harder to navigate, as you will need to scroll through the document, it's particularly important to avoid writing very long scripts. Therefore, in Stata at least, you should also consider breaking code tasks down @@ -596,14 +598,14 @@ \subsection{Documenting and organizing code} If you wait for a long time to have your code reviewed, and it gets too complex, preparation and code review will require more time and work, and that is usually the reason why this step is skipped. -One other important advantage of code review if that +One other important advantage of code review is that making sure that the code is running properly on other machines, and that other people can read and understand the code easily, is the easiest way to be prepared in advance for a smooth project handover or for release of the code to the general public. % ---------------------------------------------------------------------------------------------- -\subsection{Output management} +\subsection{Managing outputs} The final task that needs to be discussed with your team is the best way to manage output files. A great number of outputs will be created during the course of a project, @@ -638,8 +640,8 @@ \subsection{Output management} Once you are happy with a result or output, it should be named and moved to a dedicated location. It's typically desirable to have the names of outputs and scripts linked, -so, for example, \texttt{factor-analysis.do} creates \texttt{factor-analysis-f1.eps} and so on. -Document output creation in the Master script that runs these files, +so, for example, \texttt{factor-analysis.do} creates \texttt{f1-factor-analysis.eps} and so on. 
+Document output creation in the master script that runs these files, so that before the line that runs a particular analysis script there are a few lines of comments listing data sets and functions that are necessary for it to run, @@ -666,7 +668,7 @@ \subsection{Output management} Another option is to use the statistical software's dynamic document engines. This means you can write both text (in Markdown) and code in the script, -and the result will usually be a PDF or \texttt{html} file including code, text, and outputs. +and the result will usually be a PDF or HTML file including code, text, and outputs. Dynamic document tools are better for including large chunks of code and dynamically created graphs and tables, but formatting these can be much trickier and less full-featured than other editors. So dynamic documents can be great for creating appendices From 7f4816fb6786d6076bc04191d3ba533d82054cb1 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 18 Feb 2020 15:29:24 -0500 Subject: [PATCH 708/854] Chapter 3 small fixes --- chapters/research-design.tex | 42 +++++++++++++++++++----------------- 1 file changed, 22 insertions(+), 20 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index a7852dbaf..61dfa7da7 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -34,8 +34,8 @@ Intuitive knowledge of your project's chosen approach will make you much more effective at the analytical part of your work. This chapter first covers causal inference methods. -Next we discuss how to measure treatment effects and structure data for specific methods, -including: cross-sectional randomized control trials, difference-in-difference designs, +Next it discusses how to measure treatment effects and structure data for specific methods, +including cross-sectional randomized control trials, difference-in-difference designs, regression discontinuity, instrumental variables, matching, and synthetic controls. \end{fullwidth} @@ -47,7 +47,7 @@ \section{Causality, inference, and identification} When we are discussing the types of inputs -- ``treatments'' -- commonly referred to as ``programs'' or ``interventions'', we are typically attempting to obtain estimates -of program-specific \textbf{treatment effects} +of program-specific \textbf{treatment effects}. These are the changes in outcomes attributable to the treatment.\cite{abadie2018econometric} \index{treatment effect} The primary goal of research design is to establish \textbf{causal identification} for an effect. @@ -62,7 +62,7 @@ \section{Causality, inference, and identification} Without identification, we cannot say that the estimate would be accurate, even with unlimited data, and therefore cannot attribute it to the treatment in the small samples that we typically have access to. -Conversely, more data is not a substitute for a well-identified experimental design. +More data is not a substitute for a well-identified experimental design. Therefore it is important to understand how exactly your study identifies its estimate of treatment effects, so you can calculate and interpret those estimates appropriately. @@ -74,7 +74,7 @@ \section{Causality, inference, and identification} and \textbf{quasi-experimental} designs, in which the team identifies a ``natural'' source of variation and uses it for identification. Neither type is implicitly better or worse, -and both types are capable of achieving effect identification under different contexts. 
+and both types are capable of achieving causal identification in different contexts. %----------------------------------------------------------------------------------------------- \subsection{Estimating treatment effects using control groups} @@ -159,9 +159,11 @@ \subsection{Experimental and quasi-experimental research designs} is the \textbf{randomized control trial (RCT)}.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Randomized_Control_Trials}} \index{randomized control trials} -In randomized control trials, the control group is randomized -- +In randomized control trials, the treatment group is randomized -- that is, from an eligible population, -a random subset of units are not given access to the treatment, +a random group of units are given the treatment. +Another way to think about these designs is how they establish the control group: +a random subset of units are \textit{not} given access to the treatment, so that they may serve as a counterfactual for those who are. A randomized control group, intuitively, is meant to represent how things would have turned out for the treated group @@ -209,7 +211,8 @@ \subsection{Experimental and quasi-experimental research designs} or having the ability to collect data in a time and place where an event that produces causal identification occurred. Therefore, these methods often use either secondary data, -or they use primary data in a cross-sectional retrospective method. +or they use primary data in a cross-sectional retrospective method, +including administrative data or other new classes of routinely-collected information. Quasi-experimental designs therefore can access a much broader range of questions, and with much less effort in terms of executing an intervention. @@ -233,25 +236,25 @@ \section{Obtaining treatment effects from specific research designs} \subsection{Cross-sectional designs} A cross-sectional research design is any type of study -that collects data in only one time period +that observes data in only one time period and directly compares treatment and control groups. This type of data is easy to collect and handle because -you do not need track individual across time or across data sets. +you do not need to track individuals across time. If this point in time is after a treatment has been fully delivered, then the outcome values at that point in time already reflect the effect of the treatment. -If the study is an RCT, the control group is randomly constructed +If the study is experimental, the treatment and control groups are randomly constructed from the population that is eligible to receive each treatment. If it is a non-randomized observational study, we present other evidence that a similar equivalence holds. -Therefore, by construction, each unit's receipt of the treatment +In either case, by construction, each unit's receipt of the treatment is unrelated to any of its other characteristics and the ordinary least squares (OLS) regression of outcome on treatment, without any control variables, is an unbiased estimate of the average treatment effect. 
-For cross-sectional RCTs, what needs to be carefully maintained in data +For cross-sectional designs, what needs to be carefully maintained in data is the treatment randomization process itself, -as well as detailed field data about differences +as well as detailed information about differences in data quality and loss to follow-up across groups.\cite{athey2017econometrics} Only these details are needed to construct the appropriate estimator: clustering of the standard errors is required at the level @@ -317,7 +320,7 @@ \subsection{Difference-in-differences} but the treatment effect estimate corresponds to an interaction variable for treatment and time: it indicates the group of observations for which the treatment is active. -This model critically depends on the assumption that, +This model depends on the assumption that, in the absense of the treatment, the outcome of the two groups would have changed at the same rate over time, typically referred to as the \textbf{parallel trends} assumption.\sidenote{ @@ -338,8 +341,7 @@ \subsection{Difference-in-differences} both before and after they have received treatment (or not).\sidenote{ \url{https://blogs.worldbank.org/impactevaluations/what-are-we-estimating-when-we-estimate-difference-differences}} This allows each unit's baseline outcome (the outcome before the intervention) to be used -as an additional control for its endline outcome (the last outcome observation in the data), -a \textbf{fixed effects} design often referred to as an ANCOVA model, +as an additional control for its endline outcome, which can provide large increases in power and robustness.\sidenote{ \url{https://blogs.worldbank.org/impactevaluations/another-reason-prefer-ancova-dealing-changes-measurement-between-baseline-and-follow}} When tracking individuals over time for this purpose, @@ -375,8 +377,8 @@ \subsection{Difference-in-differences} \subsection{Regression discontinuity} \textbf{Regression discontinuity (RD)} designs exploit sharp breaks or limits -in policy designs to separate a group of potentially eligible recipients -into comparable gorups of individuals who do and do not receive a treatment.\sidenote{ +in policy designs to separate a single group of potentially eligible recipients +into comparable groups of individuals who do and do not receive a treatment.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Regression_Discontinuity}} These designs differ from cross-sectional and diff-in-diff designs in that the group eligible to receive treatment is not defined directly, @@ -444,7 +446,7 @@ \subsection{Instrumental variables} \textbf{Instrumental variables (IV)} designs, unlike the previous approaches, begin by assuming that the treatment delivered in the study in question is -linked to the outcome, so its effect is not directly identifiable. +linked to the outcome in a pattern such that its effect is not directly identifiable. 
Instead, similar to regression discontinuity designs, IV attempts to focus on a subset of the variation in treatment uptake and assesses that limited window of variation that can be argued From 6cd297dccaf23f2d6f3e45a27b24f6358fcd3cdc Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 18 Feb 2020 15:34:06 -0500 Subject: [PATCH 709/854] Chapter 4 small fixes (inc #321) --- bibliography.bib | 10 ------ chapters/sampling-randomization-power.tex | 43 ++++++++++++----------- code/randtreat-strata.do | 4 +-- code/simple-multi-arm-randomization.do | 2 +- code/simple-sample.do | 2 +- 5 files changed, 26 insertions(+), 35 deletions(-) diff --git a/bibliography.bib b/bibliography.bib index 47cefaf8c..8c4ee42a1 100644 --- a/bibliography.bib +++ b/bibliography.bib @@ -17,16 +17,6 @@ @book{glewwe2000designing publisher={World Bank} } -@MISC{88491, - TITLE = {What is meant by the standard error of a maximum likelihood estimate?}, - AUTHOR = {{Alecos Papadopoulos (\url{https://stats.stackexchange.com/users/28746/alecos-papadopoulos})}}, - HOWPUBLISHED = {Cross Validated}, - NOTE = {\url{https://stats.stackexchange.com/q/88491} (version: 2014-03-04)}, - year={2014}, - EPRINT = {https://stats.stackexchange.com/q/88491}, - URL = {https://stats.stackexchange.com/q/88491} -} - @article{king2019propensity, title={Why propensity scores should not be used for matching}, author={King, Gary and Nielsen, Richard}, diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 1b676f022..d778bbb11 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -29,12 +29,12 @@ In this chapter, we first cover the necessary practices to ensure that random processes are reproducible. We next turn to how to implement sampling and randomized assignment, -both for simple, uniform probability cases, and more complex designs, -such as those that require clustering or stratification. -We include code examples so the guidance is concrete and applicable. +both for simple, uniform probability cases, and more complex designs, +such as those that require clustering or stratification. +We include code examples so the guidance is concrete and applicable. The last section discusses power calculations and randomization inference, and how both are important tools to critically and quantitatively assess different -sampling and randomization designs and to make optimal choices when planning studies. +sampling and randomization designs and to make optimal choices when planning studies. \end{fullwidth} @@ -75,7 +75,7 @@ \section{Random processes in Stata} be more complex than anything we present here, and you will need to recombine these lessons to match your project's needs. -\subsection{Reproducibility in random Stata processes} +\subsection{Ensuring reproducibility in random Stata processes} Any process that includes a random component is a random process, including sampling, randomization, power calculation simulations, and algorithms like bootstrapping. @@ -119,7 +119,7 @@ \subsection{Reproducibility in random Stata processes} Since the exact order must be unchanged, the underlying data itself must be unchanged as well between runs. This means that if you expect the number of observations to change (for example increase during ongoing data collection) your randomization will not be stable unless you split your data up into -smaller fixed data set where the number of observations does not change. 
You can combine all +smaller fixed data sets where the number of observations does not change. You can combine all those smaller data sets after your randomization. In Stata, the only way to guarantee a unique sorting order is to use \texttt{isid [id\_variable], sort}. (The \texttt{sort, stable} command is insufficient.) @@ -127,10 +127,10 @@ \subsection{Reproducibility in random Stata processes} data is unchanged. \textbf{Seeding} means manually setting the start-point in the list of random numbers. -The seed is a number that should be at least six digits long and you should use exactly -one unique, different, and randomly created seed per randomization process.\sidenote{You -can draw a uniformly distributed six-digit seed randomly by visiting \url{http://bit.ly/stata-random}. -(This link is a just shortcut to request such a random seed on \url{https://www.random.org}.) +The seed is a number that should be at least six digits long and you should use exactly +one unique, different, and randomly created seed per randomization process.\sidenote{You +can draw a uniformly distributed six-digit seed randomly by visiting \url{http://bit.ly/stata-random}. +(This link is a just shortcut to request such a random seed on \url{https://www.random.org}.) There are many more seeds possible but this is a large enough set for most purposes.} In Stata, \texttt{set seed [seed]} will set the generator to that start-point. In R, the \texttt{set.seed} function does the same. To be clear: you should not set a single seed once in the master do-file, @@ -167,8 +167,9 @@ \section{Sampling and randomization} In reality, you have to work with exactly one of them, so we put a lot of effort into making sure that one is a good one -by reducing the probability that we observe nonexistent, or ``spurious'', results. -In large studies, we can use what are called \textbf{asymptotic standard errors}\cite{88491} +by reducing the probability of the data indicating that results or correlations are true when they are not. +In large studies, we can use what are called \textbf{asymptotic standard errors}\sidenote{ + \url{https://stats.stackexchange.com/q/88491}} to express how far away from the true population parameters our estimates are likely to be. These standard errors can be calculated with only two datapoints: the sample size and the standard deviation of the value in the chosen sample. @@ -182,7 +183,7 @@ \section{Sampling and randomization} \subsection{Sampling} \textbf{Sampling} is the process of randomly selecting units of observation -from a master list of individuals to be surveyed for data collection.\sidenote{ +from a master list of individuals to be included in data collection.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Sampling_\%26_Power_Calculations}} \index{sampling} That master list may be called a \textbf{sampling universe}, a \textbf{listing frame}, or something similar. @@ -200,7 +201,7 @@ \subsection{Sampling} The most explicit method of implementing this process is to assign random numbers to all your potential observations, order them by the number they are assigned, -and mark as `sampled' those with the lowest numbers, up to the desired proportion. +and mark as ``sampled'' those with the lowest numbers, up to the desired proportion. (In general, we will talk about sampling proportions rather than numbers of observations. Sampling specific numbers of observations is complicated and should be avoided, because it will make the probability of selection very hard to calculate.) 
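A minimal sketch that combines the versioning, sorting, and seeding rules above with the random-number procedure just described; the ID variable, the seed value, and the 20 percent proportion are all hypothetical:

    ieboilstart , v(13.1)    // VERSIONING
    `r(version)'
    isid household_id, sort  // SORTING: confirm the ID is unique and fix the sort order
    set seed 215597          // SEEDING: one six-digit seed drawn on random.org for this process
    gen sample_rand = runiform()
    sort sample_rand
    gen sampled = (_n <= 0.20 * _N)  // lowest random numbers, up to the desired proportion

Running this script again on the same data returns exactly the same sample; changing the data, the sort order, or the seed changes it.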
@@ -216,7 +217,7 @@ \subsection{Sampling} since otherwise the sampling process will not be clear, and the interpretation of measurements is directly linked to who is included in them. Often, data collection can be designed to keep complications to a minimum, -so long as it are carefully thought through from this perspective. +so long as it is carefully thought through from this perspective. Ex post changes to the study scope using a sample drawn for a different purpose usually involve tedious calculations of probabilities and should be avoided. @@ -249,7 +250,7 @@ \subsection{Randomization} Complexity can therefore grow very quickly in randomization and it is doubly important to fully understand the conceptual process that is described in the experimental design, -and fill in any gaps in the process before implementing it in Stata. +and fill in any gaps in the process before implementing it in code. Some types of experimental designs necessitate that randomization results be revealed during data collection. It is possible to do this using survey software or live events. @@ -285,7 +286,7 @@ \section{Clustering and stratification} \subsection{Clustering} -Many studies collect data at a different level of observation than the randomization unit.\sidenote{ +Many studies observe data at a different level than the randomization unit.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Unit_of_Observation}} For example, a policy may only be able to affect an entire village, but the study is interested in household behavior. @@ -320,7 +321,7 @@ \subsection{Stratification} This has the effect of ensuring that members of each subgroup are included in all groups of the randomization process, since it is possible that a global randomization -would put all the members of a subgroup into just one of the outcomes. +would put all the members of a subgroup into just one of the treatment arms. In this context, the subgroups are called \textbf{strata}. Manually implementing stratified randomization in Stata is prone to error. @@ -424,7 +425,7 @@ \subsection{Power calculations} so such a study would not be able to say anything about the effect size that is practically relevant. Conversely, the \textbf{minimum sample size} pre-specifies expected effect sizes and tells you how large a study's sample would need to be to detect that effect, -which can tell you what resources you would need to avoid that exact problem. +which can tell you what resources you would need to avoid that problem. Stata has some commands that can calculate power analytically for very simple designs -- \texttt{power} and \texttt{clustersampsi} -- @@ -435,7 +436,7 @@ \subsection{Power calculations} since the interactions of experimental design, sampling and randomization, clustering, stratification, and treatment arms -quickly becomes very complex. +quickly becomes complex. Furthermore, you should use real data on the population of interest whenever it is available, or you will have to make assumptions about the distribution of outcomes. @@ -491,7 +492,7 @@ \subsection{Randomization inference} and calculate empirical p-values for the effect size in our sample. After analyzing the actual treatment assignment, \texttt{ritest} illustrates the distribution of false correlations -that this randomization approach can produce by chance +that this randomization could produce by chance between outcomes and treatments. 
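As a concrete sketch of that check, using the community-contributed ritest command referenced here (variable names, repetitions, and the seed are hypothetical):

    * Sketch only: hypothetical variable names
    cap which ritest
    if _rc ssc install ritest
    ritest treatment _b[treatment], reps(1000) seed(215597): ///
        regress outcome treatment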
 The randomization-inference p-value is the number of times
 that a false effect was larger than the one you measured,
diff --git a/code/randtreat-strata.do b/code/randtreat-strata.do
index ace5a9419..c1d896b99 100644
--- a/code/randtreat-strata.do
+++ b/code/randtreat-strata.do
@@ -2,7 +2,7 @@
 cap which randtreat
 	if _rc ssc install randtreat
 
-* Set up reproducbilitiy - VERSIONING, SORTTING and SEEDING
+* Set up reproducibility - VERSIONING, SORTING and SEEDING
 ieboilstart , v(13.1)    // Version
 `r(version)'             // Version
 sysuse bpwide.dta, clear // Load data
@@ -22,7 +22,7 @@
 * example, we use the "global" misfit strategy, meaning that the misfits will
 * be randomized into treatment groups so that the sizes of the treatment
 * groups are as balanced as possible globally (read the help file for details
-* on this and other strategies for misfits). This way we have 6 treatment 
+* on this and other strategies for misfits). This way we have 6 treatment
 * groups with exactly 20 observations in each, and it is randomized which
 * group that has an extra observation in each treatment arm.
 randtreat,  ///
diff --git a/code/simple-multi-arm-randomization.do b/code/simple-multi-arm-randomization.do
index 846b5bbb4..ea5e23f32 100644
--- a/code/simple-multi-arm-randomization.do
+++ b/code/simple-multi-arm-randomization.do
@@ -1,4 +1,4 @@
-* Set up reproducbilitiy - VERSIONING, SORTTING and SEEDING
+* Set up reproducibility - VERSIONING, SORTING and SEEDING
 ieboilstart , v(13.1)    // Version
 `r(version)'             // Version
 sysuse bpwide.dta, clear // Load data
diff --git a/code/simple-sample.do b/code/simple-sample.do
index f5205a7a5..d34d85023 100644
--- a/code/simple-sample.do
+++ b/code/simple-sample.do
@@ -1,4 +1,4 @@
-* Set up reproducbilitiy - VERSIONING, SORTTING and SEEDING
+* Set up reproducibility - VERSIONING, SORTING and SEEDING
 ieboilstart , v(13.1)    // Version
 `r(version)'             // Version
 sysuse bpwide.dta, clear // Load data

From c639946148c9e3769db065e991ce9291fe12ef42 Mon Sep 17 00:00:00 2001
From: Benjamin Daniels
Date: Tue, 18 Feb 2020 15:39:38 -0500
Subject: [PATCH 710/854] Chapter 5 small fixes

---
 chapters/data-collection.tex | 43 ++++++++++++++++++++----------------
 1 file changed, 24 insertions(+), 19 deletions(-)

diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex
index 6232d52d7..fdcf967e6 100644
--- a/chapters/data-collection.tex
+++ b/chapters/data-collection.tex
@@ -18,8 +18,11 @@
 so that the ownership and licensing of all information is established
 and the privacy rights of the people it describes are respected.
-We then dive specifically into survey data, providing guidance on the data generation workflow,
-from questionnaire design to programming electronic survey instruments and monitoring data quality.
+While the principles of data governance and data quality apply to all types of data,
+there are additional considerations for ensuring data quality if you are
+collecting data yourself through an instrument like a field survey.
+This chapter provides detailed guidance on the data generation workflow,
+from questionnaire design to programming electronic instruments and monitoring data quality.
 While surveys remain popular, the rise of electronic data collection
 instruments means that there are additional workflow considerations
 needed to ensure that your data is accurate and usable in statistical software. 
@@ -161,12 +164,12 @@ \subsection{Receiving data from development partners} As soon as the requisite pieces of information are stored together, think about which ones are the components of what you would call a dataset. -This is, as many things are, more of an art than a science: +This is more of an art than a science: you want to keep things together that belong together, but you also want to keep things apart that belong apart. -There usually won't be a precise way to tell the answer to this question, +There usually won't be a precise way to answer this question, so consult with others about what is the appropriate level of aggregation -for the data project you have endeavored to obtain. +for the data you have endeavored to obtain. This is the object you will think about cataloging, releasing, and licensing as you move towards the publication part of the research process. This may require you to re-check with the provider @@ -183,15 +186,16 @@ \section{Collecting primary data using electronic surveys} have greatly accelerated our ability to bring in high-quality data using purpose-built survey instruments, and therefore improved the precision of research. -At the same time, electronic surveys create some pitfalls to avoid. +At the same time, electronic surveys create new pitfalls to avoid. Programming surveys efficiently requires a very different mindset than simply designing them in word processing software, and ensuring that they flow correctly and produce data that can be used in statistical software requires careful organization. This section will outline the major steps and technical considerations -you will need to follow whenever you field a custom survey instrument. +you will need to follow whenever you field a custom survey instrument, +no matter the scale. -\subsection{Developing a survey instrument} +\subsection{Developing a data collection instrument} A well-designed questionnaire results from careful planning, consideration of analysis and indicators, close review of existing questionnaires, @@ -217,7 +221,7 @@ \subsection{Developing a survey instrument} is especially useful for training data collection staff, by focusing on the survey content and structure before diving into the technical component. It is much easier for enumerators to understand the range of possible participant responses -and how to hand them correctly on a paper survey than on a tablet, +and how to handle them correctly on a paper survey than on a tablet, and it is much easier for them to translate that logic to digital functionality later. Finalizing this version of the questionnaire before beginning any programming also avoids version control concerns that arise from concurrent work @@ -237,9 +241,9 @@ \subsection{Developing a survey instrument} \url{https://dimewiki.worldbank.org/wiki/Pre-Analysis_Plan}} Use the list of key outcomes to create an outline of questionnaire \textit{modules}. -Do not number the modules yet; instead use a short prefix so they can be easily reordered. +Do not number the modules; instead use a short prefix so they can be easily reordered. For each module, determine if the module is applicable to the full sample, -the appropriate respondent, and whether or how often, the module should be repeated. +an appropriate respondent, and whether or how often the module should be repeated. 
A few examples: a module on maternal health only applies to household with a woman who has children, a household income module should be answered by the person responsible for household finances, and a module on agricultural production might be repeated for each crop the household cultivated. @@ -281,7 +285,7 @@ \subsection{Designing surveys for electronic deployment} Electronic data collection has great potential to simplify survey implementation and improve data quality. Electronic questionnaires are typically created in a spreadsheet (e.g. Excel or Google Sheets) -or software-specific form builder, which are accessible even to novice users.\sidenote{ +or a software-specific form builder, all of which are accessible even to novice users.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Questionnaire_Programming}} We will not address software-specific form design in this book; rather, we focus on coding conventions that are important to follow @@ -338,7 +342,8 @@ \subsection{Designing surveys for electronic deployment} (we prefer all-lowercase naming). Take special care with the length: very long names will be cut off in some softwares, which could result in a loss of uniqueness and lots of manual work to restore compatibility. -We further discourage explicit question numbering, as it discourages re-ordering, +We further discourage explicit question numbering, +at least at first, as it discourages re-ordering questions, which is a common recommended change after the pilot. In the case of follow-up surveys, numbering can quickly become convoluted, too often resulting in uninformative variables names like @@ -515,7 +520,7 @@ \subsection{Conducting back-checks and data validation} \textbf{Back-checks}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Back_Checks}} and other validation audits help ensure that data collection is following established protocols, and that data is not fasified, incomplete, or otherwise suspect. -For back-checks and validation audies, a random subset of the main data is selected, +For back-checks and validation audits, a random subset of the main data is selected, and a subset of information from the full survey is verified through a brief targeted survey with the original respondent or a cross-referenced data set from another source (if the original data is not a field survey). @@ -572,7 +577,7 @@ \section{Collecting and sharing data securely} Proper encryption is rarely just a single method, as the data will travel through many servers, devices, and computers from the source of the data to the final analysis. -So encryption should be seen as a system that is only as secure as its weakest link. +Encryption should be seen as a system that is only as secure as its weakest link. This section recommends a workflow with as few parts as possible, so that it is easy as possible to make sure the weakest link is still sufficiently secure. @@ -615,7 +620,7 @@ \subsection{Collecting data securely} Therefore, as long as you are using an established survey software, this step is largely taken care of. However, the research team must ensure that all computers, tablets, -and accounts that are used in data collection have secure a logon +and accounts that are used in data collection have a secure logon password and are never left unlocked. 
Even though your data is therefore usually safe while it is being transmitted, @@ -628,7 +633,7 @@ \subsection{Collecting data securely} If you do not, the raw data will be accessible by individuals who are not approved by your IRB, such as tech support personnel, -server administrators and other third-party staff. +server administrators, and other third-party staff. Encryption at rest must be used to make data files completely unusable without access to a security key specific to that data -- a higher level of security than password-protection. @@ -691,7 +696,7 @@ \subsection{Storing data securely} private key used during data collection to be able to download the data, \textit{and} you will need the key used when you created the secure folder to save it there. This your first copy of your raw data, and the copy you will use in your cleaning and analysis. - \item Create a secure folder on a pen-drive or a external hard drive that you can keep in your office. + \item Create a secure folder on a flash drive or a external hard drive that you can keep in your office. Copy the data you just downloaded to this second secure folder. This is your ``master'' copy of your raw data. (Instead of creating a second secure folder, you can simply copy the first secure folder.) @@ -711,7 +716,7 @@ \subsection{Storing data securely} \url{https://www.backblaze.com/blog/the-3-2-1-backup-strategy/}} However, you still need to keep track of your encryption keys as without them your data is lost. If you remain lucky, you will never have to access your ``master'' or ``golden master'' copies -- -you just want to know it is out there, safe, if you need it. +you just want to know it is there, safe, if you need it. \subsection{Sharing data securely} You and your team will use your first copy of the raw data From 08b0add50e056df4b694085d00bcf6f896e49622 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 18 Feb 2020 15:47:26 -0500 Subject: [PATCH 711/854] Chapter 6 small fixes --- chapters/data-analysis.tex | 80 +++++++++++++++++++------------------- 1 file changed, 40 insertions(+), 40 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 8e25203e0..d7564648d 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -14,7 +14,7 @@ with a clear system for version control, and analytical scripts structured such that any member of the research team can run them. Putting in time upfront to structure data work well -pays off substantial dividends throughout the process. +pays substantial dividends throughout the process. In this chapter, we first cover data management: how to organize your data work at the start of a project @@ -74,10 +74,10 @@ \subsection{Organizing your folder structure} We created \texttt{iefolder} based on our experience with primary data, but it can be used for different types of data, and adapted to fit different needs. -No matter what are your team's preference in terms of folder organization, +No matter what your team's preferences in terms of folder organization are, the principle of creating a single unified standard remains. -At the top level of the structure created by \texttt{iefolder} are what we call survey round folders.\sidenote{ +At the top level of the structure created by \texttt{iefolder} are what we call ``round'' folders.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/DataWork\_Survey\_Round}} You can think of a ``round'' as a single source of data, which will all be cleaned using a single script. 
@@ -153,21 +153,21 @@ \subsection{Implementing version control} \section{De-identifying research data} The starting point for all tasks described in this chapter is the raw data -which should contain only information that are received directly from the field. +which should contain only information that is received directly from the field. The raw data will invariably come in a variety of file formats and these files -should be saved in the raw data folder \textit{exactly as they were +should be saved in the raw data folder \textit{exactly as they were received}. Be mindful of how and where they are stored as they can not be re-created and nearly always contain confidential data such as personally-identifying information\index{personally-identifying information}. As described in the previous chapter, confidential data must always be encrypted\sidenote{\url{https://dimewiki.worldbank.org/wiki/Encryption}} and be properly backed up since every other data file you will use is created from the -raw data. The only data sets that can not be re-created are the raw data +raw data. The only data sets that can not be re-created are the raw data themselves. The raw data files should never be edited directly. This is true even in the rare case when the raw data cannot be opened due to, for example, incorrect -encoding where non-English character is causing rows or columns to break at the +encoding where a non-English character is causing rows or columns to break at the wrong place when the data is imported. In this scenario, you should create a copy of the raw data where you manually remove the special characters and securely back up \textit{both} the broken and the fixed copy of the raw data. @@ -204,22 +204,22 @@ \section{De-identifying research data} as you can always go back and remove variables from the list of variables to be dropped, but you can not go back in time and drop a PII variable that was leaked because it was incorrectly kept. -Examples include respondent names, enumerator names, interview date, respondent phone number. +Examples include respondent names, enumerator names, interview dates, and respondent phone numbers. For each PII variable that is needed in the analysis, ask yourself: -\textit{can I encode or otherwise construct a variable that masks the PII, and +\textit{can I encode or otherwise construct a variable that masks the PII, and then drop this variable?} This is typically the case for most identifying information. Examples include geocoordinates (after constructing measures of distance or area, -drop the specific location), +drop the specific location) and names for social network analysis (can be encoded to secret and unique IDs). -If the answer to either of two questions above is yes, +If the answer to either of the two questions above is yes, all you need to do is write a script to drop the variables that are not required for analysis, encode or otherwise mask those that are required, and save a working version of the data. -If PII variables are strictly required for the analysis itself and can not be +If PII variables are strictly required for the analysis itself and can not be masked or encoded, -it will be necessary to keep at least a subset of the data encrypted through +it will be necessary to keep at least a subset of the data encrypted through the data analysis process. The resulting de-identified data will be the underlying source for all cleaned and constructed data. 
@@ -230,8 +230,7 @@ \section{De-identifying research data} \section{Cleaning data for analysis} -Data cleaning is the second stage in the transformation of data you received from -the field into data that you can analyze.\sidenote{\url{ +Data cleaning is the second stage in the transformation of data you received into data that you can analyze.\sidenote{\url{ https://dimewiki.worldbank.org/wiki/Data\_Cleaning}} The cleaning process involves (1) making the data set easily usable and understandable, and (2) documenting individual data points and patterns that may bias the analysis. @@ -254,7 +253,7 @@ \subsection{Correcting data entry errors} Modern survey tools create unique observation identifiers. That, however, is not the same as having a unique ID variable for each individual in the sample. You want to make sure the data set has a unique ID variable -that can be cross-referenced with other records, such as the Master Data Set\sidenote{\url{https://dimewiki.worldbank.org/wiki/Master\_Data\_Set}} +that can be cross-referenced with other records, such as the master data set\sidenote{\url{https://dimewiki.worldbank.org/wiki/Master\_Data\_Set}} and other rounds of data collection. \texttt{ieduplicates} and \texttt{iecompdup}, two Stata commands included in the \texttt{iefieldkit} @@ -284,7 +283,7 @@ \subsection{Labeling, annotating, and finalizing clean data} and be accompanied by a dictionary or codebook. Typically, one cleaned data set will be created for each data source or survey instrument. -Each row in the cleaned data set represents one survey entry or unit of +Each row in the cleaned data set represents one survey entry or unit of observation.\cite{tidy-data} If the raw data set is very large, or the survey instrument is very complex, you may want to break the data cleaning into sub-steps, @@ -322,6 +321,7 @@ \subsection{Labeling, annotating, and finalizing clean data} such as renaming, relabeling, and value labeling, much easier.\sidenote{ \url{https://dimewiki.worldbank.org/wiki/iecodebook}} \index{iecodebook} + We have a few recommendations on how to use this command, and how to approach data cleaning in general. First, we suggest keeping the same variable names in the cleaned data set as in the survey instrument, @@ -337,7 +337,7 @@ \subsection{Labeling, annotating, and finalizing clean data} Third, recodes should be used to turn codes for ``Don't know'', ``Refused to answer'', and other non-responses into extended missing values.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning\#Survey\_Codes\_and\_Missing\_Values}} String variables that correspond to categorical variables need to be encoded. -Open-ended responses stored as strings usually have a high-risk of being identifiers, +Open-ended responses stored as strings usually have a high risk of being identifiers, so they should be dropped at this point. You can use the encrypted data as an input to a construction script that categorizes these responses and merges them to the rest of the dataset. @@ -391,9 +391,9 @@ \section{Constructing final indicators} such as caloric input or food expenditure per adult equivalent. 
During this process, the data points will typically be reshaped and aggregated so that level of the data set goes from the unit of observation -(one item in the bundle) in the survey to the unit of analysis (the household).\sidenote{ +(one item in the bundle) in the survey to the unit of analysis (the household),\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Unit\_of\_Observation}} -so that level of the data set goes from the unit of observation (one item in the bundle) +so that the level of the data set goes from the unit of observation (one item in the bundle) in the survey to the unit of analysis (the household).\sidenote{ \url{https://dimewiki.worldbank.org/wiki/Unit\_of\_Observation}} @@ -402,7 +402,7 @@ \section{Constructing final indicators} or even different units of observation, you may have one or multiple constructed data sets, depending on how your analysis is structured. -So don't worry if you cannot create a single, ``canonical'' analysis data set. +Don't worry if you cannot create a single, ``canonical'' analysis data set. It is common to have many purpose-built analysis datasets. Think of an agricultural intervention that was randomized across villages and only affected certain plots within each village. @@ -410,7 +410,7 @@ \section{Constructing final indicators} test for plot-level productivity gains, and check if village characteristics are balanced. Having three separate datasets for each of these three pieces of analysis -will result in much cleaner do files than if they all started from the same file. +will result in much cleaner do files than if they all started from the same data set. % From cleaning Construction is done separately from data cleaning for two reasons. @@ -434,7 +434,7 @@ \section{Constructing final indicators} In practice, however, following this principle is not always easy. As you analyze the data, different constructed variables will become necessary, as well as subsets and other alterations to the data. -Still, constructing variables in a separate script from the analysis +Constructing variables in a separate script from the analysis will help you ensure consistency across different outputs. If every script that creates a table starts by loading a data set, subsetting it, and manipulating variables, @@ -471,12 +471,12 @@ \subsection{Constructing analytical variables} Make sure there is consistency across constructed variables. It's possible that your questionnaire asked respondents to report some answers as percentages and others as proportions, or that in one variable \texttt{0} means ``no'' and \texttt{1} means ``yes'', -while in another one the same answers were coded are \texttt{1} and \texttt{2}. +while in another one the same answers were coded as \texttt{1} and \texttt{2}. We recommend coding yes/no questions as either \texttt{1} and \texttt{0} or \texttt{TRUE} and \texttt{FALSE}, so they can be used numerically as frequencies in means and as dummies in regressions. -(Note that this implies that categorical variables like \texttt{gender} +(Note that this implies that categorical variables like \texttt{sex} should be re-expressed as binary variables like \texttt{female}.) -Check that non-binary categorical variables have the same value-assignment, i.e., +Check that non-binary categorical variables have the same value assignment, i.e., that labels and levels have the same correspondence across variables that use the same options. 
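To make the consistency recommendations above concrete, a short sketch in which the variable names are hypothetical and the raw yes/no question is assumed to be coded 1/2:

    recode uses_fertilizer (1 = 1 "Yes") (2 = 0 "No"), gen(fertilizer_any)   // 1/2 becomes a 0/1 dummy
    gen female = (sex == 2) if !missing(sex)                                 // categorical re-expressed as an explicit binary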
 Finally, make sure that any numeric variables you are comparing are converted to the same scale or unit of measure.
 You cannot add one hectare and two acres into a meaningful number.
@@ -619,18 +619,18 @@ \subsection{Visualizing data}
 \url{http://socvis.co}}
 Graphics tools like Stata are highly customizable.
 There is a fair amount of learning curve associated with extremely-fine-grained adjustment,
-but it is well worth reviewing the graphics manual\sidenote{\url{https://www.stata.com/manuals/g.pdf}}
+but it is well worth reviewing the graphics manual.\sidenote{\url{https://www.stata.com/manuals/g.pdf}}
 For an easier way around it, Gray Kimbrough's \textit{Uncluttered Stata Graphs}
-code is an excellent default replacement for Stata graphics that is easy to install.
-\sidenote{\url{https://graykimbrough.github.io/uncluttered-stata-graphs/}}
-If you are a R user, the \textit{R Graphics Cookbook}\sidenote{\url{https://r-graphics.org/}}
-is a great resource for the its popular visualization package, \texttt{ggplot}\sidenote{
+code is an excellent default replacement for Stata graphics that is easy to install.\sidenote{
+	\url{https://graykimbrough.github.io/uncluttered-stata-graphs/}}
+If you are an R user, the \textit{R Graphics Cookbook}\sidenote{\url{https://r-graphics.org/}}
+is a great resource for its popular visualization package \texttt{ggplot}\sidenote{
 \url{https://ggplot2.tidyverse.org/}}.
 But there are a variety of other visualization packages,
-such as \texttt{highcharter}\sidenote{\url{http://jkunst.com/highcharter/}},
-\texttt{r2d3}\sidenote{\url{https://rstudio.github.io/r2d3/}},
-\texttt{leaflet}\sidenote{\url{https://rstudio.github.io/leaflet/}},
-and \texttt{plotly}\sidenote{\url{https://plot.ly/r/}}, to name a few.
+such as \texttt{highcharter},\sidenote{\url{http://jkunst.com/highcharter/}}
+\texttt{r2d3},\sidenote{\url{https://rstudio.github.io/r2d3/}}
+\texttt{leaflet},\sidenote{\url{https://rstudio.github.io/leaflet/}}
+and \texttt{plotly},\sidenote{\url{https://plot.ly/r/}} to name a few.
 We have no intention of creating an exhaustive list,
 and this one is certainly missing very good references; but it is a good place to start.
 We attribute some of the difficulty of creating good data visualization
@@ -654,8 +654,8 @@ \subsection{Exporting analysis outputs}
 creates and exports balance tables to excel or {\LaTeX}.
 \texttt{ieddtab}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Ieddtab}}
 does the same for difference-in-differences regressions.
-It also includes a command, \texttt{iegraph}\sidenote{
-	\url{https://dimewiki.worldbank.org/wiki/Iegraph}},
+It also includes a command, \texttt{iegraph},\sidenote{
+	\url{https://dimewiki.worldbank.org/wiki/Iegraph}}
 to export pre-formatted impact evaluation results graphs.
 It's okay to not export each and every table and graph created during exploratory analysis.
@@ -695,10 +695,10 @@ \subsection{Exporting analysis outputs}
 Excel \texttt{.xlsx} and \texttt{.csv} are also commonly used,
 but require the extra step of copying the tables into the final output.
 The amount of work needed in a copy-paste workflow increases rapidly with the number of tables and figures included in a research output,
-and do the chances of having the wrong version a result in your paper or report.
+and so do the chances of having the wrong version of a result in your paper or report. 
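For standard regression tables, one widely used option is the community-contributed estout package; a minimal sketch, with variable names and the output path hypothetical:

    cap which esttab
    if _rc ssc install estout
    eststo clear
    eststo: regress outcome treatment
    eststo: regress outcome treatment i.strata
    esttab using "${outputs}/treatment_effects.tex", replace se label booktabs

Because the .tex file is rewritten every time the script runs, the table in the final document is always the current one.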
-If you need to create a table with a very particular format, -that is not automated by any command you know, consider writing the it manually +If you need to create a table with a very particular format +that is not automated by any command you know, consider writing it manually (Stata's \texttt{filewrite}, for example, allows you to do that). This will allow you to write a cleaner script that focuses on the econometrics, and not on complicated commands to create and append intermediate matrices. From 551d748fcf631fd2e297265ecfd4bf1109797591 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 18 Feb 2020 15:52:56 -0500 Subject: [PATCH 712/854] Chapter 7 fixes and some redundancies --- chapters/publication.tex | 155 +++++++++++++++++++-------------------- code/sample.bib | 2 +- 2 files changed, 77 insertions(+), 80 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 33759c594..aa68254bc 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -2,7 +2,7 @@ \begin{fullwidth} For most research projects, completing a manuscript is not the end of the task. -Academic journals increasingly require submission of a replication package, +Academic journals increasingly require submission of a replication package which contains the code and materials needed to create the results. These represent an intellectual contribution in their own right, because they enable others to learn from your process @@ -60,7 +60,7 @@ \section{Collaborating on technical writing} such that there is no risk of materials being compiled with out-of-date results, or of completed work being lost or redundant. -\subsection{Dynamic documents} +\subsection{Preparing dynamic documents} Dynamic documents are a broad class of tools that enable a streamlined, reproducible workflow. The term ``dynamic'' can refer to any document-creation technology @@ -86,7 +86,7 @@ \subsection{Dynamic documents} There are a number of tools that can be used for dynamic documents. Some are code-based tools such as R's RMarkdown\sidenote{\url{https://rmarkdown.rstudio.com/}} -and Stata's \texttt{dyndoc}\sidenote{\url{https://www.stata.com/manuals/rptdyndoc.pdf}}. +and Stata's \texttt{dyndoc}.\sidenote{\url{https://www.stata.com/manuals/rptdyndoc.pdf}} These tools ``knit'' or ``weave'' text and code together, and are programmed to insert code outputs in pre-specified locations. Documents called ``notebooks'' (such as Jupyter\sidenote{\url{https://jupyter.org/}}) work similarly, @@ -101,7 +101,7 @@ \subsection{Dynamic documents} These can be useful for working on informal outputs, such as blogposts, with collaborators who do not code. An example of this is Dropbox Paper, -a free online writing tool that allows linkages to files in Dropbox, +a free online writing tool that allows linkages to files in Dropbox which are automatically updated anytime the file is replaced. However, the most widely utilized software @@ -135,12 +135,13 @@ \subsection{Technical writing with \LaTeX} manages tables and figures dynamically, and includes commands for simple markup like font styles, paragraph formatting, section headers and the like. -It also includes special controls for including tables and figures, +It includes special controls for including tables and figures, footnotes and endnotes, complex mathematical notation, and automated bibliography preparation. 
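As a small illustration of the \LaTeX\ controls just described, a fragment that pulls in a dynamically generated table and cites an entry from a BibTeX file (the file names and label are hypothetical; the citation key is the one used in this book's sample.bib):

    \documentclass{article}
    \usepackage{booktabs,graphicx}
    \begin{document}
    Results are reported in Table \ref{tab:ate};
    see \cite{flom2005latex} for an introduction to this workflow.
    \begin{table}
      \caption{Average treatment effects \label{tab:ate}}
      \input{outputs/ate_table.tex} % regenerated by the analysis scripts
    \end{table}
    \bibliographystyle{plain}
    \bibliography{sample}
    \end{document}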
It also allows publishers to apply global styles and templates to already-written material, allowing them to reformat entire documents in house styles with only a few keystrokes. -One of the most important tools available in \LaTeX\ is the BibTeX bibliography manager.\sidenote{ +One of the most important tools available in \LaTeX\ +is the BibTeX citation and bibliography manager.\sidenote{ \url{http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.365.3194&rep=rep1&type=pdf}} BibTeX keeps all the references you might use in an auxiliary file, then references them using a simple element typed directly in the document: a \texttt{cite} command. @@ -149,7 +150,7 @@ \subsection{Technical writing with \LaTeX} and then everywhere they are used they are updated correctly with one process. Specifically, \LaTeX\ inserts references in text using the \texttt{\textbackslash cite\{\}} command. Once this is written, \LaTeX\ automatically pulls all the citations into text -and creates a complete bibliography based on the citations you use when you compile the document. +and creates a complete bibliography based on the citations you used whenever you compile the document. The system allows you to specify exactly how references should be displayed in text (such as superscripts, inline references, etc.) as well as how the bibliography should be styled and in what order @@ -250,76 +251,6 @@ \subsection{Getting started with \LaTeX\ in the cloud} cloud-based implementations are often the easiest way to allow coauthors to write and edit in \LaTeX\, so long as you make sure you are available to troubleshoot minor issues like these. -%------------------------------------------------ - -\section{Publishing primary data} - -If your project collected primary data, -releasing the cleaned dataset is a significant contribution that can be made -in addition to any publication of analysis results. -Publishing data can foster collaboration with researchers -interested in the same subjects as your team. -Collaboration can enable your team to fully explore variables and -questions that you may not have time to focus on otherwise, -even though data was collected on them. -There are different options for data publication. -The World Bank's Development Data Hub\sidenote{ - \url{https://data.worldbank.org/}} -includes a Microdata Catalog\sidenote{ -\url{https://microdata.worldbank.org}} -where researchers can publish data and documentation for their projects.\sidenote{ -\url{https://dimewiki.worldbank.org/wiki/Microdata\_Catalog} -\newline -\url{https://dimewiki.worldbank.org/wiki/Checklist:\_Microdata\_Catalog\_submission} -} -The Harvard Dataverse\sidenote{ - \url{https://dataverse.harvard.edu}} -publishes both data and code, -and also creates a data citation for its entries -- -IPA/J-PAL field experiment repository is especially relevant\sidenote{ - \url{https://www.povertyactionlab.org/blog/9-11-19/new-hub-data-randomized-evaluations}} -for those interested in impact evaluation. - -There will almost always be a trade-off between accuracy and privacy. -For publicly disclosed data, you should favor privacy. -Therefore, before publishing data, -you should carefully perform a \textbf{final de-identification}. -Its objective is to create a dataset for publication -that cannot be manipulated or linked to identify any individual research participant. -If you are following the steps outlined in this book, -you have already removed any direct identifiers after collecting the data. 
-At this stage, however, you should further remove -all indirect identifiers, and assess the risk of statistical disclosure.\sidenote{ - \textbf{Disclosure risk:} the likelihood that a released data record can be associated with an individual or organization.}\index{statistical disclosure} -To the extent required to ensure reasonable privacy, -potentially identifying variables must be further masked or removed. - -There are a number of tools developed to help researchers de-identify data -and which you should use as appropriate at that stage of data collection. -These include \texttt{PII\_detection}\sidenote{ - \url{https://github.com/PovertyAction/PII\_detection}} -from IPA, -\texttt{PII-scan}\sidenote{ - \url{https://github.com/J-PAL/PII-Scan}} -from JPAL, -and \texttt{sdcMicro}\sidenote{ - \url{https://sdcpractice.readthedocs.io/en/latest/sdcMicro.html}} -from the World Bank. -\index{anonymization} -The \texttt{sdcMicro} tool, in particular, has a feature -that allows you to assess the uniqueness of your data observations, -and simple measures of the identifiability of records from that. -Additional options to protect privacy in data that will become public exist, -and you should expect and intend to release your datasets at some point. -One option is to add noise to data, as the US Census Bureau has proposed,\cite{abowd2018us} -as it makes the trade-off between data accuracy and privacy explicit. -But there are no established norms for such ``differential privacy'' approaches: -most approaches fundamentally rely on judging ``how harmful'' information disclosure would be. -The fact remains that there is always a balance between information release (and therefore transparency) -and privacy protection, and that you should engage with it actively and explicitly. -The best thing you can do is make a complete record of the steps that have been taken -so that the process can be reviewed, revised, and updated as necessary. - %---------------------------------------------------- \section{Preparing a complete replication package} @@ -351,7 +282,7 @@ \subsection{Publishing data for replication} is an important contribution you can make along with the publication of results. It allows other researchers to validate the mechanical construction of your results, to investigate what other results might be obtained from the same population, -and test alternative approaches to other questions. +and test alternative approaches or other questions. Therefore you should make clear in your study where and how data are stored, and how and under what circumstances they may be accessed. You do not always have to publish the data yourself, @@ -367,6 +298,32 @@ \subsection{Publishing data for replication} They can also provide for timed future releases of datasets once the need for exclusive access has ended. +If your project collected primary data, +releasing the cleaned dataset is a significant contribution that can be made +in addition to any publication of analysis results. +Publishing data can foster collaboration with researchers +interested in the same subjects as your team. +Collaboration can enable your team to fully explore variables and +questions that you may not have time to focus on otherwise, +even though data was collected on them. +There are different options for data publication. 
+The World Bank's Development Data Hub\sidenote{ + \url{https://data.worldbank.org/}} +includes a Microdata Catalog\sidenote{ +\url{https://microdata.worldbank.org}} +where researchers can publish data and documentation for their projects.\sidenote{ +\url{https://dimewiki.worldbank.org/wiki/Microdata\_Catalog} +\newline +\url{https://dimewiki.worldbank.org/wiki/Checklist:\_Microdata\_Catalog\_submission} +} +The Harvard Dataverse\sidenote{ + \url{https://dataverse.harvard.edu}} +publishes both data and code, +and also creates a data citation for its entries -- +IPA/J-PAL field experiment repository is especially relevant\sidenote{ + \url{https://www.povertyactionlab.org/blog/9-11-19/new-hub-data-randomized-evaluations}} +for those interested in impact evaluation. + What matters is for you to be able to cite or otherwise directly reference the data used. When your raw data is owned by someone else, or for any other reason you are not able to publish it, @@ -420,6 +377,46 @@ \subsection{Publishing data for replication} Access to the embargoed data could be granted for the purposes of study replication, if approved by an IRB. +There will almost always be a trade-off between accuracy and privacy. +For publicly disclosed data, you should favor privacy. +Therefore, before publishing data, +you should carefully perform a \textbf{final de-identification}. +Its objective is to create a dataset for publication +that cannot be manipulated or linked to identify any individual research participant. +If you are following the steps outlined in this book, +you have already removed any direct identifiers after collecting the data. +At this stage, however, you should further remove +all indirect identifiers, and assess the risk of statistical disclosure.\sidenote{ + \textbf{Disclosure risk:} the likelihood that a released data record can be associated with an individual or organization.}\index{statistical disclosure} +To the extent required to ensure reasonable privacy, +potentially identifying variables must be further masked or removed. + +There are a number of tools developed to help researchers de-identify data +and which you should use as appropriate at that stage of data collection. +These include \texttt{PII\_detection}\sidenote{ + \url{https://github.com/PovertyAction/PII\_detection}} +from IPA, +\texttt{PII-scan}\sidenote{ + \url{https://github.com/J-PAL/PII-Scan}} +from JPAL, +and \texttt{sdcMicro}\sidenote{ + \url{https://sdcpractice.readthedocs.io/en/latest/sdcMicro.html}} +from the World Bank. +\index{anonymization} +The \texttt{sdcMicro} tool, in particular, has a feature +that allows you to assess the uniqueness of your data observations, +and simple measures of the identifiability of records from that. +Additional options to protect privacy in data that will become public exist, +and you should expect and intend to release your datasets at some point. +One option is to add noise to data, as the US Census Bureau has proposed,\cite{abowd2018us} +as it makes the trade-off between data accuracy and privacy explicit. +But there are no established norms for such ``differential privacy'' approaches: +most approaches fundamentally rely on judging ``how harmful'' information disclosure would be. +The fact remains that there is always a balance between information release (and therefore transparency) +and privacy protection, and that you should engage with it actively and explicitly. 
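If you want a quick first look at that risk directly in Stata before turning to the tools above, a sketch like the following, with hypothetical quasi-identifiers, counts how many combinations of indirect identifiers describe exactly one respondent:

    preserve
        contract district occupation birth_year, freq(cell_size)
        count if cell_size == 1   // combinations that single out one respondent
    restore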
+The best thing you can do is make a complete record of the steps that have been taken +so that the process can be reviewed, revised, and updated as necessary. + \subsection{Publishing code for replication} Before publishing your code, you should edit it for content and clarity @@ -467,7 +464,7 @@ \subsection{Publishing code for replication} any files created by the code so that they can be recreated quickly. They should also be able to quickly map all the outputs of the code to the locations where they are placed in the associated published material, -such as ensuring that the raw components of figures or tables are clearly identified. +so ensure that the raw components of figures or tables are clearly identified. Documentation in the master script is often used to indicate this information. For example, outputs should clearly correspond by name to an exhibit in the paper, and vice versa. (Supplying a compiling \LaTeX\ document can support this.) diff --git a/code/sample.bib b/code/sample.bib index 24a4ac099..cdea9fc93 100644 --- a/code/sample.bib +++ b/code/sample.bib @@ -1,5 +1,5 @@ @article{flom2005latex, - title={LATEX for academics and researchers who (think they) don't need it}, + title={{LaTeX} for academics and researchers who (think they) don't need it}, author={Flom, Peter}, journal={The PracTEX Journal}, volume={4}, From 95ad3a9fcfbc9fdc03dc9658cb22c4728406ce29 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 18 Feb 2020 15:56:46 -0500 Subject: [PATCH 713/854] Appendix small edits --- appendix/stata-guide.tex | 26 +++++++++++++------------- code/stata-conditional-expressions1.do | 8 ++++---- 2 files changed, 17 insertions(+), 17 deletions(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index 4c3d9a56e..f25363363 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -23,7 +23,7 @@ This appendix begins with a short section containing instructions on how to access and use the code examples shared in this book. -The second section contains the DIME Analytics style guide for Stata code. +The second section contains the DIME Analytics Stata Style Guide. We believe these resources can help anyone write more understandable code, no matter how proficient they are in writing Stata code. Widely accepted and used style guides are common in most programming languages, @@ -45,7 +45,7 @@ \section{Using the code examples in this book} To see the code on GitHub, go to: \url{https://github.com/worldbank/d4di/tree/master/code}. If you are familiar with GitHub you can fork the repository and clone your fork. We only use Stata's built-in datasets in our code examples, -so you do not need to download any data from anywhere. +so you do not need to download any data. If you have Stata installed on your computer, then you will already have the data files used in the code. A less technical way to access the code is to click the individual file in the URL above, then click @@ -97,8 +97,8 @@ \subsection{Understanding Stata code} We understand that it can be confusing to work with packages for first time, but this is the best way to set up your Stata installation to benefit from other -people's work that has been made publicly available, -and once you get used to installing commands like this it will not be confusing at all. +people's work that has been made publicly available. +Once you get used to installing commands like this it will not be confusing at all. 
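A sketch of what that installation block can look like at the top of a master do-file; the package list is illustrative only:

    * Install user-written commands the project needs, once, at the top of the master do-file
    local user_commands ietoolkit iefieldkit   // add any other packages the project uses
    foreach command of local user_commands {
        cap which `command'
        if _rc == 111 ssc install `command'
    }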
All code with user-written commands, furthermore, is best written when it installs such commands at the beginning of the master do-file, so that the user does not have to search for packages manually. @@ -139,9 +139,9 @@ \subsection{Why we use a Stata style guide} \newpage -\section{The DIME Analytics Stata style guide} +\section{The DIME Analytics Stata Style Guide} -While this section is called a \textit{Stata} Style Guide, +While this section is called a \textit{Stata} style guide, many of these practices are agnostic to which programming language you are using: best practices often relate to concepts that are common across many languages. If you are coding in a different language, @@ -214,11 +214,11 @@ \subsection{Abbreviating variables} and will therefore break any code using variable abbreviations. Using wildcards and lists in Stata for variable lists -(texttt{*}, texttt{?}, and texttt{-}) is also discouraged, +(\texttt{*}, \texttt{?}, and \texttt{-}) is also discouraged, because the functionality of the code may change if the dataset is changed or even simply reordered. If you intend explicitly to capture all variables of a certain type, -prefer texttt{unab} or texttt{lookfor} to build that list in a local macro, +prefer \texttt{unab} or \texttt{lookfor} to build that list in a local macro, which can then be checked to have the right variables in the right order. \subsection{Writing loops} @@ -310,14 +310,14 @@ \subsection{Using macros} You can use all lower case (\texttt{mymacro}), underscores (\texttt{my\_macro}), or ``camel case'' (\texttt{myMacro}), as long as you are consistent. Simple prefixes are useful and encouraged such as \texttt{this\_estimate} or \texttt{current\_var}, -or (using \texttt{camelCase}) \texttt{lastValue}, \texttt{allValues}, or \texttt{nValues}. -Nested locals (\texttt{if \`{}\`{}value\textquotesingle\textquotesingle}) +or, using \texttt{camelCase}, \texttt{lastValue}, \texttt{allValues}, or \texttt{nValues}. +Nested locals (\texttt{\`{}\`{}value\textquotesingle\textquotesingle}) are also possible for a variety of reasons when looping, and should be indicated in comments. If you need a macro to hold a literal macro name, it can be done using the backslash escape character; this causes the stored macro to be evaluated at the usage of the macro rather than at its creation. -This function should be used sparingly and commented extensively. +This functionality should be used sparingly and commented extensively. \codeexample{stata-macros.do}{./code/stata-macros.do} @@ -371,7 +371,7 @@ \subsection{Line breaks} since indentations should reflect that the command continues to a new line. Break lines where it makes functional sense. You can write comments after \texttt{///} just as with \texttt{//}, and that is usually a good thing. -(The \texttt{\#delimit} command should only be used for advanced function programming +The \texttt{\#delimit} command should only be used for advanced function programming and is officially discouraged in analytical code.\cite{cox2005styleguide} Never, for any reason, use \texttt{/* */} to wrap a line: it is distracting and difficult to follow compared to the use @@ -451,9 +451,9 @@ \subsection{Miscellaneous notes} \bigskip\noindent Make sure your code doesn't print very much to the results window as this is slow. This can be accomplished by using \texttt{run file.do} rather than \texttt{do file.do}. -Therefore, it is faster to run outputs from commands like \texttt{reg} using the \texttt{qui} prefix. 
 Interactive commands like \texttt{sum} or \texttt{tab} should be used sparingly in do-files,
 unless they are for the purpose of getting \texttt{r()}-statistics.
 In that case, consider using the \texttt{qui} prefix to prevent printing output.
+It is also faster to get outputs from commands like \texttt{reg} using the \texttt{qui} prefix.
 
 \mainmatter
 
diff --git a/code/stata-conditional-expressions1.do b/code/stata-conditional-expressions1.do
index ed0b75f59..0416bf7eb 100644
--- a/code/stata-conditional-expressions1.do
+++ b/code/stata-conditional-expressions1.do
@@ -1,9 +1,9 @@
 GOOD:
-    replace gender_string = "Female" if (gender == 1)
-    replace gender_string = "Male"   if ((gender != 1) & !missing(gender))
+    replace gender_string = "Woman" if (gender == 1)
+    replace gender_string = "Man"   if ((gender != 1) & !missing(gender))
 
 BAD:
-    replace gender_string = "Female" if gender == 1
-    replace gender_string = "Male"   if (gender ~= 1)
+    replace gender_string = "Woman" if gender == 1
+    replace gender_string = "Man"   if (gender ~= 1)
 
From dc91a4d50a3dece209436b0933fd2cd58f6deb76 Mon Sep 17 00:00:00 2001
From: kbjarkefur
Date: Tue, 18 Feb 2020 16:40:13 -0500
Subject: [PATCH 714/854] remove /wiki/ from all dimewiki links

---
 appendix/stata-guide.tex                  |  2 +-
 chapters/data-analysis.tex                | 50 +++++++++++------------
 chapters/data-collection.tex              | 46 ++++++++++-----------
 chapters/handling-data.tex                | 30 +++++++-------
 chapters/introduction.tex                 |  8 ++--
 chapters/planning-data-work.tex           | 10 ++---
 chapters/publication.tex                  |  4 +-
 chapters/research-design.tex              | 22 +++++-----
 chapters/sampling-randomization-power.tex | 26 ++++++------
 9 files changed, 99 insertions(+), 99 deletions(-)

diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex
index f25363363..db719fab7 100644
--- a/appendix/stata-guide.tex
+++ b/appendix/stata-guide.tex
@@ -419,7 +419,7 @@ \subsection{Saving data}
 If there is a unique ID variable or a set of ID variables,
 the code should test that they are uniquely and fully identifying the data set.\sidenote{
-	\url{https://dimewiki.worldbank.org/wiki/ID\_Variable\_Properties}}
+	\url{https://dimewiki.worldbank.org/ID\_Variable\_Properties}}
 ID variables are also perfect variables to sort on,
 and to \texttt{order} first in the data set.
 
diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex
index d7564648d..60218ab51 100644
--- a/chapters/data-analysis.tex
+++ b/chapters/data-analysis.tex
@@ -62,15 +62,15 @@ \subsection{Organizing your folder structure}
 Our preferred scheme reflects the task breakdown that will be outlined in this chapter.
 \index{data organization}
 Our team at DIME Analytics developed the \texttt{iefolder}\sidenote{
-	\url{https://dimewiki.worldbank.org/wiki/iefolder}}
+	\url{https://dimewiki.worldbank.org/iefolder}}
 command (part of \texttt{ietoolkit}\sidenote{
-	\url{https://dimewiki.worldbank.org/wiki/ietoolkit}})
+	\url{https://dimewiki.worldbank.org/ietoolkit}})
 to automate the creation of a folder following this scheme
 and to standardize folder structures across teams and projects.
Standardizing folders greatly reduces the costs that PIs and RAs face when switching between projects, because they are all organized in exactly the same way and use the same filepaths, shortcuts, and macro references.\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/DataWork\_Folder}} + \url{https://dimewiki.worldbank.org/DataWork\_Folder}} We created \texttt{iefolder} based on our experience with primary data, but it can be used for different types of data, and adapted to fit different needs. @@ -78,7 +78,7 @@ \subsection{Organizing your folder structure} the principle of creating a single unified standard remains. At the top level of the structure created by \texttt{iefolder} are what we call ``round'' folders.\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/DataWork\_Survey\_Round}} + \url{https://dimewiki.worldbank.org/DataWork\_Survey\_Round}} You can think of a ``round'' as a single source of data, which will all be cleaned using a single script. Inside each round folder, there are dedicated folders for: @@ -87,7 +87,7 @@ \subsection{Organizing your folder structure} The folders that hold code are organized in parallel to these, so that the progression through the whole project can be followed by anyone new to the team. Additionally, \texttt{iefolder} creates \textbf{master do-files}\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Master\_Do-files}} + \url{https://dimewiki.worldbank.org/Master\_Do-files}} so the structure of all project code is reflected in a top-level script. \subsection{Breaking down tasks} @@ -160,7 +160,7 @@ \section{De-identifying research data} re-created and nearly always contain confidential data such as personally-identifying information\index{personally-identifying information}. As described in the previous chapter, confidential data must always be -encrypted\sidenote{\url{https://dimewiki.worldbank.org/wiki/Encryption}} and be +encrypted\sidenote{\url{https://dimewiki.worldbank.org/Encryption}} and be properly backed up since every other data file you will use is created from the raw data. The only data sets that can not be re-created are the raw data themselves. @@ -177,7 +177,7 @@ \section{De-identifying research data} Loading encrypted data frequently can be disruptive to the workflow. To facilitate the handling of the data, remove any personally identifiable information from the data set. This will create a de-identified data set, that can be saved in a non-encrypted folder. -De-identification,\sidenote{\url{https://dimewiki.worldbank.org/wiki/De-identification}} +De-identification,\sidenote{\url{https://dimewiki.worldbank.org/De-identification}} at this stage, means stripping the data set of direct identifiers.\sidenote{\url{ https://www.povertyactionlab.org/sites/default/files/resources/J-PAL-guide-to-deidentifying-data.pdf}} To be able to do so, you will need to go through your data set and find all the variables that contain identifying information. 
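A minimal sketch of this step is shown below. The search terms, variable names, and folder global are hypothetical placeholders, since the list of identifiers is always project-specific, and the \texttt{iecodebook} workflow described just below can manage the same selection from a single spreadsheet.

    * Search variable names and labels for likely identifiers (hypothetical search terms)
    lookfor name phone address gps
    * Drop the direct identifiers found for this project (hypothetical variable names)
    drop respondent_name phone_number gps_latitude gps_longitude
    * Save the de-identified copy in a non-encrypted project folder (hypothetical global)
    save "${deidentified_data}/baseline_deidentified.dta" , replace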
@@ -194,7 +194,7 @@ \section{De-identifying research data} The \texttt{iefieldkit} command \texttt{iecodebook} lists all variables in a data set and exports an Excel sheet where you can easily select which variables to keep or drop.\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Iecodebook}} + \url{https://dimewiki.worldbank.org/Iecodebook}} Once you have a list of variables that contain PII, assess them against the analysis plan and first ask yourself for each variable: @@ -231,7 +231,7 @@ \section{De-identifying research data} \section{Cleaning data for analysis} Data cleaning is the second stage in the transformation of data you received into data that you can analyze.\sidenote{\url{ -https://dimewiki.worldbank.org/wiki/Data\_Cleaning}} +https://dimewiki.worldbank.org/Data\_Cleaning}} The cleaning process involves (1) making the data set easily usable and understandable, and (2) documenting individual data points and patterns that may bias the analysis. The underlying data structure does not change. @@ -248,16 +248,16 @@ \subsection{Correcting data entry errors} There are two main cases when the raw data will be modified during data cleaning. The first one is when there are duplicated entries in the data. -Ensuring that observations are uniquely and fully identified\sidenote{\url{https://dimewiki.worldbank.org/wiki/ID\_Variable\_Properties}} +Ensuring that observations are uniquely and fully identified\sidenote{\url{https://dimewiki.worldbank.org/ID\_Variable\_Properties}} is possibly the most important step in data cleaning. Modern survey tools create unique observation identifiers. That, however, is not the same as having a unique ID variable for each individual in the sample. You want to make sure the data set has a unique ID variable -that can be cross-referenced with other records, such as the master data set\sidenote{\url{https://dimewiki.worldbank.org/wiki/Master\_Data\_Set}} +that can be cross-referenced with other records, such as the master data set\sidenote{\url{https://dimewiki.worldbank.org/Master\_Data\_Set}} and other rounds of data collection. \texttt{ieduplicates} and \texttt{iecompdup}, two Stata commands included in the \texttt{iefieldkit} -package\index{iefieldkit},\sidenote{\url{https://dimewiki.worldbank.org/wiki/iefieldkit}} +package\index{iefieldkit},\sidenote{\url{https://dimewiki.worldbank.org/iefieldkit}} create an automated workflow to identify, correct and document occurrences of duplicate entries. @@ -319,7 +319,7 @@ \subsection{Labeling, annotating, and finalizing clean data} The \texttt{iecodebook} command suite, also part of \texttt{iefieldkit}, is designed to make some of the most tedious components of this process, such as renaming, relabeling, and value labeling, much easier.\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/iecodebook}} + \url{https://dimewiki.worldbank.org/iecodebook}} \index{iecodebook} We have a few recommendations on how to use this command, @@ -330,12 +330,12 @@ \subsection{Labeling, annotating, and finalizing clean data} Applying labels makes it easier to understand what the data mean as you explore it, and thus reduces the risk of small errors making their way through into the analysis stage. Variable and value labels should be accurate and concise.\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning\#Applying\_Labels}} + \url{https://dimewiki.worldbank.org/Data\_Cleaning\#Applying\_Labels}} Applying labels makes it easier to understand what the data is showing while exploring the data. 
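As a small illustration of the labeling step, using a built-in dataset (the label names below are arbitrary), the underlying commands are short, and \texttt{iecodebook} lets you apply them in bulk from the same spreadsheet:

    sysuse auto , clear
    label variable rep78 "Repair record (1978)"        // concise, accurate variable label
    label define origin_lbl 0 "Domestic" 1 "Foreign"   // value labels for a categorical variable
    label values foreign origin_lbl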
This minimizes the risk of small errors making their way through into the analysis stage. -Variable and value labels should be accurate and concise.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning\#Applying\_Labels}} +Variable and value labels should be accurate and concise.\sidenote{\url{https://dimewiki.worldbank.org/Data\_Cleaning\#Applying\_Labels}} Third, recodes should be used to turn codes for ``Don't know'', ``Refused to answer'', and -other non-responses into extended missing values.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_Cleaning\#Survey\_Codes\_and\_Missing\_Values}} +other non-responses into extended missing values.\sidenote{\url{https://dimewiki.worldbank.org/Data\_Cleaning\#Survey\_Codes\_and\_Missing\_Values}} String variables that correspond to categorical variables need to be encoded. Open-ended responses stored as strings usually have a high risk of being identifiers, so they should be dropped at this point. @@ -350,7 +350,7 @@ \subsection{Documenting data cleaning} including enumerator manuals, survey instruments, supervisor notes, and data quality monitoring reports. These materials are essential for data documentation.\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Data\_Documentation}} + \url{https://dimewiki.worldbank.org/Data\_Documentation}} \index{Documentation} They should be stored in the corresponding \texttt{Documentation} folder for easy access, as you will probably need them during analysis, @@ -392,10 +392,10 @@ \section{Constructing final indicators} During this process, the data points will typically be reshaped and aggregated so that level of the data set goes from the unit of observation (one item in the bundle) in the survey to the unit of analysis (the household),\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Unit\_of\_Observation}} + \url{https://dimewiki.worldbank.org/Unit\_of\_Observation}} so that the level of the data set goes from the unit of observation (one item in the bundle) in the survey to the unit of analysis (the household).\sidenote{ -\url{https://dimewiki.worldbank.org/wiki/Unit\_of\_Observation}} +\url{https://dimewiki.worldbank.org/Unit\_of\_Observation}} A constructed data set is built to answer an analysis question. Since different pieces of analysis may require different samples, @@ -606,7 +606,7 @@ \subsection{Organizing analysis code} \subsection{Visualizing data} -\textbf{Data visualization}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Data\_visualization}} \index{data visualization} +\textbf{Data visualization}\sidenote{\url{https://dimewiki.worldbank.org/Data\_visualization}} \index{data visualization} is increasingly popular, and is becoming a field in its own right.\cite{healy2018data,wilke2019fundamentals} Whole books have been written on how to create good data visualizations, so we will not attempt to give you advice on it. @@ -650,12 +650,12 @@ \subsection{Exporting analysis outputs} Our team has created a few products to automate common outputs and save you precious research time. The \texttt{ietoolkit} package includes two commands to export nicely formatted tables. -\texttt{iebaltab}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Iebaltab}} +\texttt{iebaltab}\sidenote{\url{https://dimewiki.worldbank.org/Iebaltab}} creates and exports balance tables to excel or {\LaTeX}. 
-\texttt{ieddtab}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Ieddtab}} +\texttt{ieddtab}\sidenote{\url{https://dimewiki.worldbank.org/Ieddtab}} does the same for difference-in-differences regressions. It also includes a command, \texttt{iegraph},\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Iegraph}} + \url{https://dimewiki.worldbank.org/Iegraph}} to export pre-formatted impact evaluation results graphs. It's okay to not export each and every table and graph created during exploratory analysis. @@ -709,8 +709,8 @@ \subsection{Exporting analysis outputs} This means it should be easy to read and understand them with only the information they contain. Make sure labels and notes cover all relevant information, such as sample, unit of observation, unit of measurement and variable definition.\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Checklist:\_Reviewing\_Graphs} \\ - \url{https://dimewiki.worldbank.org/wiki/Checklist:\_Submit\_Table}} + \url{https://dimewiki.worldbank.org/Checklist:\_Reviewing\_Graphs} \\ + \url{https://dimewiki.worldbank.org/Checklist:\_Submit\_Table}} If you follow the steps outlined in this chapter, most of the data work involved in the last step of the research process diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index fdcf967e6..03a19e075 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -204,10 +204,10 @@ \subsection{Developing a data collection instrument} such as from the World Bank's Living Standards Measurement Survey.\cite{glewwe2000designing} The focus of this section is the design of electronic field surveys, often referred to as Computer Assisted Personal Interviews (CAPI).\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Computer-Assisted\_Personal\_Interviews\_(CAPI)}} + \url{https://dimewiki.worldbank.org/Computer-Assisted\_Personal\_Interviews\_(CAPI)}} Although most surveys are now collected electronically, by tablet, mobile phone or web browser, \textbf{questionnaire design}\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Questionnaire_Design}} + \url{https://dimewiki.worldbank.org/Questionnaire_Design}} \index{questionnaire design} (content development) and questionnaire programming (functionality development) should be seen as two strictly separate tasks. @@ -233,12 +233,12 @@ \subsection{Developing a data collection instrument} begin from broad concepts and slowly flesh out the specifics. It is essential to start with a clear understanding of the \textbf{theory of change}\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Theory_of_Change}} + \url{https://dimewiki.worldbank.org/Theory_of_Change}} and \textbf{experimental design} for your project. The first step of questionnaire design is to list key outcomes of interest, as well as the main covariates to control for and any variables needed for experimental design. The ideal starting point for this is a \textbf{pre-analysis plan}.\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Pre-Analysis_Plan}} + \url{https://dimewiki.worldbank.org/Pre-Analysis_Plan}} Use the list of key outcomes to create an outline of questionnaire \textit{modules}. Do not number the modules; instead use a short prefix so they can be easily reordered. @@ -248,9 +248,9 @@ \subsection{Developing a data collection instrument} a household income module should be answered by the person responsible for household finances, and a module on agricultural production might be repeated for each crop the household cultivated. 
Each module should then be expanded into specific indicators to observe in the field.\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Literature_Review_for_Questionnaire}} + \url{https://dimewiki.worldbank.org/Literature_Review_for_Questionnaire}} At this point, it is useful to do a \textbf{content-focused pilot}\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Piloting_Survey_Content}} + \url{https://dimewiki.worldbank.org/Piloting_Survey_Content}} of the questionnaire. Doing this pilot with a pen-and-paper questionnaire encourages more significant revisions, as there is no need to factor in costs of re-programming, @@ -264,10 +264,10 @@ \subsection{Developing a data collection instrument} Once the content of the survey is drawn up, the team should conduct a small \textbf{survey pilot}\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Survey_Pilot}} + \url{https://dimewiki.worldbank.org/Survey_Pilot}} using the paper forms to finalize questionnaire design and detect any content issues. A content-focused pilot\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Piloting_Survey_Content}} + \url{https://dimewiki.worldbank.org/Piloting_Survey_Content}} is best done on pen and paper, before the questionnaire is programmed, because changes at this point may be deep and structural, which are hard to adjust in code. The objective is to improve the structure and length of the questionnaire, @@ -286,11 +286,11 @@ \subsection{Designing surveys for electronic deployment} Electronic data collection has great potential to simplify survey implementation and improve data quality. Electronic questionnaires are typically created in a spreadsheet (e.g. Excel or Google Sheets) or a software-specific form builder, all of which are accessible even to novice users.\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Questionnaire_Programming}} + \url{https://dimewiki.worldbank.org/Questionnaire_Programming}} We will not address software-specific form design in this book; rather, we focus on coding conventions that are important to follow for electronic surveys regardless of software choice.\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/SurveyCTO_Coding_Practices}} + \url{https://dimewiki.worldbank.org/SurveyCTO_Coding_Practices}} Survey software tools provide a wide range of features designed to make implementing even highly complex surveys easy, scalable, and secure. However, these are not fully automatic: you need to actively design and manage the survey. @@ -392,7 +392,7 @@ \subsection{Programming electronic questionnaires} This is not sufficient, however, to ensure that the resulting dataset will load without errors in your data analysis software of choice. We developed the \texttt{ietestform} command,\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/ietestform}} + \url{https://dimewiki.worldbank.org/ietestform}} part of the Stata package \texttt{iefieldkit}, to implement a form-checking routine for \textbf{SurveyCTO}, a proprietary implementation of the \textbf{Open Data Kit (ODK)} software. @@ -437,7 +437,7 @@ \subsection{Implementing high frequency quality checks} simplifies monitoring and improves data quality. As part of data collection preparation, the research team should develop a \textbf{data quality assurance plan}\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Data_Quality_Assurance_Plan}}. + \url{https://dimewiki.worldbank.org/Data_Quality_Assurance_Plan}}. 
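Many of the checks in such a plan can be written as a short script that runs on each day's submissions; the sketch below uses hypothetical file and variable names, and the paragraphs that follow describe these checks, and the commands that support them, in more detail.

    * Minimal daily data quality checks (file and variable names are hypothetical)
    use "submissions_today.dta" , clear
    isid hhid                                  // stops with an error if the ID is not unique
    assert !missing(interview_date, consent)   // key fields should never be missing
    merge 1:1 hhid using "sample_list.dta"     // compare submissions against the expected sample
    tab _merge                                 // _merge == 2 flags sampled units not yet interviewed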
 While data collection is ongoing,
 a research assistant or data analyst should work closely with the field team or partner
 to ensure that the data collection is progressing correctly,
@@ -462,9 +462,9 @@ \subsection{Implementing high frequency quality checks}
 so cross-referencing with other data sources may be necessary to validate data.
 Even with careful management, it is often the case that raw data includes duplicate or missing entries,
 which may occur due to data entry errors or failed submissions to data servers.\sidenote{
-	\url{https://dimewiki.worldbank.org/wiki/Duplicates_and_Survey_Logs}}
+	\url{https://dimewiki.worldbank.org/Duplicates_and_Survey_Logs}}
 \texttt{ieduplicates}\sidenote{
-	\url{https://dimewiki.worldbank.org/wiki/ieduplicates}}
+	\url{https://dimewiki.worldbank.org/ieduplicates}}
 provides a workflow for collaborating on the resolution of duplicate entries between you and the provider.
 Then, observed units in the data must be validated against the expected sample:
 this is as straightforward as merging the sample list with the survey data and checking for mismatches.
@@ -487,7 +487,7 @@ \subsection{Implementing high frequency quality checks}
 validation of complex calculations like crop yields or medicine stocks (which require unit conversions),
 suspicious patterns in survey timing,
 or atypical response patterns from specific data sources or enumerators.\sidenote{
-	\url{https://dimewiki.worldbank.org/wiki/Monitoring_Data_Quality}}
+	\url{https://dimewiki.worldbank.org/Monitoring_Data_Quality}}
 Electronic data entry software typically provides rich metadata,
 which can be useful in assessing data quality.
 For example, automatically collected timestamps show when data was submitted
@@ -517,7 +517,7 @@ \subsection{Conducting back-checks and data validation}
 that comes from variation in the realization of key outcomes,
 primary data collection provides the opportunity
 to make sure that there is no error arising from inaccuracies in the data itself.
-\textbf{Back-checks}\sidenote{\url{https://dimewiki.worldbank.org/wiki/Back_Checks}} and
+\textbf{Back-checks}\sidenote{\url{https://dimewiki.worldbank.org/Back_Checks}} and
 other validation audits help ensure that data collection is following established protocols,
 and that data is not falsified, incomplete, or otherwise suspect.
 For back-checks and validation audits, a random subset of the main data is selected,
@@ -563,17 +563,17 @@ \section{Collecting and sharing data securely}
 All sensitive data must be handled in a way where there is no risk that anyone who
 is not approved by an Institutional Review Board (IRB)\sidenote{
-	\url{https://dimewiki.worldbank.org/wiki/IRB\_Approval}}
+	\url{https://dimewiki.worldbank.org/IRB\_Approval}}
 for the specific project has the ability to access the data.
 Data can be sensitive for multiple reasons,
 but the two most common reasons are that it contains personally identifiable
 information (PII)\sidenote{
-	\url{https://dimewiki.worldbank.org/wiki/Personally\_Identifiable\_Information\_(PII)}}
+	\url{https://dimewiki.worldbank.org/Personally\_Identifiable\_Information\_(PII)}}
 or that the partner providing the data does not want it to be released.
Central to data security is \index{encryption}\textbf{data encryption}, which is a group of methods that ensure that files are unreadable even if laptops are stolen, servers are hacked, or unauthorized access to the data is obtained in any other way.\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Encryption}} + \url{https://dimewiki.worldbank.org/Encryption}} Proper encryption is rarely just a single method, as the data will travel through many servers, devices, and computers from the source of the data to the final analysis. @@ -612,7 +612,7 @@ \subsection{Collecting data securely} In field surveys, most common data collection software will automatically encrypt all data in transit (i.e., upload from field or download from server).\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Encryption\#Encryption\_in\_Transit}} + \url{https://dimewiki.worldbank.org/Encryption\#Encryption\_in\_Transit}} If this is implemented by the software you are using, then your data will be encrypted from the time it leaves the device (in tablet-assisted data collection) or browser (in web data collection), @@ -626,7 +626,7 @@ \subsection{Collecting data securely} Even though your data is therefore usually safe while it is being transmitted, it is not automatically secure when it is being stored. \textbf{Encryption at rest}\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Encryption\#Encryption\_at\_Rest}} + \url{https://dimewiki.worldbank.org/Encryption\#Encryption\_at\_Rest}} is the only way to ensure that PII data remains private when it is stored on a server on the internet. You must keep your data encrypted on the data collection server whenever PII data is collected. @@ -650,7 +650,7 @@ \subsection{Collecting data securely} never pass through the hands of a third party, including the data storage application. Most survey software implement \textbf{asymmetric encryption}\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Encryption\#Asymmetric\_Encryption}} + \url{https://dimewiki.worldbank.org/Encryption\#Asymmetric\_Encryption}} where there are two keys in a public/private key pair. Only the private key can be used to decrypt the encrypted data, and the public key can only be used to encrypt the data. @@ -677,7 +677,7 @@ \subsection{Storing data securely} from the data collection device to the data collection server, it is not practical once you start interacting with the data. Instead, we use \textbf{symmetric encryption}\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Encryption\#Symmetric\_Encryption}} + \url{https://dimewiki.worldbank.org/Encryption\#Symmetric\_Encryption}} where we create a secure encrypted folder, using, for example, VeraCrypt.\sidenote{\url{https://www.veracrypt.fr/}} Here, a single key is used to both encrypt and decrypt the information. diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index b4a247902..cd041d2f9 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -86,7 +86,7 @@ \subsection{Research reproducibility} as a resource to others who have similar questions. Secondly, reproducible research\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Reproducible_Research}} + \url{https://dimewiki.worldbank.org/Reproducible_Research}} enables other researchers to re-use your code and processes to do their own work more easily and effectively in the future. 
This may mean applying your techniques to their data @@ -99,7 +99,7 @@ \subsection{Research reproducibility} It should be easy to read and understand in terms of structure, style, and syntax. Finally, the corresponding dataset should be openly accessible unless for legal or ethical reasons it cannot be.\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Publishing_Data}} + \url{https://dimewiki.worldbank.org/Publishing_Data}} \subsection{Research transparency} @@ -109,7 +109,7 @@ \subsection{Research transparency} This means that readers are able to judge for themselves if the research was done well and the decision-making process was sound. If the research is well-structured, and all of the relevant documentation\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Data_Documentation}} + \url{https://dimewiki.worldbank.org/Data_Documentation}} is shared, this makes it easy for the reader to understand the analysis later. Expecting process transparency is also an incentive for researchers to make better decisions, be skeptical and thorough about their assumptions, @@ -117,9 +117,9 @@ \subsection{Research transparency} because it requires methodical organization that is labor-saving over the complete course of a project. Tools like \textbf{pre-registration}\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Pre-Registration}}, + \url{https://dimewiki.worldbank.org/Pre-Registration}}, \textbf{pre-analysis plans}\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Pre-Analysis_Plan}}, + \url{https://dimewiki.worldbank.org/Pre-Analysis_Plan}}, and \textbf{registered reports}\sidenote{ \url{https://blogs.worldbank.org/impactevaluations/registered-reports-piloting-pre-results-review-process-journal-development-economics}} can help with this process where they are available.\index{pre-registration}\index{pre-analysis plans}\index{Registered Reports} @@ -160,7 +160,7 @@ \subsection{Research transparency} The \textbf{Open Science Framework}\sidenote{\url{https://osf.io/}} provides one such solution,\index{Open Science Framework} with integrated file storage, version histories, and collaborative wiki pages. \textbf{GitHub}\sidenote{\url{https://github.com}} provides a transparent documentation system\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Getting_started_with_GitHub}},\index{task management}\index{GitHub} + \url{https://dimewiki.worldbank.org/Getting_started_with_GitHub}},\index{task management}\index{GitHub} in addition to version histories and wiki pages. Such services offer multiple different ways to record the decision process leading to changes and additions, @@ -233,7 +233,7 @@ \section{Ensuring privacy and security in research data} information (PII)}.\index{personally-identifying information}\index{primary data}\sidenote{ \textbf{Personally-identifying information:} any piece or set of information that can be used to identify an individual research subject. - \url{https://dimewiki.worldbank.org/wiki/De-identification\#Personally\_Identifiable\_Information}} + \url{https://dimewiki.worldbank.org/De-identification\#Personally\_Identifiable\_Information}} PII data contains information that can, without any transformation, be used to identify individual people, households, villages, or firms that were part of data collection. 
 \index{data collection}
@@ -283,7 +283,7 @@ \subsection{Obtaining ethical approval and consent}
 \index{Institutional Review Board}
 Most commonly this consists of a formal application for approval of a specific
 protocol for consent, data collection, and data handling.\sidenote{
-	\url{https://dimewiki.worldbank.org/wiki/IRB_Approval}}
+	\url{https://dimewiki.worldbank.org/IRB_Approval}}
 It is not always apparent which IRB has sole authority over your project,
 particularly if some institutions do not have their own.
 It is customary to obtain an approval from a university IRB
@@ -324,7 +324,7 @@ \subsection{Obtaining ethical approval and consent}
 \subsection{Transmitting and storing data securely}
 
 Secure data storage and transfer are ultimately your personal responsibility.\sidenote{
-	\url{https://dimewiki.worldbank.org/wiki/Data_Security}}
+	\url{https://dimewiki.worldbank.org/Data_Security}}
 There are several precautions needed to ensure that your data is safe.
 First, all online and offline accounts
 -- including personal accounts like computer logins and email --
@@ -337,7 +337,7 @@ \subsection{Transmitting and storing data securely}
 Data sets that include confidential information \textit{must} therefore be \textbf{encrypted}\sidenote{
 \textbf{Encryption:} Methods which ensure that files are unreadable even if laptops are stolen, databases are hacked, or any other type of unauthorized access is obtained.
-	\url{https://dimewiki.worldbank.org/wiki/encryption}}
+	\url{https://dimewiki.worldbank.org/encryption}}
 during data collection, storage, and transfer.\index{encryption}\index{data transfer}\index{data storage}
 The biggest security gap is often in transmitting survey plans to and from staff in the field.
 To protect information in transit to field staff, some key steps are:
@@ -349,10 +349,10 @@ \subsection{Transmitting and storing data securely}
 
 Most modern data collection software has features that,
 if enabled, make secure transmission straightforward.\sidenote{
-	\url{https://dimewiki.worldbank.org/wiki/Encryption\#Encryption\_in\_Transit}}
+	\url{https://dimewiki.worldbank.org/Encryption\#Encryption\_in\_Transit}}
 Many also have features that ensure data is encrypted when stored on their servers,
 although this usually needs to be actively enabled and administered.\sidenote{
-	\url{https://dimewiki.worldbank.org/wiki/Encryption\#Encryption\_at\_Rest}}
+	\url{https://dimewiki.worldbank.org/Encryption\#Encryption\_at\_Rest}}
 When files are properly encrypted,
 the information they contain will be completely unreadable
 and unusable even if they were to be intercepted by a malicious
@@ -393,14 +393,14 @@ \subsection{De-identifying and anonymizing information}
 
 Most of the field research done in development involves human subjects.\sidenote{
-	\url{https://dimewiki.worldbank.org/wiki/Human_Subjects_Approval}}
+	\url{https://dimewiki.worldbank.org/Human_Subjects_Approval}}
 \index{human subjects}
 As a researcher, you are asking people to trust you with personal information about themselves:
 where they live, how rich they are, whether they have committed or been victims of crimes,
 their names, their national identity numbers, and all sorts of other data.
 PII data carries strict expectations about data storage and handling,
 and it is the responsibility of the research team to satisfy these expectations.\sidenote{
-	\url{https://dimewiki.worldbank.org/wiki/Research_Ethics}}
+	\url{https://dimewiki.worldbank.org/Research_Ethics}}
 Your donor or employer will most likely require you to hold a certification from a source
 such as Protecting Human Research Participants\sidenote{
 \url{https://phrptraining.com}}
@@ -422,7 +422,7 @@ \subsection{De-identifying and anonymizing information}
 Therefore, once data is securely collected and stored, the first thing you will generally do is \textbf{de-identify} it,
 that is, remove direct identifiers of the individuals in the dataset.\sidenote{
-	\url{https://dimewiki.worldbank.org/wiki/De-identification}}
+	\url{https://dimewiki.worldbank.org/De-identification}}
 \index{de-identification}
 Note, however, that it is in practice impossible to \textbf{anonymize} data.
 There is always some statistical chance that an individual's identity
diff --git a/chapters/introduction.tex b/chapters/introduction.tex
index e6702a261..978a88da1 100644
--- a/chapters/introduction.tex
+++ b/chapters/introduction.tex
@@ -63,7 +63,7 @@ \section{Doing credible research at scale}
 This book complements the DIME Wiki by providing a structured narrative of the data workflow for a typical research project.
 We will not give a lot of highly specific details in this text,
 but we will point you to where they can be found.\sidenote{Like this:
-\url{https://dimewiki.worldbank.org/wiki/Primary_Data_Collection}}
+\url{https://dimewiki.worldbank.org/Primary_Data_Collection}}
 Each chapter focuses on one task, providing a primarily narrative account of:
 what you will be doing; where in the workflow this task falls;
 when it should be done; and how to implement it according to best practices.
@@ -121,7 +121,7 @@ \section{Writing reproducible code in a collaborative environment}
 little ambiguity about how something ought to be done,
 and therefore the tools to do it can be set in advance.
 Standard processes for code help other people to read your code.\sidenote{
-\url{https://dimewiki.worldbank.org/wiki/Stata_Coding_Practices}}
+\url{https://dimewiki.worldbank.org/Stata_Coding_Practices}}
 Code should be well-documented, contain extensive comments, and be readable in the sense that others can:
 (1) quickly understand what a portion of code is supposed to be doing;
 (2) evaluate whether or not it does that thing correctly; and
@@ -156,8 +156,8 @@ \section{Writing reproducible code in a collaborative environment}
 and uses built-in functions as much as possible.
 We will point to user-written functions when they provide important tools.
 In particular, we point to two suites of Stata commands developed by DIME Analytics,
-\texttt{ietoolkit}\sidenote{\url{https://dimewiki.worldbank.org/wiki/ietoolkit}} and
-\texttt{iefieldkit},\sidenote{\url{https://dimewiki.worldbank.org/wiki/iefieldkit}}
+\texttt{ietoolkit}\sidenote{\url{https://dimewiki.worldbank.org/ietoolkit}} and
+\texttt{iefieldkit},\sidenote{\url{https://dimewiki.worldbank.org/iefieldkit}}
 which standardize our core data collection, management, and analysis workflows.
We will comment the code generously (as you should), but you should reference Stata help-files by writing \texttt{help [command]} diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index ae4beb317..dd9b96fb2 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -170,7 +170,7 @@ \subsection{Documenting decisions and tasks} -- such as, for example, decisions about sampling -- should immediately be recorded in a system that is designed to keep permanent records. We call these systems collaboration tools, and there are several that are very useful.\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Collaboration_Tools}} + \url{https://dimewiki.worldbank.org/Collaboration_Tools}} \index{collaboration tools} Many collaboration tools are web-based @@ -242,7 +242,7 @@ \subsection{Choosing software} \index{software versions} (For example, our command \texttt{ieboilstart} in the \texttt{ietoolkit} package provides functionality to support Stata version stability.\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/ieboilstart}}) + \url{https://dimewiki.worldbank.org/ieboilstart}}) Next, think about how and where you write and execute code. This book is intended to be agnostic to the size or origin of your data, @@ -348,11 +348,11 @@ \subsection{Organizing files and folder structures} This will prevent future folder reorganizations that may slow down your workflow and, more importantly, ensure that your code files are always able to run on any machine. To support consistent folder organization, DIME Analytics maintains \texttt{iefolder}\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/iefolder}} + \url{https://dimewiki.worldbank.org/iefolder}} as a part of our \texttt{ietoolkit} package.\index{\texttt{iefolder}}\index{\texttt{ietoolkit}} This Stata command sets up a pre-standardized folder structure for what we call the \texttt{DataWork} folder.\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/DataWork_Folder}} + \url{https://dimewiki.worldbank.org/DataWork_Folder}} The \texttt{DataWork} folder includes folders for all the steps of a typical project. \index{\texttt{DataWork} folder} Since each project will always have its own needs, @@ -584,7 +584,7 @@ \subsection{Documenting and organizing code} Because writing and maintaining a master script can be challenging as a project grows, an important feature of the \texttt{iefolder} is to write master do-files and add to them whenever new subfolders are created in the \texttt{DataWork} folder.\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Master\_Do-files}} + \url{https://dimewiki.worldbank.org/Master\_Do-files}} In order to maintain these practices and ensure they are functioning well, you should agree with your team on a plan to review code as it is written. 
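To make this concrete, a stripped-down master script might look like the sketch below. The folder global and do-file names are hypothetical, and a full master do-file, such as the ones \texttt{iefolder} writes, would typically also set the Stata version (for example with \texttt{ieboilstart}) and install any user-written commands the project relies on.

    * Minimal master do-file sketch (globals and do-file names are hypothetical)
    clear all
    set more off
    global datawork "C:/Users/yourname/GitHub/project/DataWork"   // set once per user
    do "${datawork}/dofiles/cleaning.do"       // one do-file per stage of the data work
    do "${datawork}/dofiles/construction.do"
    do "${datawork}/dofiles/analysis.do"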
diff --git a/chapters/publication.tex b/chapters/publication.tex index aa68254bc..0e650df6f 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -312,9 +312,9 @@ \subsection{Publishing data for replication} includes a Microdata Catalog\sidenote{ \url{https://microdata.worldbank.org}} where researchers can publish data and documentation for their projects.\sidenote{ -\url{https://dimewiki.worldbank.org/wiki/Microdata\_Catalog} +\url{https://dimewiki.worldbank.org/Microdata\_Catalog} \newline -\url{https://dimewiki.worldbank.org/wiki/Checklist:\_Microdata\_Catalog\_submission} +\url{https://dimewiki.worldbank.org/Checklist:\_Microdata\_Catalog\_submission} } The Harvard Dataverse\sidenote{ \url{https://dataverse.harvard.edu}} diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 61dfa7da7..5c94c70fd 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -152,12 +152,12 @@ \subsection{Experimental and quasi-experimental research designs} Experimental research designs explicitly allow the research team to change the condition of the populations being studied,\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Experimental_Methods}} + \url{https://dimewiki.worldbank.org/Experimental_Methods}} often in the form of government programs, NGO projects, new regulations, information campaigns, and many more types of interventions.\cite{banerjee2009experimental} The classic experimental causal inference method is the \textbf{randomized control trial (RCT)}.\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Randomized_Control_Trials}} + \url{https://dimewiki.worldbank.org/Randomized_Control_Trials}} \index{randomized control trials} In randomized control trials, the treatment group is randomized -- that is, from an eligible population, @@ -198,7 +198,7 @@ \subsection{Experimental and quasi-experimental research designs} and often overshadow the effort put into the econometric design itself. \textbf{Quasi-experimental} research designs,\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Quasi-Experimental_Methods}} + \url{https://dimewiki.worldbank.org/Quasi-Experimental_Methods}} by contrast, are causal inference methods based on events not controlled by the research team. 
Instead, they rely on ``experiments of nature'', in which natural variation can be argued to approximate @@ -287,9 +287,9 @@ \subsection{Cross-sectional designs} to help with the complete process of data analysis,\sidenote{ \url{https://toolkit.povertyactionlab.org/resource/coding-resources-randomized-evaluations}} including to analyze balance\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/iebaltab}} + \url{https://dimewiki.worldbank.org/iebaltab}} and to visualize treatment effects.\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/iegraph}} + \url{https://dimewiki.worldbank.org/iegraph}} Extensive tools and methods for analyzing selective non-response are available.\sidenote{ \url{https://blogs.worldbank.org/impactevaluations/dealing-attrition-field-experiments}} @@ -299,7 +299,7 @@ \subsection{Difference-in-differences} Where cross-sectional designs draw their estimates of treatment effects from differences in outcome levels in a single measurement, \textbf{differences-in-differences}\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Difference-in-Differences}} + \url{https://dimewiki.worldbank.org/Difference-in-Differences}} designs (abbreviated as DD, DiD, diff-in-diff, and other variants) estimate treatment effects from \textit{changes} in outcomes between two or more rounds of measurement. @@ -361,7 +361,7 @@ \subsection{Difference-in-differences} Therefore there exist a large number of standardized tools for analysis. Our \texttt{ietoolkit} Stata package includes the \texttt{ieddtab} command which produces standardized tables for reporting results.\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/ieddtab}} + \url{https://dimewiki.worldbank.org/ieddtab}} For more complicated versions of the model (and they can get quite complicated quite quickly), you can use an online dashboard to simulate counterfactual results.\sidenote{ @@ -379,7 +379,7 @@ \subsection{Regression discontinuity} \textbf{Regression discontinuity (RD)} designs exploit sharp breaks or limits in policy designs to separate a single group of potentially eligible recipients into comparable groups of individuals who do and do not receive a treatment.\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Regression_Discontinuity}} + \url{https://dimewiki.worldbank.org/Regression_Discontinuity}} These designs differ from cross-sectional and diff-in-diff designs in that the group eligible to receive treatment is not defined directly, but instead created during the treatment implementation. 
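As a minimal sketch of how the difference-in-differences design described above translates into an estimating equation, the basic two-period model is a single interacted regression; the variable and file names here are hypothetical, and \texttt{ieddtab} will export a formatted version of the same comparison.

    * Two-period difference-in-differences sketch (hypothetical variable and file names)
    use "panel_data.dta" , clear
    regress outcome i.treated##i.post , vce(cluster community_id)

The coefficient on the interaction term is the difference-in-differences estimate, and the standard errors are clustered at the level at which the treatment was assigned.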
@@ -454,7 +454,7 @@ \subsection{Instrumental variables} To do so, the IV approach selects an \textbf{instrument} for the treatment status -- an otherwise-unrelated predictor of exposure to treatment that affects the uptake status of an individual.\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/instrumental_variables}} + \url{https://dimewiki.worldbank.org/instrumental_variables}} Whereas regression discontinuity designs are ``sharp'' -- treatment status is completely determined by which side of a cutoff an individual is on -- IV designs are ``fuzzy'', meaning that they do not completely determine @@ -512,7 +512,7 @@ \subsection{Matching} to directly construct treatment and control groups to be as similar as possible to each other, either before a randomization process or after the collection of non-randomized data.\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Matching}} + \url{https://dimewiki.worldbank.org/Matching}} \index{matching} Matching observations may be one-to-one or many-to-many; in any case, the result of a matching process @@ -560,7 +560,7 @@ \subsection{Matching} The coarsened exact matching (\texttt{cem}) package applies the nonparametric approach.\sidenote{ \url{https://gking.harvard.edu/files/gking/files/cem-stata.pdf}} DIME's \texttt{iematch} command in the \texttt{ietoolkit} package produces matchings based on a single continuous matching variable.\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/iematch}} + \url{https://dimewiki.worldbank.org/iematch}} In any of these cases, detailed reporting of the matching model is required, including the resulting effective weights of observations, since in some cases the lack of overlapping supports for treatment and control diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index d778bbb11..cdb22ac2a 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -105,9 +105,9 @@ \subsection{Ensuring reproducibility in random Stata processes} You will \textit{never} be able to reproduce a randomization in a different software, such as moving from Stata to R or vice versa.} The \texttt{ieboilstart} command in \texttt{ietoolkit} provides functionality to support this requirement.\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/ieboilstart}} + \url{https://dimewiki.worldbank.org/ieboilstart}} We recommend you use \texttt{ieboilstart} at the beginning of your master do-file.\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Master_Do-files}} + \url{https://dimewiki.worldbank.org/Master_Do-files}} However, testing your do-files without running them via the master do-file may produce different results, since Stata's \texttt{version} setting expires after each time you run your do-files. @@ -140,7 +140,7 @@ \subsection{Ensuring reproducibility in random Stata processes} You should also describe in your code how the seed was selected. 
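Putting these points together, a reproducible assignment might look like the minimal sketch below, which uses a built-in dataset; the seed value and the half-and-half split are purely illustrative.

    * Reproducible random assignment sketch using a built-in dataset
    version 13.1                     // fix the version so random-number behavior never changes
    sysuse auto , clear
    set seed 510402                  // document in comments how this seed was drawn
    isid make , sort                 // enforce a unique, stable sort order before randomizing
    gen rand = runiform()
    sort rand
    gen treatment = (_n <= _N / 2)   // assign half of the sample to treatment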
 Other commands may induce randomness in the data or alter the seed without you realizing it,
 so carefully confirm exactly how your code runs before finalizing it.\sidenote{
-	\url{https://dimewiki.worldbank.org/wiki/Randomization_in_Stata}}
+	\url{https://dimewiki.worldbank.org/Randomization_in_Stata}}
 
 \codeexample{replicability.do}{./code/replicability.do}
 
@@ -184,14 +184,14 @@ \subsection{Sampling}
 
 \textbf{Sampling} is the process of randomly selecting units of observation
 from a master list of individuals to be included in data collection.\sidenote{
-	\url{https://dimewiki.worldbank.org/wiki/Sampling_\%26_Power_Calculations}}
+	\url{https://dimewiki.worldbank.org/Sampling_\%26_Power_Calculations}}
 \index{sampling}
 That master list may be called a \textbf{sampling universe}, a \textbf{listing frame}, or something similar.
 We recommend that this list be organized in a \textbf{master data set}\sidenote{
-	\url{https://dimewiki.worldbank.org/wiki/Master_Data_Set}},
+	\url{https://dimewiki.worldbank.org/Master_Data_Set}},
 creating an authoritative source for the existence and fixed characteristics
 of each of the units that may be surveyed.\sidenote{
-	\url{https://dimewiki.worldbank.org/wiki/Unit_of_Observation}}
+	\url{https://dimewiki.worldbank.org/Unit_of_Observation}}
 The master data set indicates how many individuals are eligible for data collection,
 and therefore contains statistical information about the likelihood that each will be chosen.
@@ -242,7 +242,7 @@ \subsection{Randomization}
 and must be carefully worked out in more complex designs.
 Just like sampling, the simplest form of randomization is a uniform-probability process.\sidenote{
-	\url{https://dimewiki.worldbank.org/wiki/Randomization_in_Stata}}
+	\url{https://dimewiki.worldbank.org/Randomization_in_Stata}}
 Sampling typically has only two possible outcomes: observed and unobserved.
 Randomization, by contrast, often involves multiple possible results
 which each represent different varieties of treatments to be delivered;
@@ -287,11 +287,11 @@ \section{Clustering and stratification}
 
 \subsection{Clustering}
 
 Many studies observe data at a different level than the randomization unit.\sidenote{
-	\url{https://dimewiki.worldbank.org/wiki/Unit_of_Observation}}
+	\url{https://dimewiki.worldbank.org/Unit_of_Observation}}
 For example, a policy may only be able to affect an entire village,
 but the study is interested in household behavior.
 This type of structure is called \textbf{clustering},\sidenote{
-	\url{https://dimewiki.worldbank.org/wiki/Multi-stage_(Cluster)_Sampling}}
+	\url{https://dimewiki.worldbank.org/Multi-stage_(Cluster)_Sampling}}
 and the groups in which units are assigned to treatment are called clusters.
 The same principle extends to sampling:
 it may be necessary to observe all the children
@@ -317,7 +317,7 @@ \subsection{Stratification}
 
 \textbf{Stratification} is a research design component
 that breaks the full set of observations into a number of subgroups
 before performing randomization within each subgroup.\sidenote{
-	\url{https://dimewiki.worldbank.org/wiki/Stratified_Random_Sample}}
+	\url{https://dimewiki.worldbank.org/Stratified_Random_Sample}}
 This has the effect of ensuring that members of each subgroup
 are included in all groups of the randomization process,
 since it is possible that a global randomization
@@ -414,7 +414,7 @@ \subsection{Power calculations}
 There are two common and useful practical applications
 of that definition that give actionable, quantitative results.
The \textbf{minimum detectable effect (MDE)}\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Minimum_Detectable_Effect}} + \url{https://dimewiki.worldbank.org/Minimum_Detectable_Effect}} is the smallest true effect that a given research design can detect. This is useful as a check on whether a study is worthwhile. If, in your field, a ``large'' effect is just a few percentage points @@ -431,7 +431,7 @@ \subsection{Power calculations} very simple designs -- \texttt{power} and \texttt{clustersampsi} -- but they will not answer most of the practical questions that complex experimental designs require.\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Power_Calculations_in_Stata}} + \url{https://dimewiki.worldbank.org/Power_Calculations_in_Stata}} We suggest doing more advanced power calculations by simulation, since the interactions of experimental design, sampling and randomization, @@ -499,7 +499,7 @@ \subsection{Randomization inference} and it is interpretable as the probability that a program with no effect would have given you a result like the one actually observed. These randomization inference\sidenote{ - \url{https://dimewiki.worldbank.org/wiki/Randomization\_Inference}} + \url{https://dimewiki.worldbank.org/Randomization\_Inference}} significance levels may be very different than those given by asymptotic confidence intervals, particularly in small samples (up to several hundred clusters). From 6a7ac8f2a34b50954f0ba7f03c9d03257b988b53 Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 18 Feb 2020 16:55:54 -0500 Subject: [PATCH 715/854] Resolved #142 --- chapters/introduction.tex | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index 8f1e55435..2a78ed662 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -100,7 +100,8 @@ \section{Outline of this book} \section{Adopting reproducible workflows} We will provide free, open-source, and platform-agnostic tools wherever possible, and point to more detailed instructions when relevant. -Stata is the notable exception here due to its current popularity in development economics. +Stata is the notable exception here due to its current popularity in development economics.\sidenote{ +\url{https://aeadataeditor.github.io/presentation-20191211/\#9}} Most tools have a learning and adaptation process, meaning you will become most comfortable with each tool only by using it in real-world work. 
From 6429cb0827cdb3cdf2ec0837b28c69a9b19b6889 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Tue, 18 Feb 2020 17:41:53 -0500 Subject: [PATCH 716/854] general URL link hygiene --- README.md | 2 +- appendix/stata-guide.tex | 4 +-- bibliography.bib | 6 ++-- chapters/conclusion.tex | 16 ++++----- chapters/data-analysis.tex | 41 ++++++++++++----------- chapters/data-collection.tex | 8 ++--- chapters/handling-data.tex | 24 ++++++------- chapters/introduction.tex | 6 ++-- chapters/notes.tex | 2 +- chapters/planning-data-work.tex | 12 +++---- chapters/preamble.tex | 4 +-- chapters/publication.tex | 16 ++++----- chapters/research-design.tex | 18 +++++----- chapters/sampling-randomization-power.tex | 10 +++--- code/randtreat-strata.do | 2 +- code/replicability.do | 2 +- code/simple-multi-arm-randomization.do | 2 +- code/simple-sample.do | 2 +- mkdocs/docs/index.md | 2 +- 19 files changed, 90 insertions(+), 89 deletions(-) diff --git a/README.md b/README.md index 69cd923c4..dcae1a183 100644 --- a/README.md +++ b/README.md @@ -5,7 +5,7 @@ This book is intended to serve as an introduction to the primary tasks required in development research, from experimental design to data collection to data analysis to publication. It serves as a companion to the [DIME Wiki](https://dimewiki.worldbank.org) -and is produced by [DIME Analytics](http://www.worldbank.org/en/research/dime/data-and-analytics). +and is produced by [DIME Analytics](https://www.worldbank.org/en/research/dime/data-and-analytics). ## Contributing diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index db719fab7..d12efe55b 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -107,9 +107,9 @@ \subsection{Why we use a Stata style guide} Programming languages used in computer science always have style guides associated with them. 
Sometimes they are official guides that are universally agreed upon, such as PEP8 for -Python.\sidenote{\url{https://www.python.org/dev/peps/pep-0008/}} More commonly, there are well-recognized but +Python.\sidenote{\url{https://www.python.org/dev/peps/pep-0008}} More commonly, there are well-recognized but non-official style guides like the JavaScript Standard Style\sidenote{\url{https://standardjs.com/\#the-rules}} for -JavaScript or Hadley Wickham's style guide for R.\sidenote{\url{http://adv-r.had.co.nz/Style.html}} +JavaScript or Hadley Wickham's style guide for R.\sidenote{\url{https://style.tidyverse.org/syntax.html}} Google, for example, maintains style guides for all languages that are used in its projects.\sidenote{ \url{https://github.com/google/styleguide}} diff --git a/bibliography.bib b/bibliography.bib index 8c4ee42a1..b16167a3f 100644 --- a/bibliography.bib +++ b/bibliography.bib @@ -4,10 +4,10 @@ @Article{tidy-data journal = {The Journal of Statistical Software}, selected = {TRUE}, title = {Tidy data}, - url = {http://www.jstatsoft.org/v59/i10/}, + url = {https://www.jstatsoft.org/v59/i10/}, volume = {59}, year = {2014}, - bdsk-url-1 = {http://www.jstatsoft.org/v59/i10/} + bdsk-url-1 = {https://www.jstatsoft.org/v59/i10} } @book{glewwe2000designing, @@ -597,5 +597,5 @@ @MISC{pkg-geometry title = {The \texttt{geometry} package}, year = {2008}, month = dec, - howpublished = {\url{http://ctan.org/pkg/geometry}} + howpublished = {\url{https://ctan.org/pkg/geometry}} } diff --git a/chapters/conclusion.tex b/chapters/conclusion.tex index a11b01949..4ca46217f 100644 --- a/chapters/conclusion.tex +++ b/chapters/conclusion.tex @@ -1,5 +1,5 @@ We hope you have enjoyed \textit{Data for Development Impact: The DIME Analytics Resource Guide}. -Our aim was to teach you to handle data more efficiently, effectively, and ethically. +Our aim was to teach you to handle data more efficiently, effectively, and ethically. We laid out a complete vision of the tasks of a modern researcher, from planning a project's data governance to publishing code and data to accompany a research product. @@ -14,25 +14,25 @@ We then discussed the current research environment, which necessitates cooperation with a diverse group of collaborators using modern approaches to computing technology. -We outlined common research methods in impact evaluation, +We outlined common research methods in impact evaluation, with an eye toward structuring data work. We discussed how to implement reproducible routines for sampling and randomization, and to analyze statistical power and use randomization inference. We discussed the collection of primary data and methods of analysis using statistical software, as well as tools and practices for making this work publicly accessible. -Throughout, we emphasized that data work is a ``social process'', -involving multiple team members with different roles and technical abilities. +Throughout, we emphasized that data work is a ``social process'', +involving multiple team members with different roles and technical abilities. This mindset and workflow, from top to bottom, outline the tasks and responsibilities -that are fundamental to doing credible research. +that are fundamental to doing credible research. However, as you probably noticed, the text itself provides just enough detail to get you started: an understanding of the purpose and function of each of the core research steps. 
The references and resources get into the details -of how you will realistically implement these tasks: -from DIME Wiki pages detail specific code conventions +of how you will realistically implement these tasks: +from DIME Wiki pages detail specific code conventions and field procedures that our team considers best practices, to the theoretical papers that will help you figure out how to handle the unique cases you will undoubtedly encounter. @@ -41,4 +41,4 @@ and come back to it anytime you need more information. We wish you all the best in your work and will love to hear any input you have on ours!\sidenote{ -You can share your comments and suggestion on this book through \url{https://worldbank.github.io/d4di/}.} +You can share your comments and suggestion on this book through \url{https://worldbank.github.io/d4di}.} diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 60218ab51..d868afaee 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -178,7 +178,8 @@ \section{De-identifying research data} To facilitate the handling of the data, remove any personally identifiable information from the data set. This will create a de-identified data set, that can be saved in a non-encrypted folder. De-identification,\sidenote{\url{https://dimewiki.worldbank.org/De-identification}} -at this stage, means stripping the data set of direct identifiers.\sidenote{\url{ https://www.povertyactionlab.org/sites/default/files/resources/J-PAL-guide-to-deidentifying-data.pdf}} +at this stage, means stripping the data set of direct identifiers.\sidenote{\url{ +https://www.povertyactionlab.org/sites/default/files/resources/J-PAL-guide-to-deidentifying-data.pdf}} To be able to do so, you will need to go through your data set and find all the variables that contain identifying information. Flagging all potentially identifying variables in the questionnaire design stage @@ -190,7 +191,7 @@ \section{De-identifying research data} The World Bank's \texttt{sdcMicro} lists variables that uniquely identify observations, as well as allowing for more sophisticated disclosure risk calculations.\sidenote{ - \url{http://sdctools.github.io/sdcMicro/articles/sdcMicro.html}} + \url{https://sdctools.github.io/sdcMicro/articles/sdcMicro.html}} The \texttt{iefieldkit} command \texttt{iecodebook} lists all variables in a data set and exports an Excel sheet where you can easily select which variables to keep or drop.\sidenote{ @@ -546,10 +547,10 @@ \section{Writing data analysis code} Data analysis is the stage when research outputs are created. \index{data analysis} Many introductions to common code skills and analytical frameworks exist, such as -\textit{R for Data Science};\sidenote{\url{https://r4ds.had.co.nz/}} +\textit{R for Data Science};\sidenote{\url{https://r4ds.had.co.nz}} \textit{A Practical Introduction to Stata};\sidenote{\url{https://scholar.harvard.edu/files/mcgovern/files/practical\_introduction\_to\_stata.pdf}} \textit{Mostly Harmless Econometrics};\sidenote{\url{https://www.researchgate.net/publication/51992844\_Mostly\_Harmless\_Econometrics\_An\_Empiricist's\_Companion}} -and \textit{Causal Inference: The Mixtape}.\sidenote{\url{http://scunning.com/mixtape.html}} +and \textit{Causal Inference: The Mixtape}.\sidenote{\url{https://scunning.com/mixtape.html}} This section will not include instructions on how to conduct specific analyses. That is a research question, and requires expertise beyond the scope of this book. 
Instead, we will outline the structure of writing analysis code, @@ -622,15 +623,15 @@ \subsection{Visualizing data} but it is well worth reviewing the graphics manual.\sidenote{\url{https://www.stata.com/manuals/g.pdf}} For an easier way around it, Gray Kimbrough's \textit{Uncluttered Stata Graphs} code is an excellent default replacement for Stata graphics that is easy to install.\sidenote{ - \url{https://graykimbrough.github.io/uncluttered-stata-graphs/}} -If you are an R user, the \textit{R Graphics Cookbook}\sidenote{\url{https://r-graphics.org/}} + \url{https://graykimbrough.github.io/uncluttered-stata-graphs}} +If you are an R user, the \textit{R Graphics Cookbook}\sidenote{\url{https://r-graphics.org}} is a great resource for the its popular visualization package \texttt{ggplot}\sidenote{ - \url{https://ggplot2.tidyverse.org/}}. + \url{https://ggplot2.tidyverse.org}}. But there are a variety of other visualization packages, -such as \texttt{highcharter},\sidenote{\url{http://jkunst.com/highcharter/}} -\texttt{r2d3},\sidenote{\url{https://rstudio.github.io/r2d3/}} -\texttt{leaflet},\sidenote{\url{https://rstudio.github.io/leaflet/}} -and \texttt{plotly},\sidenote{\url{https://plot.ly/r/}} to name a few. +such as \texttt{highcharter},\sidenote{\url{http://jkunst.com/highcharter}} +\texttt{r2d3},\sidenote{\url{https://rstudio.github.io/r2d3}} +\texttt{leaflet},\sidenote{\url{https://rstudio.github.io/leaflet}} +and \texttt{plotly},\sidenote{\url{https://plot.ly/r}} to name a few. We have no intention of creating an exhaustive list, and this one is certainly missing very good references; but it is a good place to start. We attribute some of the difficulty of creating good data visualization @@ -639,8 +640,8 @@ \subsection{Visualizing data} you didn't have to go through many rounds of googling to understand a command. The trickiest part of using plot commands is to get the data in the right format. This is why we created the \textbf{Stata Visual Library}\sidenote{ - \url{https://worldbank.github.io/Stata-IE-Visual-Library/}}, -which has examples of graphs created in Stata and curated by us.\sidenote{A similar resource for R is \textit{The R Graph Gallery}. \\\url{https://www.r-graph-gallery.com/}} + \url{https://worldbank.github.io/Stata-IE-Visual-Library}}, +which has examples of graphs created in Stata and curated by us.\sidenote{A similar resource for R is \textit{The R Graph Gallery}. \\\url{https://www.r-graph-gallery.com}} The Stata Visual Library includes example data sets to use with each do-file, so you get a good sense of what your data should look like before you can start writing code to create a visualization. @@ -650,12 +651,12 @@ \subsection{Exporting analysis outputs} Our team has created a few products to automate common outputs and save you precious research time. The \texttt{ietoolkit} package includes two commands to export nicely formatted tables. -\texttt{iebaltab}\sidenote{\url{https://dimewiki.worldbank.org/Iebaltab}} +\texttt{iebaltab}\sidenote{\url{https://dimewiki.worldbank.org/iebaltab}} creates and exports balance tables to excel or {\LaTeX}. -\texttt{ieddtab}\sidenote{\url{https://dimewiki.worldbank.org/Ieddtab}} +\texttt{ieddtab}\sidenote{\url{https://dimewiki.worldbank.org/ieddtab}} does the same for difference-in-differences regressions. It also includes a command, \texttt{iegraph},\sidenote{ - \url{https://dimewiki.worldbank.org/Iegraph}} + \url{https://dimewiki.worldbank.org/iegraph}} to export pre-formatted impact evaluation results graphs. 
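To make this export step concrete, a minimal sketch in Stata might look like the following (this snippet is illustrative and not part of the book's code files). It uses Stata's built-in auto dataset; the output file names are placeholders, and since option names can change across \texttt{ietoolkit} versions, check \texttt{help iebaltab} before relying on them.

    * Illustrative sketch only -- output file names are placeholders
    sysuse auto, clear

    * Export a balance table across a grouping variable to LaTeX
    iebaltab price mpg weight, grpvar(foreign) savetex("balance_table.tex") replace

    * Export a formatted graph in an accessible, lightweight format
    histogram price, by(foreign)
    graph export "price_by_origin.png", replace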
It's okay to not export each and every table and graph created during exploratory analysis.
@@ -678,10 +679,10 @@ \subsection{Exporting analysis outputs}
 Copying results from a software console is risk-prone,
 even more inefficient, and totally unnecessary.
 There are numerous commands to export outputs from both R and Stata
 to a myriad of formats.\sidenote{
-	Some examples are \href{ http://repec.sowi.unibe.ch/stata/estout/}{\texttt{estout}}, \href{https://www.princeton.edu/~otorres/Outreg2.pdf}{\texttt{outreg2}},
-and \href{https://www.benjaminbdaniels.com/stata-code/outwrite/}{\texttt{outwrite}} in Stata,
-and \href{https://cran.r-project.org/web/packages/stargazer/vignettes/stargazer.pdf}{\texttt{stargazer}}
-and \href{https://ggplot2.tidyverse.org/reference/ggsave.html}{\texttt{ggsave}} in R.}
+	Some examples are \href{http://repec.sowi.unibe.ch/stata/estout}{\texttt{estout}}, \href{https://www.princeton.edu/~otorres/Outreg2.pdf}{\texttt{outreg2}},
+and \href{https://www.benjaminbdaniels.com/stata-code/outwrite}{\texttt{outwrite}} in Stata,
+and \href{https://cran.r-project.org/web/packages/stargazer/vignettes/stargazer.pdf}{\texttt{stargazer}}
+and \href{https://ggplot2.tidyverse.org/reference/ggsave.html}{\texttt{ggsave}} in R.}
 Save outputs in accessible and, whenever possible, lightweight formats.
 Accessible means that it's easy for other people to open them.
 In Stata, that would mean always using \texttt{graph export} to save images as
diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex
index 03a19e075..28feef630 100644
--- a/chapters/data-collection.tex
+++ b/chapters/data-collection.tex
@@ -273,7 +273,7 @@ \subsection{Developing a data collection instrument}
 The objective is to improve the structure and length of the questionnaire,
 refine the phrasing and translation of specific questions,
 and confirm coded response options are exhaustive.\sidenote{
-	\url{https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Content)&printable=yes}}
+	\url{https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Content)}}
 In addition, it is an opportunity to test and refine all survey protocols,
 such as how units will be sampled or pre-selected units identified.
 The pilot must be done out-of-sample,
@@ -409,7 +409,7 @@ \subsection{Programming electronic questionnaires}
 A second survey pilot should be done after the questionnaire is programmed.
 The objective of this \textbf{data-focused pilot}\sidenote{
-	\url{https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Data)&printable=yes}}
+	\url{https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Data)}}
 is to validate the programming and export a sample dataset.
 Significant desk-testing of the instrument is required
 to debug the programming as fully as possible before going to the field.
@@ -679,7 +679,7 @@ \subsection{Storing data securely}
 Instead, we use \textbf{symmetric encryption}\sidenote{
 \url{https://dimewiki.worldbank.org/Encryption\#Symmetric\_Encryption}}
 where we create a secure encrypted folder,
-using, for example, VeraCrypt.\sidenote{\url{https://www.veracrypt.fr/}}
+using, for example, VeraCrypt.\sidenote{\url{https://www.veracrypt.fr}}
 Here, a single key is used to both encrypt and decrypt the information. 
Since only one key is used, the workflow can be simplified: the re-encryption after decrypted access can be done automatically, @@ -713,7 +713,7 @@ \subsection{Storing data securely} \noindent This handling satisfies the \textbf{3-2-1 rule}: there are two on-site copies of the data and one off-site copy, so the data can never be lost in case of hardware failure.\sidenote{ - \url{https://www.backblaze.com/blog/the-3-2-1-backup-strategy/}} + \url{https://www.backblaze.com/blog/the-3-2-1-backup-strategy}} However, you still need to keep track of your encryption keys as without them your data is lost. If you remain lucky, you will never have to access your ``master'' or ``golden master'' copies -- you just want to know it is there, safe, if you need it. diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index cd041d2f9..51794cbc0 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -41,7 +41,7 @@ \section{Protecting confidence in development research} Major publishers and funders, most notably the American Economic Association, have taken steps to require that these research components are accurately reported and preserved as outputs in themselves.\sidenote{ - \url{https://www.aeaweb.org/journals/policies/data-code/}} + \url{https://www.aeaweb.org/journals/policies/data-code}} The empirical revolution in development research\cite{angrist2017economic} has therefore led to increased public scrutiny of the reliability of research.\cite{rogers_2017}\index{transparency}\index{credibility}\index{reproducibility} @@ -105,7 +105,7 @@ \subsection{Research transparency} Transparent research will expose not only the code, but all research processes involved in developing the analytical approach.\sidenote{ - \url{http://www.princeton.edu/~mjs3/open_and_reproducible_opr_2017.pdf}} + \url{https://www.princeton.edu/~mjs3/open_and_reproducible_opr_2017.pdf}} This means that readers are able to judge for themselves if the research was done well and the decision-making process was sound. If the research is well-structured, and all of the relevant documentation\sidenote{ @@ -124,7 +124,7 @@ \subsection{Research transparency} \url{https://blogs.worldbank.org/impactevaluations/registered-reports-piloting-pre-results-review-process-journal-development-economics}} can help with this process where they are available.\index{pre-registration}\index{pre-analysis plans}\index{Registered Reports} By pre-specifying a large portion of the research design,\sidenote{ - \url{https://www.bitss.org/2019/04/18/better-pre-analysis-plans-through-design-declaration-and-diagnosis/}} + \url{https://www.bitss.org/2019/04/18/better-pre-analysis-plans-through-design-declaration-and-diagnosis}} a great deal of analytical planning has already been completed, and at least some research questions are pre-committed for publication regardless of the outcome. This is meant to combat the ``file-drawer problem'',\cite{simonsohn2014p} @@ -157,7 +157,7 @@ \subsection{Research transparency} and there is nothing wrong with sensible adaptation so long as it is recorded and disclosed. There are various software solutions for building documentation over time. -The \textbf{Open Science Framework}\sidenote{\url{https://osf.io/}} provides one such solution,\index{Open Science Framework} +The \textbf{Open Science Framework}\sidenote{\url{https://osf.io}} provides one such solution,\index{Open Science Framework} with integrated file storage, version histories, and collaborative wiki pages. 
\textbf{GitHub}\sidenote{\url{https://github.com}} provides a transparent documentation system\sidenote{ \url{https://dimewiki.worldbank.org/Getting_started_with_GitHub}},\index{task management}\index{GitHub} @@ -184,11 +184,11 @@ \subsection{Research credibility} by fully specifying some set of analysis intended to be conducted. Regardless of whether or not a formal pre-analysis plan is utilized, all experimental and observational studies should be pre-registered -simply to create a record of the fact that the study was undertaken.\sidenote{\url{http://datacolada.org/12}} +simply to create a record of the fact that the study was undertaken.\sidenote{\url{https://datacolada.org/12}} This is increasingly required by publishers and can be done very quickly -using the \textbf{AEA} database,\sidenote{\url{https://www.socialscienceregistry.org/}} -the \textbf{3ie} database,\sidenote{\url{http://ridie.3ieimpact.org/}} -the \textbf{eGAP} database,\sidenote{\url{http://egap.org/content/registration/}} +using the \textbf{AEA} database,\sidenote{\url{https://www.socialscienceregistry.org}} +the \textbf{3ie} database,\sidenote{\url{https://ridie.3ieimpact.org}} +the \textbf{eGAP} database,\sidenote{\url{https://egap.org/content/registration}} or the \textbf{OSF} registry,\sidenote{\url{https://osf.io/registries}} as appropriate. \index{pre-registration} @@ -248,7 +248,7 @@ \section{Ensuring privacy and security in research data} even though these would not be considered PII in a larger context. There is no one-size-fits-all solution to determine what is PII, and you will have to use careful judgment in each case to decide which pieces of information fall into this category.\sidenote{ - \url{https://sdcpractice.readthedocs.io/en/latest/}} + \url{https://sdcpractice.readthedocs.io}} In all cases where this type of information is involved, you must make sure that you adhere to several core principles. @@ -258,7 +258,7 @@ \section{Ensuring privacy and security in research data} \url{https://www.hhs.gov/ohrp/regulations-and-policy/regulations/common-rule/index.html}} If you interact with European institutions or persons, you will also become familiar with ``GDPR'',\sidenote{ - \url{http://blogs.lshtm.ac.uk/library/2018/01/15/gdpr-for-research-data/}} + \url{http://blogs.lshtm.ac.uk/library/2018/01/15/gdpr-for-research-data}} a set of regulations governing \textbf{data ownership} and privacy standards.\sidenote{ \textbf{Data ownership:} the set of rights governing who may accesss, alter, use, or share data, regardless of who possesses it.} \index{data ownership} @@ -337,7 +337,7 @@ \subsection{Transmitting and storing data securely} Data sets that include confidential information \textit{must} therefore be \textbf{encrypted}\sidenote{ \textbf{Encryption:} Methods which ensure that files are unreadable even if laptops are stolen, databases are hacked, or any other type of unauthorized access is obtained. - \url{https://dimewiki.worldbank.org/encryption}} + \url{https://dimewiki.worldbank.org/Encryption}} during data collection, storage, and transfer.\index{encryption}\index{data transfer}\index{data storage} The biggest security gap is often in transmitting survey plans to and from staff in the field. 
To protect information in transit to field staff, some key steps are: @@ -405,7 +405,7 @@ \subsection{De-identifying and anonymizing information} such as Protecting Human Research Participants\sidenote{ \url{https://phrptraining.com}} or the CITI Program.\sidenote{ - \url{https://about.citiprogram.org/en/series/human-subjects-research-hsr/}} + \url{https://about.citiprogram.org/en/series/human-subjects-research-hsr}} In general, though, you shouldn't need to handle PII data very often once the data collection processes are completed. diff --git a/chapters/introduction.tex b/chapters/introduction.tex index 978a88da1..a55e7ba5b 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -36,9 +36,9 @@ \section{Doing credible research at scale} The team responsible for this book is known as \textbf{DIME Analytics}.\sidenote{ -\url{http://www.worldbank.org/en/research/dime/data-and-analytics}} +\url{https://www.worldbank.org/en/research/dime/data-and-analytics}} The DIME Analytics team works within the \textbf{Development Impact Evaluation (DIME)} Department\sidenote{ -\url{http://www.worldbank.org/en/research/dime}} +\url{https://www.worldbank.org/en/research/dime}} at the World Bank's \textbf{Development Economics (DEC) Vice Presidency}.\sidenote{ \url{https://www.worldbank.org/en/about/unit/unit-dec}} @@ -59,7 +59,7 @@ \section{Doing credible research at scale} has developed while supporting DIME's global impact evaluation portfolio. The \textbf{DIME Wiki} is one of our flagship products, a free online collection of our resources and best practices.\sidenote{ -\url{http://dimewiki.worldbank.org/}} +\url{https://dimewiki.worldbank.org}} This book complements the DIME Wiki by providing a structured narrative of the data workflow for a typical research project. We will not give a lot of highly specific details in this text, but we will point you to where they can be found.\sidenote{Like this: diff --git a/chapters/notes.tex b/chapters/notes.tex index 2a7f29008..bf8106dcf 100644 --- a/chapters/notes.tex +++ b/chapters/notes.tex @@ -27,7 +27,7 @@ \subsection{Feedback} We encourage feedback and corrections so that we can improve the contents of the book in future editions. Please visit -\url{https://worldbank.github.com/d4di/feedback/} to +\url{https://worldbank.github.com/d4di/feedback} to see different options on how to provide feedback. You can also email us at \url{dimeanalytics@worldbank.org} with input or comments, and we will be very thankful. diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index dd9b96fb2..6db5e6a20 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -90,7 +90,7 @@ \subsection{Setting up your computer} Follow the \textbf{3-2-1 rule}: maintain 3 copies of all original or irreplaceable data, on at least 2 different hardware devices you have access to, with 1 offsite storage method.\sidenote{ - \url{https://www.backblaze.com/blog/the-3-2-1-backup-strategy/}} + \url{https://www.backblaze.com/blog/the-3-2-1-backup-strategy}} One example of this setup is having one copy on your primary computer, one copy on an external hard drive stored in a safe place, and one copy in the cloud. 
@@ -261,7 +261,7 @@ \subsection{Choosing software} \url{https://www.rstudio.com}} For Stata, the built-in do-file editor is the most widely adopted code editor, but \textbf{Atom}\sidenote{\url{https://atom.io}} and -\textbf{Sublime}\sidenote{\url{https://www.sublimetext.com/}} +\textbf{Sublime}\sidenote{\url{https://www.sublimetext.com}} can also be configured to run Stata code externally, while offering great code accessibility and quality features. (We recommend setting up and becoming comfortable with one of these.) @@ -674,12 +674,12 @@ \subsection{Managing outputs} So dynamic documents can be great for creating appendices or quick documents with results as you work on them, but are not usually considered for final papers and reports. -RMarkdown\sidenote{\url{https://rmarkdown.rstudio.com/}} is the most widely adopted solution in R. +RMarkdown\sidenote{\url{https://rmarkdown.rstudio.com}} is the most widely adopted solution in R. There are also different options for Markdown in Stata, such as \texttt{markstat},\sidenote{\url{https://data.princeton.edu/stata/markdown}} -Stata 15 dynamic documents,\sidenote{\url{https://www.stata.com/new-in-stata/markdown/}} -\texttt{webdoc},\sidenote{\url{http://repec.sowi.unibe.ch/stata/webdoc/index.html}} and -\texttt{texdoc}.\sidenote{\url{http://repec.sowi.unibe.ch/stata/texdoc/index.html}} +Stata 15 dynamic documents,\sidenote{\url{https://www.stata.com/new-in-stata/markdown}} +\texttt{webdoc},\sidenote{\url{http://repec.sowi.unibe.ch/stata/webdoc}} and +\texttt{texdoc}.\sidenote{\url{http://repec.sowi.unibe.ch/stata/texdoc}} Whichever options you choose, agree with your team on what tools will be used for what outputs, and diff --git a/chapters/preamble.tex b/chapters/preamble.tex index 71c550d2e..5e805e43e 100644 --- a/chapters/preamble.tex +++ b/chapters/preamble.tex @@ -153,11 +153,11 @@ \bigskip\par\smallcaps{Published by \thanklesspublisher} -\par\smallcaps{\url{http://worldbank.github.com/d4di}} +\par\smallcaps{\url{https://worldbank.github.com/d4di}} \par Released under a Creative Commons Attribution 4.0 International (CC BY 4.0) license. -\url{https://creativecommons.org/licenses/by/4.0/} +\url{https://creativecommons.org/licenses/by/4.0} \par\textit{First printing, \monthyear} \end{fullwidth} diff --git a/chapters/publication.tex b/chapters/publication.tex index 0e650df6f..e2ecea952 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -85,11 +85,11 @@ \subsection{Preparing dynamic documents} Therefore this is a broadly unsuitable way to prepare technical documents. There are a number of tools that can be used for dynamic documents. -Some are code-based tools such as R's RMarkdown\sidenote{\url{https://rmarkdown.rstudio.com/}} +Some are code-based tools such as R's RMarkdown\sidenote{\url{https://rmarkdown.rstudio.com}} and Stata's \texttt{dyndoc}.\sidenote{\url{https://www.stata.com/manuals/rptdyndoc.pdf}} These tools ``knit'' or ``weave'' text and code together, and are programmed to insert code outputs in pre-specified locations. -Documents called ``notebooks'' (such as Jupyter\sidenote{\url{https://jupyter.org/}}) work similarly, +Documents called ``notebooks'' (such as Jupyter\sidenote{\url{https://jupyter.org}}) work similarly, as they also use the underlying analytical software to create the document. 
These types of dynamic documents are usually appropriate for short or informal materials because they tend to offer restricted editability outside the base software @@ -142,7 +142,7 @@ \subsection{Technical writing with \LaTeX} One of the most important tools available in \LaTeX\ is the BibTeX citation and bibliography manager.\sidenote{ - \url{http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.365.3194&rep=rep1&type=pdf}} + \url{https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.365.3194&rep=rep1&type=pdf}} BibTeX keeps all the references you might use in an auxiliary file, then references them using a simple element typed directly in the document: a \texttt{cite} command. The same principles that apply to figures and tables are therefore applied here: @@ -171,7 +171,7 @@ \subsection{Technical writing with \LaTeX} in a format you can manage and control.\cite{flom2005latex} Finally, \LaTeX\ has one more useful trick: using \textbf{\texttt{pandoc}},\sidenote{ - \url{http://pandoc.org/}} + \url{https://pandoc.org}} you can translate the raw document into Word (or a number of other formats) by running the following code from the command line: @@ -262,7 +262,7 @@ \section{Preparing a complete replication package} provide direct links to both the code and data used to create the results, and some even require being able to reproduce the results themselves before they will approve a paper for publication.\sidenote{ - \url{https://www.aeaweb.org/journals/policies/data-code/}} + \url{https://www.aeaweb.org/journals/policies/data-code}} If your material has been well-structured throughout the analytical process, this will only require a small amount of extra work; if not, paring it down to the ``replication package'' may take some time. @@ -308,7 +308,7 @@ \subsection{Publishing data for replication} even though data was collected on them. There are different options for data publication. The World Bank's Development Data Hub\sidenote{ - \url{https://data.worldbank.org/}} + \url{https://data.worldbank.org}} includes a Microdata Catalog\sidenote{ \url{https://microdata.worldbank.org}} where researchers can publish data and documentation for their projects.\sidenote{ @@ -330,13 +330,13 @@ \subsection{Publishing data for replication} in many cases you will have the right to release at least some subset of your constructed data set, even if it is just the derived indicators you constructed and their documentation.\sidenote{ - \url{https://guide-for-data-archivists.readthedocs.io/en/latest/}} + \url{https://guide-for-data-archivists.readthedocs.io}} If you have questions about your rights over original or derived materials, check with the legal team at your organization or at the data provider's. Make sure you have a clear understanding of the rights associated with the data release and communicate them to any future users of the data. 
You must provide a license with any data release.\sidenote{ - \url{https://iatistandard.org/en/guidance/preparing-organisation/organisation-data-publication/how-to-license-your-data/}} + \url{https://iatistandard.org/en/guidance/preparing-organisation/organisation-data-publication/how-to-license-your-data}} Some common license types are documented at the World Bank Data Catalog\sidenote{ \url{https://datacatalog.worldbank.org/public-licenses}} and the World Bank Open Data Policy has futher examples of licenses that are used there.\sidenote{ diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 5c94c70fd..6d2c67e8a 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -98,17 +98,17 @@ \subsection{Estimating treatment effects using control groups} \url{https://www.worldbank.org/en/programs/sief-trust-fund/publication/impact-evaluation-in-practice}} \textit{Causal Inference} and \textit{Causal Inference: The Mixtape} provides more detailed mathematical approaches to the tools.\sidenote{ - \url{https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/} + \url{https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book} \\ \noindent \url{http://scunning.com/cunningham_mixtape.pdf}} \textit{Mostly Harmless Econometrics} and \textit{Mastering Metrics} are excellent resources on the statistical principles behind all econometric approaches.\sidenote{ \url{https://www.researchgate.net/publication/51992844\_Mostly\_Harmless\_Econometrics\_An\_Empiricist's\_Companion} - \\ \noindent \url{http://assets.press.princeton.edu/chapters/s10363.pdf}} + \\ \noindent \url{https://assets.press.princeton.edu/chapters/s10363.pdf}} Intuitively, the problem is as follows: we can never observe the same unit in both their treated and untreated states simultaneously, so measuring and averaging these effects directly is impossible.\sidenote{ - \url{http://www.stat.columbia.edu/~cook/qr33.pdf}} + \url{https://www.stat.columbia.edu/~cook/qr33.pdf}} Instead, we typically make inferences from samples. \textbf{Causal inference} methods are those in which we are able to estimate the average treatment effect without observing individual-level effects, @@ -123,7 +123,7 @@ \subsection{Estimating treatment effects using control groups} exactly the parameter we are seeking to estimate. Therefore, almost all designs can be accurately described as a series of between-group comparisons.\sidenote{ - \url{http://nickchk.com/econ305.html}} + \url{https://nickchk.com/econ305.html}} Most of the methods that you will encounter rely on some variant of this strategy, which is designed to maximize their ability to estimate the effect @@ -171,7 +171,7 @@ \subsection{Experimental and quasi-experimental research designs} as evidenced by its broad credibility in fields ranging from clinical medicine to development. 
Therefore RCTs are very popular tools for determining the causal impact of specific programs or policy interventions.\sidenote{ - \url{https://www.nobelprize.org/prizes/economic-sciences/2019/ceremony-speech/}} + \url{https://www.nobelprize.org/prizes/economic-sciences/2019/ceremony-speech}} However, there are many other types of interventions that are impractical or unethical to effectively approach using an experimental strategy, and therefore there are limitations to accessing ``big questions'' @@ -355,7 +355,7 @@ \subsection{Difference-in-differences} it is important to create careful records during the first round so that follow-ups can be conducted with the same subjects, and attrition across rounds can be properly taken into account.\sidenote{ - \url{http://blogs.worldbank.org/impactevaluations/dealing-attrition-field-experiments}} + \url{https://blogs.worldbank.org/impactevaluations/dealing-attrition-field-experiments}} As with cross-sectional designs, difference-in-differences designs are widespread. Therefore there exist a large number of standardized tools for analysis. @@ -391,7 +391,7 @@ \subsection{Regression discontinuity} that serves as the sole determinant of access to the program, and a strict cutoff determines the value of this variable at which eligibility stops.\cite{imbens2008regression}\index{running variable} Common examples are test score thresholds and income thresholds.\sidenote{ - \url{http://blogs.worldbank.org/impactevaluations/regression-discontinuity-porn}} + \url{https://blogs.worldbank.org/impactevaluations/regression-discontinuity-porn}} The intuition is that individuals who are just above the threshold will be very nearly indistinguishable from those who are just under it, and their post-treatment outcomes are therefore directly comparable.\cite{lee2010regression} @@ -430,11 +430,11 @@ \subsection{Regression discontinuity} These presentations help to suggest both the functional form of the underlying relationship and the type of change observed at the discontinuity, and help to avoid pitfalls in modeling that are difficult to detect with hypothesis tests.\sidenote{ - \url{http://econ.lse.ac.uk/staff/spischke/ec533/RD.pdf}} + \url{https://econ.lse.ac.uk/staff/spischke/ec533/RD.pdf}} Because these designs are so flexible compared to others, there is an extensive set of commands that help assess the efficacy and results from these designs under various assumptions.\sidenote{ - \url{https://sites.google.com/site/rdpackages/}} + \url{https://sites.google.com/site/rdpackages}} These packages support the testing and reporting of robust plotting and estimation procedures, tests for manipulation of the running variable, diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index cdb22ac2a..90938510e 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -56,12 +56,12 @@ \section{Random processes in Stata} the conceptual process of assigning units to treatment arms, and the technical process of assigning random numbers in statistical software, which is a part of all tasks that include a random component.\sidenote{ - \url{https://blog.stata.com/2016/03/10/how-to-generate-random-numbers-in-stata/}} + \url{https://blog.stata.com/2016/03/10/how-to-generate-random-numbers-in-stata}} Randomization is challenging and its mechanics are unintuitive for the human brain. 
``True'' randomization is also nearly impossible to achieve for computers, which are inherently deterministic.\sidenote{ - \url{https://www.random.org/randomness/}} + \url{https://www.random.org/randomness}} For our purposes, we will focus on what you need to understand in order to produce truly random results for your project using Stata, and how you can make sure you can get those exact results again in the future. @@ -129,7 +129,7 @@ \subsection{Ensuring reproducibility in random Stata processes} \textbf{Seeding} means manually setting the start-point in the list of random numbers. The seed is a number that should be at least six digits long and you should use exactly one unique, different, and randomly created seed per randomization process.\sidenote{You -can draw a uniformly distributed six-digit seed randomly by visiting \url{http://bit.ly/stata-random}. +can draw a uniformly distributed six-digit seed randomly by visiting \url{https://bit.ly/stata-random}. (This link is a just shortcut to request such a random seed on \url{https://www.random.org}.) There are many more seeds possible but this is a large enough set for most purposes.} In Stata, \texttt{set seed [seed]} will set the generator to that start-point. In R, the \texttt{set.seed} function does the same. @@ -354,7 +354,7 @@ \subsection{Stratification} across different strata (such as ``sample/treat all female heads of household''). If this is done, you must calculate and record the exact probability of inclusion for every unit, and re-weight observations accordingly.\sidenote{ - \url{http://blogs.worldbank.org/impactevaluations/tools-of-the-trade-when-to-use-those-sample-weights}} + \url{https://blogs.worldbank.org/impactevaluations/tools-of-the-trade-when-to-use-those-sample-weights}} The exact formula depends on the analysis being performed, but is usually related to the inverse of the likelihood of inclusion. @@ -397,7 +397,7 @@ \subsection{Power calculations} will be able to detect the treatment effects you are interested in. This measure of \textbf{power} can be described in various different ways, each of which has different practical uses.\sidenote{ - \url{http://www.stat.columbia.edu/~gelman/stuff_for_blog/chap20.pdf}} + \url{https://www.stat.columbia.edu/~gelman/stuff_for_blog/chap20.pdf}} The purpose of power calculations is to identify where the strengths and weaknesses of your design are located, so you know the relative tradeoffs you will face by changing your randomization scheme for the final design. diff --git a/code/randtreat-strata.do b/code/randtreat-strata.do index c1d896b99..6697bd6aa 100644 --- a/code/randtreat-strata.do +++ b/code/randtreat-strata.do @@ -7,7 +7,7 @@ `r(version)' // Version sysuse bpwide.dta, clear // Load data isid patient, sort // Sort - set seed 796683 // Seed - drawn using http://bit.ly/stata-random + set seed 796683 // Seed - drawn using https://bit.ly/stata-random * Create strata indicator. The indicator is a categorical variable with * a different value for each unique combination of gender and age group. 
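The re-weighting logic for stratified samples described above can be sketched in a few lines of Stata. This snippet is illustrative and separate from the accompanying \texttt{.do} files: the variable names and inclusion probabilities are hypothetical, and the exact weighting formula should follow your analysis plan.

    * Illustrative sketch -- variable names and probabilities are hypothetical
    * Suppose half of urban units and a quarter of rural units were sampled
    generate prob_selected   = cond(urban == 1, 0.50, 0.25)

    * The sampling weight is the inverse of the probability of inclusion
    generate sampling_weight = 1 / prob_selected

    * Example use in estimation:
    * regress outcome treatment [pweight = sampling_weight]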
diff --git a/code/replicability.do b/code/replicability.do index 78d93b4ff..b398efa0f 100644 --- a/code/replicability.do +++ b/code/replicability.do @@ -8,7 +8,7 @@ * SORTING - sort on the uniquely identifying variable "make" isid make, sort -* SEEDING - Seed picked using http://bit.ly/stata-random +* SEEDING - Seed picked using https://bit.ly/stata-random set seed 287608 * Demonstrate stability after VERSIONING, SORTING and SEEDING diff --git a/code/simple-multi-arm-randomization.do b/code/simple-multi-arm-randomization.do index ea5e23f32..dc6dd450f 100644 --- a/code/simple-multi-arm-randomization.do +++ b/code/simple-multi-arm-randomization.do @@ -3,7 +3,7 @@ `r(version)' // Version sysuse bpwide.dta, clear // Load data isid patient, sort // Sort - set seed 654697 // Seed - drawn using http://bit.ly/stata-random + set seed 654697 // Seed - drawn using https://bit.ly/stata-random * Generate a random number and use it to sort the observation. Then * the order the observations are sorted in is random. diff --git a/code/simple-sample.do b/code/simple-sample.do index d34d85023..e4237b088 100644 --- a/code/simple-sample.do +++ b/code/simple-sample.do @@ -3,7 +3,7 @@ `r(version)' // Version sysuse bpwide.dta, clear // Load data isid patient, sort // Sort - set seed 215597 // Seed - drawn using http://bit.ly/stata-random + set seed 215597 // Seed - drawn using https://bit.ly/stata-random * Generate a random number and use it to sort the observation. Then * the order the observations are sorted in is random. diff --git a/mkdocs/docs/index.md b/mkdocs/docs/index.md index 06d52f394..9b5c1dd93 100644 --- a/mkdocs/docs/index.md +++ b/mkdocs/docs/index.md @@ -5,7 +5,7 @@ This book is intended to serve as an introduction to the primary tasks required in development research, from experimental design to data collection to data analysis to publication. It serves as a companion to the [DIME Wiki](https://dimewiki.worldbank.org) -and is produced by [DIME Analytics](http://www.worldbank.org/en/research/dime/data-and-analytics). +and is produced by [DIME Analytics](https://www.worldbank.org/en/research/dime/data-and-analytics). ### Full book in PDF-format for download From f44466903e7281f205477196aa5e740cb2a16cb3 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Tue, 18 Feb 2020 17:42:25 -0500 Subject: [PATCH 717/854] [ch5] two very similar links too close to each other --- chapters/data-collection.tex | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 28feef630..cab44a5b1 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -441,8 +441,7 @@ \subsection{Implementing high frequency quality checks} While data collection is ongoing, a research assistant or data analyst should work closely with the field team or partner to ensure that the data collection is progressing correctly, -and set up and perform \textbf{high-frequency checks (HFCs)} with the incoming data.\sidenote{ - \url{https://github.com/PovertyAction/high-frequency-checks/wiki}} +and set up and perform \textbf{high-frequency checks (HFCs)} with the incoming data. 
High-frequency checks (HFCs) should carefully inspect key treatment and outcome variables so that the data quality of core experimental variables is uniformly high, @@ -450,7 +449,7 @@ \subsection{Implementing high frequency quality checks} Data quality checks should be run on the data every time it is received from the field or partner to flag irregularities in the aquisition progress, in sample completeness, or in response quality. \texttt{ipacheck}\sidenote{ - \url{https://github.com/PovertyAction/high-frequency-checks}} + \url{https://github.com/PovertyAction/high-frequency-checks/wiki}} is a very useful command that automates some of these tasks, regardless of the source of the data. From e4888c8cddfe0ac892a0b993f341b610f4811253 Mon Sep 17 00:00:00 2001 From: Maria Date: Wed, 19 Feb 2020 16:03:07 -0500 Subject: [PATCH 718/854] Update introduction.tex Small edits (repeated words). Added subsection header for code examples. Removed bullet points about good code. --- chapters/introduction.tex | 37 ++++++++++++++++++------------------- 1 file changed, 18 insertions(+), 19 deletions(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index a55e7ba5b..1b77fe139 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -17,7 +17,7 @@ This book is targeted to everyone who interacts with development data: graduate students, research assistants, policymakers, and empirical researchers. -It covers data workflows at all stages of the research process: design, data acquisition, and analysis. +It covers data workflows at all stages of the research process, from design to data acquisition and analysis. This book is not sector-specific; it will not teach you econometrics, or how to design an impact evaluation. There are many excellent existing resources on those topics. Instead, this book will teach you how to think about all aspects of your research from a data perspective, @@ -37,9 +37,9 @@ \section{Doing credible research at scale} The team responsible for this book is known as \textbf{DIME Analytics}.\sidenote{ \url{https://www.worldbank.org/en/research/dime/data-and-analytics}} -The DIME Analytics team works within the \textbf{Development Impact Evaluation (DIME)} Department\sidenote{ +The DIME Analytics team is part of the \textbf{Development Impact Evaluation (DIME)} Department\sidenote{ \url{https://www.worldbank.org/en/research/dime}} -at the World Bank's \textbf{Development Economics (DEC) Vice Presidency}.\sidenote{ +within the World Bank's \textbf{Development Economics (DEC) Vice Presidency}.\sidenote{ \url{https://www.worldbank.org/en/about/unit/unit-dec}} DIME generates high-quality and operationally relevant data and research @@ -93,7 +93,7 @@ \section{Adopting reproducible workflows} in part because you will find tools for the more advanced practices; and most importantly because you will acquire the mindset of doing research with a high-quality data focus. We hope you will find this book helpful for accomplishing all of the above, -and that mastery of data helps you make an impact! +and that mastery of data helps you make an impact. \section{Writing reproducible code in a collaborative environment} @@ -104,15 +104,14 @@ \section{Writing reproducible code in a collaborative environment} rather it is part of the output itself: a means for communicating how something was done, in a world where the credibility and transparency of data cleaning and analysis is increasingly important. 
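As a small illustration of code that communicates how something was done, a cleaning step might be written like the sketch below. The variable name and threshold are hypothetical placeholders; the point is that comments and labels let a stranger follow the logic without guessing.

    * Illustrative sketch -- the variable and threshold are placeholders
    * Flag households below a (hypothetical) 2019 national poverty line
    generate poor = (consumption_pc < 1352) if !missing(consumption_pc)
    label variable poor "Household below 2019 national poverty line"

    * Check the flag before it is used downstream
    tabulate poor, missing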
As this is fundamental to the remainder of the book's content,
-we provide here a brief introduction to ``good'' code and standardized practices.
-``Good'' code has two elements:
-\begin{itemize}
-\item It is correct (doesn't produce any errors along the way)
-\item It is useful and comprehensible to someone who hasn't seen it before (or even yourself a few weeks, months or years later)
-\end{itemize}
+we provide here a brief introduction to \textbf{``good'' code} and \textbf{process standardization}.
+``Good'' code has two elements: (1) it is correct, i.e. it doesn't produce any errors,
+and (2) it is useful and comprehensible to someone who hasn't seen it before
+(or even yourself a few weeks, months or years later).
 
 Many researchers have been trained to code correctly.
-However, when your code runs on your computer and you get the correct results, you are only half-done writing \textit{good} code.
+However, when your code runs on your computer and you get the correct results,
+you are only half-done writing \textit{good} code.
 Good code is easy to read and replicate, making it easier to spot mistakes.
 Good code reduces sampling, randomization, and cleaning errors.
 Good code can easily be reviewed by others before it's published and replicated afterwards.
@@ -128,7 +127,7 @@ \section{Writing reproducible code in a collaborative environment}
 (3) modify it efficiently either to test alternative hypotheses or to adapt
 into their own work.\sidenote{\url{https://kbroman.org/Tools4RR/assets/lectures/07_clearcode.pdf}}
 
-To accomplish that, you should think of code in terms of three major elements:
+You should think of code in terms of three major elements:
 \textbf{structure}, \textbf{syntax}, and \textbf{style}.
 We always tell people to ``code as if a stranger would read it''
 (from tomorrow, that stranger could be you!).
@@ -145,6 +144,7 @@ \section{Writing reproducible code in a collaborative environment}
 Elements like spacing, indentation, and naming (or lack thereof) can make
 your code much more (or much less) accessible to someone
 who is reading it for the first time and needs to understand it quickly and correctly.
+\subsection{Code examples}
 For some implementation portions where precise code is particularly important,
 we will provide minimal code examples either in the book or on the DIME Wiki.
 All code guidance is software-agnostic, but code examples are provided in Stata.
@@ -169,11 +169,11 @@ \section{Outline of this book}
 This book covers each stage of an empirical research project, from design to publication.
+We start with ethical principles to guide empirical research,
+focusing on research transparency and the right to privacy.
 In Chapter 1, we outline a set of practices that help to ensure
 research participants are appropriately protected and
 research consumers can be confident in the conclusions reached.
-We start with ethical principles to guide empirical research,
-focusing on research transparency and the right to privacy.
 
 Chapter 2 will teach you to structure your data work to be efficient,
 collaborative and reproducible.
 It discusses the importance of planning data work at the outset of the research project --
 In Chapter 3, we turn to research design,
 focusing specifically on how to measure treatment effects
 and structure data for common experimental and quasi-experimental research methods. 
-We present outlines of common research designs for -causal inference, and consider their implications for data structure. +We provide an overview of research designs frequently used for +causal inference, and consider implications for data structure. Chapter 4 concerns sampling and randomization: how to implement both simple and complex designs reproducibly, and how to use power calculations and randomization inference to critically and quantitatively assess -sampling and randomization designs to make optimal choices when planning studies. +sampling and randomization to make optimal choices when planning studies. Chapter 5 covers data acquisition. We start with the legal and institutional frameworks for data ownership and licensing, dive in depth on collecting high-quality survey data, and finally discuss secure data handling during transfer, sharing, and storage. -It provides guidance on high-quality data collection -and handling for development projects. Chapter 6 teaches reproducible and transparent workflows for data processing and analysis, and provides guidance on de-identification of personally-identified data, focusing on how to organize data work so that it is easy to code the desired analysis. @@ -202,6 +200,7 @@ \section{Outline of this book} how to effectively collaborate on technical writing, how and why to publish data, and guidelines for preparing functional and informative replication packages. + We hope that by the end of the book, you will have learned how to handle data more efficiently, effectively and ethically at all stages of the research process. From 2d366d4318b06077fdd20989ca7bc3ef78a5a2f9 Mon Sep 17 00:00:00 2001 From: Maria Date: Wed, 19 Feb 2020 16:57:27 -0500 Subject: [PATCH 719/854] Update handling-data.tex Restructure of intro paragraph for 'protecting confidence in development research' section --- chapters/handling-data.tex | 31 +++++++++++++++---------------- 1 file changed, 15 insertions(+), 16 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 51794cbc0..aa8ccab9c 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -37,28 +37,27 @@ \section{Protecting confidence in development research} - -Major publishers and funders, most notably the American Economic Association, -have taken steps to require that these research components -are accurately reported and preserved as outputs in themselves.\sidenote{ - \url{https://www.aeaweb.org/journals/policies/data-code}} - The empirical revolution in development research\cite{angrist2017economic} has therefore led to increased public scrutiny of the reliability of research.\cite{rogers_2017}\index{transparency}\index{credibility}\index{reproducibility} Three major components make up this scrutiny: \textbf{reproducibility}\cite{duvendack2017meant}, \textbf{transparency},\cite{christensen2018transparency} and \textbf{credibility}.\cite{ioannidis2017power} Development researchers should take these concerns seriously. Many development research projects are purpose-built to address specific questions, and often use unique data or small samples. -This approach opens the door to working closely with the broader development community -to answer specific programmatic questions and general research inquiries. 
-However, almost by definition, -primary data that researchers use for such studies has never been reviewed by anyone else, -so it is hard for others to verify that it was collected, handled, and analyzed appropriately.\sidenote{ - \url{https://blogs.worldbank.org/impactevaluations/what-development-economists-talk-about-when-they-talk-about-reproducibility}} -Maintaining credibility in research via transparent and reproducibile methods -is key for researchers to avoid serious errors. -This is even more important in research using primary data, -and therefore these are not byproducts but core components of research output. +As a result, it is often the case that the data +researchers use for such studies has never been reviewed by anyone else, +so it is hard for others to verify that it was +collected, handled, and analyzed appropriately.\sidenote{ \url{https://blogs.worldbank.org/impactevaluations/what-development-economists-talk-about-when-they-talk-about-reproducibility}} + +Reproducible and transparent methods are key to maintaining credibility +and avoiding serious errors. +This is particularly true for research that relies on new data sources, +from innovative big data sources to surveys. +The field is slowly moving in the direction of requiring greater transparency. +Major publishers and funders, most notably the American Economic Association, +have taken steps to require that code and data +are accurately reported, cited, and preserved as outputs in themselves.\sidenote{ + \url{https://www.aeaweb.org/journals/policies/data-code}} + \subsection{Research reproducibility} From 77c65c88ed6f5444e578c7d90bff088385c0ccaa Mon Sep 17 00:00:00 2001 From: Maria Date: Wed, 19 Feb 2020 17:17:32 -0500 Subject: [PATCH 720/854] Update handling-data.tex re-write of research reproducibility subsection. removed reference that reproducibility and replicability are interchangeable. added bit on computational reproducibility. --- chapters/handling-data.tex | 34 ++++++++++++++++++++++------------ 1 file changed, 22 insertions(+), 12 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index aa8ccab9c..54c7ae3e3 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -46,7 +46,7 @@ \section{Protecting confidence in development research} As a result, it is often the case that the data researchers use for such studies has never been reviewed by anyone else, so it is hard for others to verify that it was -collected, handled, and analyzed appropriately.\sidenote{ \url{https://blogs.worldbank.org/impactevaluations/what-development-economists-talk-about-when-they-talk-about-reproducibility}} +collected, handled, and analyzed appropriately. Reproducible and transparent methods are key to maintaining credibility and avoiding serious errors. 
@@ -61,11 +61,20 @@ \section{Protecting confidence in development research} \subsection{Research reproducibility} -Reproducible research means that the actual analytical processes you used are executable by others.\cite{dafoe2014science} -(We use ``reproducible'' and ``replicable'' interchangeably in this book, -though there is much discussion about the use and definition of these concepts.\sidenote{ +Can another researcher reuse the same code on the same data +and get the exact same results as in your published paper?\sidenote{ \url{https://blogs.worldbank.org/impactevaluations/what-development-economists-talk-about-when-they-talk-about-reproducibility}} +This is a standard known as \textbf{computational reproducibility}, +and it is an increasingly common requirement for publication.\sidenote{ \url{https://www.nap.edu/resource/25303/R&R.pdf}}) -All your code files involving data cleaning, construction and analysis +It is best practice to verify computational reproducibility before submitting a paper before publication. +This should be done by someone who is not on your research team, on a different computer, +using exactly the package of code and data files you plan to submit with your paper. +Code that is well-organized into a master do-file, and written to be easily run by others, +makes this task simpler. +The next chapter discusses organization of data work in detail. + +For research to be reproducible, +all code files for data cleaning, construction and analysis should be public, unless they contain identifying information. Nobody should have to guess what exactly comprises a given index, or what controls are included in your main regression, @@ -77,14 +86,8 @@ \subsection{Research reproducibility} is a great way to have new questions asked and answered based on the valuable work you have already done.\sidenote{ \url{https://blogs.worldbank.org/opendata/making-analytics-reusable}} -Services that log your research process are valuable resources here -- -Such services can show things like modifications made in response to referee comments, -by having tagged version histories at each major revision. -They also allow you to use issue trackers -to document the research paths and questions you may have tried to answer -as a resource to others who have similar questions. -Secondly, reproducible research\sidenote{ +Making your research reproducible is also a public good: \sidenote{ \url{https://dimewiki.worldbank.org/Reproducible_Research}} enables other researchers to re-use your code and processes to do their own work more easily and effectively in the future. @@ -172,6 +175,13 @@ \subsection{Research transparency} Email, however, is \textit{not} a note-taking service, because communications are rarely well-ordered, can be easily deleted, and are not available for future team members. +Services that log your research process are valuable resources here -- +Such services can show things like modifications made in response to referee comments, +by having tagged version histories at each major revision. +They also allow you to use issue trackers +to document the research paths and questions you may have tried to answer +as a resource to others who have similar questions. 
+ \subsection{Research credibility} The credibility of research is traditionally a function of design choices.\cite{angrist2010credibility,ioannidis2005most} From 047a8c10348129b300493ccdfd61be78d1cff566 Mon Sep 17 00:00:00 2001 From: Maria Date: Wed, 19 Feb 2020 17:30:07 -0500 Subject: [PATCH 721/854] Update handling-data.tex small edits to transparency and credibility sections --- chapters/handling-data.tex | 40 ++++++++++++++++++-------------------- 1 file changed, 19 insertions(+), 21 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 54c7ae3e3..53e8d9cb3 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -168,19 +168,17 @@ \subsection{Research transparency} to record the decision process leading to changes and additions, track and register discussions, and manage tasks. These are flexible tools that can be adapted to different team and project dynamics. -Each project has specific requirements for data, code, and documentation management, -and the exact shape of this process can be molded to the team's needs, -but it should be agreed upon prior to project launch. -This way, you can start building a project's documentation as soon as you start making decisions. -Email, however, is \textit{not} a note-taking service, because communications are rarely well-ordered, -can be easily deleted, and are not available for future team members. - -Services that log your research process are valuable resources here -- -Such services can show things like modifications made in response to referee comments, +Services that log your research process can show things like modifications made in response to referee comments, by having tagged version histories at each major revision. They also allow you to use issue trackers to document the research paths and questions you may have tried to answer as a resource to others who have similar questions. +Each project has specific requirements for data, code, and documentation management, +and the exact transparency tools to use will depend on the team's needs, +but they should be agreed upon prior to project launch. +This way, you can start building a project's documentation as soon as you start making decisions. +Email, however, is \textit{not} a note-taking service, because communications are rarely well-ordered, +can be easily deleted, and are not available for future team members. \subsection{Research credibility} @@ -237,7 +235,7 @@ \subsection{Research credibility} \section{Ensuring privacy and security in research data} -Anytime you are collecting primary data in a development research project, +Anytime you are working with raw data in a development research project, you are almost certainly handling data that include \textbf{personally-identifying information (PII)}.\index{personally-identifying information}\index{primary data}\sidenote{ \textbf{Personally-identifying information:} any piece or set of information @@ -248,13 +246,14 @@ \section{Ensuring privacy and security in research data} \index{data collection} This includes names, addresses, and geolocations, and extends to personal information such as email addresses, phone numbers, and financial information.\index{geodata}\index{de-identification} -It is important to keep in mind these data privacy principles not only for the individual respondent but also the PII data of their household members or other caregivers who are covered under the survey. 
+It is important to keep in mind data privacy principles not only for the respondent +but also the PII data of their household members or other individuals who are covered under the survey. \index{privacy} In some contexts this list may be more extensive -- for example, if you are working in an environment that is either small, specific, or has extensive linkable data sources available to others, information like someone's age and gender may be sufficient to identify them -even though these would not be considered PII in a larger context. +even though these would not be considered PII in general. There is no one-size-fits-all solution to determine what is PII, and you will have to use careful judgment in each case to decide which pieces of information fall into this category.\sidenote{ \url{https://sdcpractice.readthedocs.io}} @@ -266,7 +265,7 @@ \section{Ensuring privacy and security in research data} with a set of governance standards known as ``The Common Rule''.\sidenote{ \url{https://www.hhs.gov/ohrp/regulations-and-policy/regulations/common-rule/index.html}} If you interact with European institutions or persons, -you will also become familiar with ``GDPR'',\sidenote{ +you will also become familiar with the General Data Protection Regulation ``GDPR'',\sidenote{ \url{http://blogs.lshtm.ac.uk/library/2018/01/15/gdpr-for-research-data}} a set of regulations governing \textbf{data ownership} and privacy standards.\sidenote{ \textbf{Data ownership:} the set of rights governing who may accesss, alter, use, or share data, regardless of who possesses it.} @@ -293,19 +292,19 @@ \subsection{Obtaining ethical approval and consent} Most commonly this consists of a formal application for approval of a specific protocol for consent, data collection, and data handling.\sidenote{ \url{https://dimewiki.worldbank.org/IRB_Approval}} -An IRB which has sole authority over your project is not always apparent, +Which IRB has sole authority over your project is not always apparent, particularly if some institutions do not have their own. It is customary to obtain an approval from a university IRB where at least one PI is affiliated, and if work is being done in an international setting, approval is often also required -from an appropriate institution subject to local law. +from an appropriate local institution subject to the laws of the country. One primary consideration of IRBs is the protection of the people about whom information is being collected and whose lives may be affected by the research design. Some jurisdictions (especially those responsible to EU law) view all personal data -as being intrinsically owned by the persons who they describe. +as intrinsically owned by the persons who they describe. This means that those persons have the right to refuse to participate in data collection before it happens, as it is happening, or after it has already happened. It also means that they must explicitly and affirmatively consent @@ -317,14 +316,13 @@ \subsection{Obtaining ethical approval and consent} such as minors, prisoners, and people with disabilities, and these should be confirmed with relevant authorities if your research includes them. -Make sure you have significant advance timing with your IRB submissions. -You may not begin data collection until approval is in place, -and IRBs may have infrequent meeting schedules +IRB approval should be obtained well before any data is acquired. +IRBs may have infrequent meeting schedules or require several rounds of review for an application to be approved. 
If there are any deviations from an approved plan or expected adjustments, -report these as early as you can so that you can update or revise the protocol. +report these as early as possible so that you can update or revise the protocol. Particularly at universities, IRBs have the power to retroactively deny -the right to use data which was not collected in accordance with an approved plan. +the right to use data which was not acquired in accordance with an approved plan. This is extremely rare, but shows the seriousness of these considerations since the institution itself may face legal penalties if its IRB is unable to enforce them. As always, as long as you work in good faith, From 7ea682450649f5942cb1d9c660aef302f91f5c21 Mon Sep 17 00:00:00 2001 From: Maria Date: Wed, 19 Feb 2020 17:30:43 -0500 Subject: [PATCH 722/854] Update handling-data.tex remove detailed text on securely transferring files to field. this should be re-added to chaper 5. --- chapters/handling-data.tex | 7 ------- 1 file changed, 7 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 53e8d9cb3..cb9d36a66 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -346,13 +346,6 @@ \subsection{Transmitting and storing data securely} \textbf{Encryption:} Methods which ensure that files are unreadable even if laptops are stolen, databases are hacked, or any other type of unauthorized access is obtained. \url{https://dimewiki.worldbank.org/Encryption}} during data collection, storage, and transfer.\index{encryption}\index{data transfer}\index{data storage} -The biggest security gap is often in transmitting survey plans to and from staff in the field. -To protect information in transit to field staff, some key steps are: -(a) ensure that all devices that store confidential data -have hard drive encryption and password-protection; -(b) never send confidential data over email, WhatsApp, or other chat services. -without encrypting the information first; and -(c) train all field staff on the adequate privacy standards applicable to their work. Most modern data collection software has features that, if enabled, make secure transmission straightforward.\sidenote{ From 4267d7459261d54c78a8daced594a1a63dee70f4 Mon Sep 17 00:00:00 2001 From: Maria Date: Wed, 19 Feb 2020 17:34:23 -0500 Subject: [PATCH 723/854] Update handling-data.tex remove detailed text on securely transferring files to field. this should be re-added to chaper 5. --- chapters/handling-data.tex | 14 ++++++-------- 1 file changed, 6 insertions(+), 8 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index cb9d36a66..deb47f99a 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -365,27 +365,25 @@ \subsection{Transmitting and storing data securely} The easiest way to reduce the risk of leaking personal information is to use it as rarely as possible. It is often very simple to conduct planning and analytical work -using a subset of the data that has anonymous identifying ID variables, -and has had personal characteristics removed from the dataset altogether. +using a subset of the data that has been \textbf{de-identified}. We encourage this approach, because it is easy. However, when PII is absolutely necessary to a task, such as implementing an intervention or submitting survey data, -you must actively protect those materials in transmission and storage. +you must actively protect that data in transmission and storage. 
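The de-identified working subset recommended above can usually be produced with a few lines of code, run once, early on; everything that does not need identifiers then works from the resulting file. A minimal sketch (all variable and folder names here are hypothetical):

    * Sketch: create a de-identified working copy -- variable and path names are hypothetical
    use "${encrypted}/raw/baseline-household.dta", clear
    drop respondent_name phone_number address gps_latitude gps_longitude
    save "${dataWork}/data/intermediate/baseline-household-deidentified.dta", replace

For the data that must remain identified, though, protection is the priority.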
There are plenty of options available to keep your data safe, at different prices, from enterprise-grade solutions to free software. It may be sufficient to hold identifying information in an encrypted service, or you may need to encrypt information at the file level using a special tool. (This is in contrast to using software or services with disk-level or service-level encryption.) -Data security is important not only for identifying, but also sensitive information, +Data security is important not only for identifying, but all confidential information, especially when a worst-case scenario could potentially lead to re-identifying subjects. -Extremely sensitive information may be required to be held in a ``cold'' machine +Extremely confidential information may be required to be held in a ``cold'' machine which does not have internet access -- this is most often the case with government records such as granular tax information. -Each of these tools and requirements will vary in level of security and ease of use, -and sticking to a standard practice will make your life much easier, -so agreeing on a protocol from the start of a project is ideal. +What data security protocols you employ will depend on project needs and data sources, +but agreeing on a protocol from the start of a project will make your life easier. Finally, having an end-of-life plan for data is essential: you should always know how to transfer access and control to a new person if the team changes, and what the expiry of the data and the planned deletion processes are. From 53ec953420e12ae2382dd67f4591083ddc3c8ee1 Mon Sep 17 00:00:00 2001 From: Maria Date: Wed, 19 Feb 2020 17:38:29 -0500 Subject: [PATCH 724/854] Update handling-data.tex small edits to de-identification section --- chapters/handling-data.tex | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index deb47f99a..834f15b03 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -388,7 +388,7 @@ \subsection{Transmitting and storing data securely} you should always know how to transfer access and control to a new person if the team changes, and what the expiry of the data and the planned deletion processes are. -\subsection{De-identifying and anonymizing information} +\subsection{De-identifying data} Most of the field research done in development involves human subjects.\sidenote{ \url{https://dimewiki.worldbank.org/Human_Subjects_Approval}} @@ -412,7 +412,9 @@ \subsection{De-identifying and anonymizing information} Second, avoid the proliferation of copies of identified data. There should only be one raw identified dataset copy and it should be somewhere where only approved people can access it. -Finally, not everyone on the research team needs access to identified data. +Even within the research team, +access to PII data should be limited to team members who require it for specific analysis +(most analysis will not depend on PII). Analysis that requires PII data is rare and can be avoided by properly linking identifiers to research information such as treatment statuses and weights, then removing identifiers. @@ -428,9 +430,11 @@ \subsection{De-identifying and anonymizing information} -- even if that data has had all directly identifying information removed -- by using some other data that becomes identifying when analyzed together. For this reason, we recommend de-identification in two stages. 
-The \textbf{initial de-identification} process strips the data of direct identifiers +The \textbf{initial de-identification} process strips the data of direct identifiers +as early in the process as possible, to create a working de-identified dataset that can be shared \textit{within the research team} without the need for encryption. +This simplifies workflows. The \textbf{final de-identification} process involves making a decision about the trade-off between risk of disclosure and utility of the data From 8505ad4304e31ee7467889ede20ff7eb5fa78402 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 19 Feb 2020 18:03:47 -0500 Subject: [PATCH 725/854] Large and small Fixes #339 --- bibliography.bib | 9 +++++++++ chapters/sampling-randomization-power.tex | 9 +++++++++ 2 files changed, 18 insertions(+) diff --git a/bibliography.bib b/bibliography.bib index b16167a3f..754fb469a 100644 --- a/bibliography.bib +++ b/bibliography.bib @@ -17,6 +17,15 @@ @book{glewwe2000designing publisher={World Bank} } +@incollection{schwarzer2015small, + title={Small-study effects in meta-analysis}, + author={Schwarzer, Guido and Carpenter, James R and R{\"u}cker, Gerta}, + booktitle={Meta-analysis with R}, + pages={107--141}, + year={2015}, + publisher={Springer} +} + @article{king2019propensity, title={Why propensity scores should not be used for matching}, author={King, Gary and Nielsen, Richard}, diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 90938510e..62b962a59 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -479,10 +479,19 @@ \subsection{Randomization inference} in quasi-experimental designs and in small samples, because these conditions usually lead to the situation where the number of possible \textit{randomizations} is itself small. +It is difficult to draw a firm line on when a study is ``large enough'', +but even when a treatment effect is real, +it can take ten or one hundred times more sample size +than the minimum required power +to ensure that the average statistically significant estimate +of the treatment effect is accurate.\cite{schwarzer2015small} In those cases, we cannot rely on the usual assertion (a consequence of the Central Limit Theorem) that the variance of the treatment effect estimate is normal, and we therefore cannot use the ``asymptotic'' standard errors from Stata. +We also don't need to: these methods were developed when it was very costly +to compute large numbers of combinations using computers, +and in most cases we now have the ability to do these calculations reasonably quickly. Instead, we directly simulate a large variety of possible alternative randomizations. Specifically, the user-written \texttt{ritest} command\sidenote{ From 32e4bc87d97e3ea182588171d8e2bbc4c309e8c3 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 19 Feb 2020 18:06:14 -0500 Subject: [PATCH 726/854] Add JPAL resource link Address #301 --- chapters/publication.tex | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index e2ecea952..32993eb11 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -298,9 +298,10 @@ \subsection{Publishing data for replication} They can also provide for timed future releases of datasets once the need for exclusive access has ended. 
-If your project collected primary data, +If your project collected original data, releasing the cleaned dataset is a significant contribution that can be made -in addition to any publication of analysis results. +in addition to any publication of analysis results.\sidenote{ + \url{https://www.povertyactionlab.org/sites/default/files/resources/J-PAL-guide-to-publishing-research-data.pdf}} Publishing data can foster collaboration with researchers interested in the same subjects as your team. Collaboration can enable your team to fully explore variables and From 502738a3e8fb7aa9d5c15673ac87f0b18bb9e34c Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 19 Feb 2020 18:09:26 -0500 Subject: [PATCH 727/854] Add questionnaire link Fix #296 --- chapters/data-collection.tex | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index cab44a5b1..57b7cd5cc 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -230,7 +230,8 @@ \subsection{Developing a data collection instrument} since it is difficult to work backwards from the survey program to the intended concepts. The workflow for designing a questionnaire will feel much like writing an essay, or writing pseudocode: -begin from broad concepts and slowly flesh out the specifics. +begin from broad concepts and slowly flesh out the specifics.\sidenote{ + \url{https://iriss.stanford.edu/sites/g/files/sbiybj6196/f/questionnaire_design_1.pdf}} It is essential to start with a clear understanding of the \textbf{theory of change}\sidenote{ \url{https://dimewiki.worldbank.org/Theory_of_Change}} From 66cec1e1d7344bdc44b3d18f455b815f6c8b0c3c Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 19 Feb 2020 18:13:06 -0500 Subject: [PATCH 728/854] Add Dropbox security reference Address #186 --- chapters/data-collection.tex | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 57b7cd5cc..6a5d8e2ee 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -599,7 +599,9 @@ \section{Collecting and sharing data securely} the service provider needs to keep a copy of the password or key. Since it unlikely that that software provider is included in your IRB, this is not secure enough. - +This includes all file syncing software such as Dropbox, +who, in addition to being able to view your data, may require +you to give them additional legal usage rights in order to host it on their servers. It is possible, in some enterprise versions of data sharing software, to set up appropriately secure on-the-fly encryption. However, that setup is advanced, and you should never trust it From 2adf384887cf93bb2fa2855c7634acf50c94e59e Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 19 Feb 2020 18:18:50 -0500 Subject: [PATCH 729/854] Add Stata guides --- appendix/stata-guide.tex | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index d12efe55b..f6f3bf733 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -63,6 +63,9 @@ \subsection{Understanding Stata code} Whether you are new to Stata or have used it for decades, you will always run into commands that you have not seen before or whose function you do not remember. 
+(Whether you are new or not, you should frequently revisit the most common commands -- +often you will learn they can do something you never realized.\sidenote{ + \url{https://www.stata.com/manuals13/u27.pdf}}) Every time that happens, you should always look up the help file for that command. We often encounter the conception that help files are only for beginners. @@ -115,6 +118,9 @@ \subsection{Why we use a Stata style guide} \url{https://github.com/google/styleguide}} Aesthetics is an important part of style guides, but not the main point. +Neither is telling you which commands to use: +there are plenty of guides to Stata's extensive functionality.\sidenote{ + \url{https://scholar.harvard.edu/files/mcgovern/files/practical_introduction_to_stata.pdf}} The important function is to allow programmers who are likely to work together to share conventions and understandings of what the code is doing. Style guides therefore help improve the quality of the code From 456e62067b6d7a0595f0747e477758485b84a14f Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 20 Feb 2020 11:01:35 -0500 Subject: [PATCH 730/854] fix links listed in #383 --- bibliography.bib | 2 +- chapters/planning-data-work.tex | 2 +- chapters/research-design.tex | 4 ++-- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/bibliography.bib b/bibliography.bib index 754fb469a..3c611addd 100644 --- a/bibliography.bib +++ b/bibliography.bib @@ -66,7 +66,7 @@ @inbook {stock2005weak publisher = {Cambridge University Press}, organization = {Cambridge University Press}, address = {New York}, - url = {http://www.economics.harvard.edu/faculty/stock/files/TestingWeakInstr_Stock\%2BYogo.pdf}, + url = {https://scholar.harvard.edu/files/stock/files/testing\_for\_weak\_instruments\_in\_linear\_iv\s_regression.pdf}, author = {James Stock and Motohiro Yogo}, editor = {Donald W.K. Andrews} } diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 6db5e6a20..fbe2df498 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -447,7 +447,7 @@ \subsection{Organizing files and folder structures} and reduce the amount of time others will spend opening files to find out what is inside them. The main point to be considered is that files accessed by code have special naming requirements\sidenote{ - \url{http://www2.stat.duke.edu/~rcs46/lectures_2015/01-markdown-Git/slides/naming-slides/naming-slides.pdf}}, + \url{https://www2.stat.duke.edu/~rcs46/lectures\_2015/01-markdown-git/slides/naming-slides/naming-slides.pdf}}, since different software and operating systems read file names in different ways. Some of the differences between the two naming approaches are major and may be new to you, so below are a few examples. 
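For instance (the file names below are invented for the example), code-compatible names avoid spaces, capital letters, and special characters, and order their components from general to specific so that related files sort together:

    * Sketch: code-compatible file naming -- all names and paths are hypothetical
    *   avoid  : "Baseline Household Survey (Final).dta"    <- spaces, capitals, parentheses
    *   prefer : baseline-household-deidentified.dta        <- lowercase, hyphenated, general to specific
    use "${dataWork}/data/intermediate/baseline-household-deidentified.dta", clear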
diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 6d2c67e8a..92a7f3e16 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -426,7 +426,7 @@ \subsection{Regression discontinuity} In the analytical stage, regression discontinuity designs often include a large component of visual evidence presentation.\sidenote{ - \url{http://faculty.smu.edu/kyler/courses/7312/presentations/baumer/Baumer\_RD.pdf}} + \url{https://faculty.smu.edu/kyler/courses/7312/presentations/baumer/Baumer\_RD.pdf}} These presentations help to suggest both the functional form of the underlying relationship and the type of change observed at the discontinuity, and help to avoid pitfalls in modeling that are difficult to detect with hypothesis tests.\sidenote{ @@ -465,7 +465,7 @@ \subsection{Instrumental variables} is similar to either cross-sectional or difference-in-differences designs. However, instead of controlling for the instrument directly, the IV approach typically uses the \textbf{two-stage-least-squares (2SLS)} estimator.\sidenote{ - \url{http://www.nuff.ox.ac.uk/teaching/economics/bond/IV\%20Estimation\%20Using\%20Stata.pdf}} + \url{https://www.nuff.ox.ac.uk/teaching/economics/bond/IV\%20Estimation\%20Using\%20Stata.pdf}} This estimator forms a prediction of the probability that the unit receives treatment based on a regression against the instrumental variable. That prediction will, by assumption, be the portion of the actual treatment From ca5ec8a8c8302f26e48db6ee9b6b678af781311f Mon Sep 17 00:00:00 2001 From: Maria Date: Thu, 20 Feb 2020 11:04:18 -0500 Subject: [PATCH 731/854] Update chapters/handling-data.tex Co-Authored-By: Benjamin Daniels --- chapters/handling-data.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 834f15b03..fa9607abd 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -87,7 +87,7 @@ \subsection{Research reproducibility} based on the valuable work you have already done.\sidenote{ \url{https://blogs.worldbank.org/opendata/making-analytics-reusable}} -Making your research reproducible is also a public good: \sidenote{ +Making your research reproducible is also a public good.\sidenote{ \url{https://dimewiki.worldbank.org/Reproducible_Research}} enables other researchers to re-use your code and processes to do their own work more easily and effectively in the future. From add34083f0ca6271428f7a385564dcde5bd748c5 Mon Sep 17 00:00:00 2001 From: Maria Date: Thu, 20 Feb 2020 11:04:32 -0500 Subject: [PATCH 732/854] Update chapters/handling-data.tex Co-Authored-By: Benjamin Daniels --- chapters/handling-data.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index fa9607abd..4b85c54ff 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -89,7 +89,7 @@ \subsection{Research reproducibility} Making your research reproducible is also a public good.\sidenote{ \url{https://dimewiki.worldbank.org/Reproducible_Research}} -enables other researchers to re-use your code and processes +It enables other researchers to re-use your code and processes to do their own work more easily and effectively in the future. This may mean applying your techniques to their data or implementing a similar structure in a different context. 
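One inexpensive way to make this kind of re-use possible is to write repeated tasks as small, self-contained programs instead of leaving the same logic scattered through a long script. A minimal sketch -- the program below is purely illustrative, not an existing package, and it simply top-codes a variable at its own 99th percentile:

    * Sketch: a small re-usable program (illustrative only)
    capture program drop topcode99
    program define topcode99
        syntax varname
        quietly summarize `varlist', detail
        * Replace values above the 99th percentile with the 99th percentile
        replace `varlist' = r(p99) if `varlist' > r(p99) & !missing(`varlist')
    end

    * Example use, with Stata's built-in example data
    sysuse auto, clear
    topcode99 price

Packaging logic this way also makes each step self-documenting: anyone re-using the code can see exactly what the transformation does.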
From 73c115aa485107f4c80ce21cd8f5a3e3e42bbe40 Mon Sep 17 00:00:00 2001 From: Maria Date: Thu, 20 Feb 2020 11:04:54 -0500 Subject: [PATCH 733/854] Update chapters/handling-data.tex Co-Authored-By: Benjamin Daniels --- chapters/handling-data.tex | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 4b85c54ff..068455de6 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -62,7 +62,8 @@ \section{Protecting confidence in development research} \subsection{Research reproducibility} Can another researcher reuse the same code on the same data -and get the exact same results as in your published paper?\sidenote{ \url{https://blogs.worldbank.org/impactevaluations/what-development-economists-talk-about-when-they-talk-about-reproducibility}} +and get the exact same results as in your published paper?\sidenote{ + \url{https://blogs.worldbank.org/impactevaluations/what-development-economists-talk-about-when-they-talk-about-reproducibility}} This is a standard known as \textbf{computational reproducibility}, and it is an increasingly common requirement for publication.\sidenote{ \url{https://www.nap.edu/resource/25303/R&R.pdf}}) From 51694e8f8946199a899f09f2e9b58752027f180c Mon Sep 17 00:00:00 2001 From: Maria Date: Thu, 20 Feb 2020 11:05:07 -0500 Subject: [PATCH 734/854] Update chapters/handling-data.tex Co-Authored-By: Benjamin Daniels --- chapters/handling-data.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 068455de6..144413fa3 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -236,7 +236,7 @@ \subsection{Research credibility} \section{Ensuring privacy and security in research data} -Anytime you are working with raw data in a development research project, +Anytime you are working with original data in a development research project, you are almost certainly handling data that include \textbf{personally-identifying information (PII)}.\index{personally-identifying information}\index{primary data}\sidenote{ \textbf{Personally-identifying information:} any piece or set of information From 4a2f5f1fbbc905ca2e52916758184e040b744dfb Mon Sep 17 00:00:00 2001 From: Maria Date: Thu, 20 Feb 2020 11:07:49 -0500 Subject: [PATCH 735/854] Update chapters/handling-data.tex Co-Authored-By: Benjamin Daniels --- chapters/handling-data.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 144413fa3..60ef02adb 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -50,7 +50,7 @@ \section{Protecting confidence in development research} Reproducible and transparent methods are key to maintaining credibility and avoiding serious errors. -This is particularly true for research that relies on new data sources, +This is particularly true for research that relies on original or novel data sources, from innovative big data sources to surveys. The field is slowly moving in the direction of requiring greater transparency. 
Major publishers and funders, most notably the American Economic Association, From 4cb252038e76b0a16a8d57dbf307f11e46685df6 Mon Sep 17 00:00:00 2001 From: Maria Date: Thu, 20 Feb 2020 11:08:10 -0500 Subject: [PATCH 736/854] Update chapters/handling-data.tex Co-Authored-By: Benjamin Daniels --- chapters/handling-data.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 60ef02adb..95ef66340 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -266,7 +266,7 @@ \section{Ensuring privacy and security in research data} with a set of governance standards known as ``The Common Rule''.\sidenote{ \url{https://www.hhs.gov/ohrp/regulations-and-policy/regulations/common-rule/index.html}} If you interact with European institutions or persons, -you will also become familiar with the General Data Protection Regulation ``GDPR'',\sidenote{ +you will also become familiar with the General Data Protection Regulation (GDPR),\sidenote{ \url{http://blogs.lshtm.ac.uk/library/2018/01/15/gdpr-for-research-data}} a set of regulations governing \textbf{data ownership} and privacy standards.\sidenote{ \textbf{Data ownership:} the set of rights governing who may accesss, alter, use, or share data, regardless of who possesses it.} From d8e2f69d493ea987bed4c6c32f75a833d0e4b6c4 Mon Sep 17 00:00:00 2001 From: Maria Date: Thu, 20 Feb 2020 11:08:21 -0500 Subject: [PATCH 737/854] Update chapters/handling-data.tex Co-Authored-By: Benjamin Daniels --- chapters/handling-data.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 95ef66340..6e2823dea 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -299,7 +299,7 @@ \subsection{Obtaining ethical approval and consent} where at least one PI is affiliated, and if work is being done in an international setting, approval is often also required -from an appropriate local institution subject to the laws of the country. +from an appropriate local institution subject to the laws of the country where data originates. One primary consideration of IRBs is the protection of the people about whom information is being collected From 5872d8502887072439ccb70d5ecf0c007832a772 Mon Sep 17 00:00:00 2001 From: Maria Date: Thu, 20 Feb 2020 11:08:29 -0500 Subject: [PATCH 738/854] Update chapters/handling-data.tex Co-Authored-By: Benjamin Daniels --- chapters/handling-data.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 6e2823dea..8b704f60c 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -371,7 +371,7 @@ \subsection{Transmitting and storing data securely} However, when PII is absolutely necessary to a task, such as implementing an intervention or submitting survey data, -you must actively protect that data in transmission and storage. +you must actively protect that information in transmission and storage. There are plenty of options available to keep your data safe, at different prices, from enterprise-grade solutions to free software. 
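Before choosing among those options, it helps to check what actually still needs protecting: a quick keyword scan of variable names and labels is a useful first pass for spotting identifiers left in a working file. A sketch using the built-in \texttt{lookfor} command (the search terms are only illustrative, not an exhaustive list):

    * Sketch: scan variable names and labels for likely identifiers -- terms are illustrative
    lookfor name phone address email gps latitude longitude

Anything the scan flags should either be dropped from the working copy or kept only under the encryption procedures described here.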
From 7f50b6177474151562f3109362c1f51557ae150f Mon Sep 17 00:00:00 2001 From: Maria Date: Thu, 20 Feb 2020 11:08:40 -0500 Subject: [PATCH 739/854] Update chapters/handling-data.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- chapters/handling-data.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 8b704f60c..5938cb92f 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -70,7 +70,7 @@ \subsection{Research reproducibility} It is best practice to verify computational reproducibility before submitting a paper before publication. This should be done by someone who is not on your research team, on a different computer, using exactly the package of code and data files you plan to submit with your paper. -Code that is well-organized into a master do-file, and written to be easily run by others, +Code that is well-organized into a master script, and written to be easily run by others, makes this task simpler. The next chapter discusses organization of data work in detail. From 1c0e5dbf05291dd7bdc048c3506b92d8f1a55494 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 20 Feb 2020 11:17:50 -0500 Subject: [PATCH 740/854] escape character in URL typo --- bibliography.bib | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/bibliography.bib b/bibliography.bib index 3c611addd..1640bd2c6 100644 --- a/bibliography.bib +++ b/bibliography.bib @@ -66,7 +66,7 @@ @inbook {stock2005weak publisher = {Cambridge University Press}, organization = {Cambridge University Press}, address = {New York}, - url = {https://scholar.harvard.edu/files/stock/files/testing\_for\_weak\_instruments\_in\_linear\_iv\s_regression.pdf}, + url = {https://scholar.harvard.edu/files/stock/files/testing\_for\_weak\_instruments\_in\_linear\_ivs\_regression.pdf}, author = {James Stock and Motohiro Yogo}, editor = {Donald W.K. Andrews} } From bfc1493a2e1ee92a6717a047c2ec1ed2a16065af Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 20 Feb 2020 11:40:23 -0500 Subject: [PATCH 741/854] [ch1] no confidential data on email or whatsapp --- chapters/handling-data.tex | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 5938cb92f..560a0a421 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -362,7 +362,9 @@ \subsection{Transmitting and storing data securely} no one who is not listed on the IRB may have access to the decryption key. This means that is it usually not enough to rely service providers' on-the-fly encryption as they need to keep a copy -of the decryption key to make it automatic. +of the decryption key to make it automatic. When confidential data is stored on a local +computer it must always remain encrypted, and confidential data may never be sent unencrypted +over email, WhatsApp, or other chat services. The easiest way to reduce the risk of leaking personal information is to use it as rarely as possible. 
It is often very simple to conduct planning and analytical work From a7f4c402daa8fc01db86db49505df4af3db895b9 Mon Sep 17 00:00:00 2001 From: Maria Date: Thu, 20 Feb 2020 12:15:37 -0500 Subject: [PATCH 742/854] Update handling-data.tex addresses issue #386 --- chapters/handling-data.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 834f15b03..a11d37ca1 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -410,8 +410,8 @@ \subsection{De-identifying data} You can take simple steps to avoid risks by minimizing the handling of PII. First, only collect information that is strictly needed for the research. Second, avoid the proliferation of copies of identified data. -There should only be one raw identified dataset copy -and it should be somewhere where only approved people can access it. +There should never be more than one copy of the raw identified dataset in the project folder, +and it must always be encrypted. Even within the research team, access to PII data should be limited to team members who require it for specific analysis (most analysis will not depend on PII). From 2fa30cace48998460fd94458c9f7594d7b471fa1 Mon Sep 17 00:00:00 2001 From: Maria Date: Thu, 20 Feb 2020 12:28:59 -0500 Subject: [PATCH 743/854] Update handling-data.tex Added content and links on informed consent. --- chapters/handling-data.tex | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 3071222e5..782890e62 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -310,9 +310,14 @@ \subsection{Obtaining ethical approval and consent} before it happens, as it is happening, or after it has already happened. It also means that they must explicitly and affirmatively consent to the collection, storage, and use of their information for any purpose. -Therefore, the development of appropriate consent processes is of primary importance. -Ensuring that research participants are aware that their information -will be stored and may be used for various research purposes is critical. +Therefore, the development of appropriate consent processes is of primary importance.\sidenote{ + url\https://www.icpsr.umich.edu/icpsrweb/content/datamanagement/confidentiality/conf-language.html}} +All survey instruments must include a module in which the sampled respondent grants informed consent to participate. +Research participants must be informed of the purpose of the research, +what their participation will entail in terms of duration and any procedures, +any foreseeable benefits or risks, +and how their identity will be protected.\sidenote{ + \url{https://www.icpsr.umich.edu/icpsrweb/content/datamanagement/confidentiality/conf-language.html}} There are special additional protections in place for vulnerable populations, such as minors, prisoners, and people with disabilities, and these should be confirmed with relevant authorities if your research includes them. From c7fe0b96c57b345839ef13601371e069b19f856c Mon Sep 17 00:00:00 2001 From: Maria Date: Thu, 20 Feb 2020 13:43:30 -0500 Subject: [PATCH 744/854] Move stata justification to intro Moved the paragraph on why stata to the intro, as i think it's useful for this content to come early on and it applies to all chapters (we previously had just a sentence on this in the intro). 
--- chapters/introduction.tex | 43 +++++++++++++++++++++++++++++++-------- 1 file changed, 34 insertions(+), 9 deletions(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index 1b77fe139..07dfbf574 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -77,23 +77,40 @@ \section{Doing credible research at scale} handling data processing and analytical tasks. -\section{Adopting reproducible workflows} +\section{Adopting reproducible tools} We will provide free, open-source, and platform-agnostic tools wherever possible, and point to more detailed instructions when relevant. -Stata is the notable exception here due to its current popularity in development economics. Most tools have a learning and adaptation process, meaning you will become most comfortable with each tool only by using it in real-world work. Get to know them well early on, so that you do not spend a lot of time learning through trial and error. -While adopting the workflows and mindsets described in this book requires an up-front cost, -it will save you (and your collaborators) a lot of time and hassle very quickly. -In part this is because you will learn how to implement essential practices directly; -in part because you will find tools for the more advanced practices; -and most importantly because you will acquire the mindset of doing research with a high-quality data focus. -We hope you will find this book helpful for accomplishing all of the above, -and that mastery of data helps you make an impact. +Stata is the notable exception here due to its current popularity in development economics. +We focus on Stata-specific tools and instructions in this book. +Hence, we will use the terms ``script'' and ``do-file'' +interchangeably to refer to Stata code throughout. +Stata is primarily a scripting language for statistics and data, +meaning that its users often come from economics and statistics backgrounds +and understand Stata to be encoding a set of tasks as a record for the future. +We believe that this must change somewhat: +in particular, we think that practitioners of Stata +must begin to think about their code and programming workflows +just as methodologically as they think about their research workflows, +and that people who adopt this approach will be dramatically +more capable in their analytical ability. +This means that they will be more productive when managing teams, +and more able to focus on the challenges of experimental design +and econometric analysis, rather than spending excessive time +re-solving problems on the computer. +To support this goal, this book also includes +an introductory Stata Style Guide +that we use in our work, which provides +some new standards for coding so that code styles +can be harmonized across teams for easier understanding and reuse of code. +Stata has relatively few resources of this type available, +and the ones that we have created and shared here +we hope will be an asset to all its users. \section{Writing reproducible code in a collaborative environment} @@ -201,6 +218,14 @@ \section{Outline of this book} how and why to publish data, and guidelines for preparing functional and informative replication packages. + +While adopting the workflows and mindsets described in this book requires an up-front cost, +it will save you (and your collaborators) a lot of time and hassle very quickly. 
+In part this is because you will learn how to implement essential practices directly; +in part because you will find tools for the more advanced practices; +and most importantly because you will acquire the mindset of doing research with a high-quality data focus. +We hope you will find this book helpful for accomplishing all of the above, +and that mastery of data helps you make an impact. We hope that by the end of the book, you will have learned how to handle data more efficiently, effectively and ethically at all stages of the research process. From f42184c28197a744db557d60ab0bbc09c008591a Mon Sep 17 00:00:00 2001 From: Maria Date: Thu, 20 Feb 2020 13:45:09 -0500 Subject: [PATCH 745/854] Move stata justification to intro --- chapters/planning-data-work.tex | 33 ++------------------------------- 1 file changed, 2 insertions(+), 31 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 6db5e6a20..836a7587f 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -7,7 +7,7 @@ In order to be prepared to work on the data you receive with a group, you need to structure your workflow in advance. This means knowing which data sets and outputs you need at the end of the process, -how they will stay organized, what types of data you'll handle, +how they will stay organized, what types of data you'll acquire, and whether the data will require special handling due to size or privacy considerations. Identifying these details will help you map out the data needs for your project, and give you a sense for how information resources should be organized. @@ -20,7 +20,7 @@ so it's important to plan ahead. Seemingly small decisions such as sharing services, folder structures, and filenames can be extremely painful to alter down the line in any project. -Similarly, making sure to set up a self-documenting discussion platform +Similarly, make sure to set up a self-documenting discussion platform and process for version control; this makes working together on outputs much easier from the very first discussion. This chapter will guide you on preparing a collaborative work environment, @@ -270,35 +270,6 @@ \subsection{Choosing software} such as folder management, Git integration, and simultaneous work with other types of files, without leaving the editor. -In our field of development economics, -Stata is currently the most commonly used statistical software, -and the built-in do-file editor the most common editor for programming Stata. -We focus on Stata-specific tools and instructions in this book. -Hence, we will use the terms ``script'' and ``do-file'' -interchangeably to refer to Stata code throughout. -This is only in part due to its popularity. -Stata is primarily a scripting language for statistics and data, -meaning that its users often come from economics and statistics backgrounds -and understand Stata to be encoding a set of tasks as a record for the future. -We believe that this must change somewhat: -in particular, we think that practitioners of Stata -must begin to think about their code and programming workflows -just as methodologically as they think about their research workflows, -and that people who adopt this approach will be dramatically -more capable in their analytical ability. -This means that they will be more productive when managing teams, -and more able to focus on the challenges of experimental design -and econometric analysis, rather than spending excessive time -re-solving problems on the computer. 
-To support this goal, this book also includes -an introductory Stata Style Guide -that we use in our work, which provides -some new standards for coding so that code styles -can be harmonized across teams for easier understanding and reuse of code. -Stata has relatively few resources of this type available, -and the ones that we have created and shared here -we hope will be an asset to all its users. - % ---------------------------------------------------------------------------------------------- % ---------------------------------------------------------------------------------------------- \section{Organizing code and data} From f59a7227788defa9553509c5fcd4852020b9b75b Mon Sep 17 00:00:00 2001 From: Maria Date: Thu, 20 Feb 2020 14:35:57 -0500 Subject: [PATCH 746/854] Update planning-data-work.tex removed content on file naming, added to wiki, added wiki link --- chapters/planning-data-work.tex | 48 ++++++++------------------------- 1 file changed, 11 insertions(+), 37 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index fbe2df498..d2cdf11ca 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -370,13 +370,13 @@ \subsection{Organizing files and folder structures} it is intended to be an easy template to start from. This system operates by creating a \texttt{DataWork} folder at the project level, and within that folder, it provides standardized directory structures -for each data source (in the primary data context, ``rounds'' of data collection). +for each data source or survey round. For each, \texttt{iefolder} creates folders for raw encrypted data, raw deidentified data, cleaned data, final data, outputs, and documentation. In parallel, it creates folders for the code files that move the data through this progression, and for the files that manage final analytical work. -The command also has some flexibility for the addition of +The command has some flexibility for the addition of folders for other types of data sources, although this is less well developed as the needs for larger data sets tend to be very specific. The \texttt{ietoolkit} package also includes the \texttt{iegitaddmd} command, @@ -388,11 +388,9 @@ \subsection{Organizing files and folder structures} The \texttt{DataWork} folder may be created either inside an existing project-based folder structure, or it may be created separately. -It's usually created by the leading RA in agreement with the PI. -Increasingly, our recommendation is to create the \texttt{DataWork} folder -separately from the project management materials, -reserving the ``project folder'' for contracts, Terms of Reference, briefs and other administrative or management work. - \index{project folder} +It is preferable to create the \texttt{DataWork} folder +in a code folder maintained separately from the project management materials +(such as contracts, Terms of Reference, briefs and other administrative or management work). This is so the project folder can be maintained in a synced location like Dropbox, while the code folder can be maintained in a version-controlled location like GitHub. (Remember, a version-controlled folder \textit{should not} @@ -408,8 +406,7 @@ \subsection{Organizing files and folder structures} Keeping such plaintext files in a version-controlled folder allows you to maintain better control of their history and functionality. 
Because of the high degree of dependence between code files depend and file structure, -you will be able to enforce better practices in a separate folder than in the project folder, -which will usually be managed by a PI, FC, or field team members. +you will be able to enforce better practices in a separate code folder than in the project folder. Setting up the \texttt{DataWork} folder folder in a version-controlled directory also enables you to use Git and GitHub for version control on your code files. @@ -432,43 +429,20 @@ \subsection{Organizing files and folder structures} and allows you to go back to previous versions without losing the information on changes made. It also makes it possible to work on multiple parallel versions of the code, so you don't risk breaking the code for other team members as you try something new. -The DIME file management and organization approach is designed with this in mind. Once the \texttt{DataWork} folder's directory structure is set up, you should adopt a file naming convention. You will generally be working with two types of files: -``code-compatiable'' files, which are those that are accessed by code processes, -and ``non-code-compatiable'' files, which will not be accessed by code processes. -The former takes precedent: an Excel file is a code-compatiable file +``code-compatible'' files, which are those that are accessed by code processes, +and ``non-code-compatible'' files, which will not be accessed by code processes. +The former takes precedent: an Excel file is a code-compatible file even if it is a field log, because at some point it will be used by code. We will not give much emphasis to files that are not linked to code here; but you should make sure to name them in an orderly fashion that works for your team. These rules will ensure you can find files within folders and reduce the amount of time others will spend opening files -to find out what is inside them. -The main point to be considered is that files accessed by code have special naming requirements\sidenote{ - \url{https://www2.stat.duke.edu/~rcs46/lectures\_2015/01-markdown-git/slides/naming-slides/naming-slides.pdf}}, -since different software and operating systems read file names in different ways. -Some of the differences between the two naming approaches are major and may be new to you, -so below are a few examples. -Introducing spaces between words in a file name (including the folder path) -can break a file's path when it's read by code, -so while a Word document may be called \texttt{2019-10-30 Sampling Procedure Description.docx}, -a related do file would have a name like \texttt{sampling-endline.do}. -Adding timestamps to binary files as in the example above can be useful, -as it is not straightforward to track changes using version control software. -However, for plaintext files version-controlled using Git, timestamps are an unnecessary distraction. -Similarly, code-compatiable files should never include capital letters, -as strings and file paths are case-sensitive in some software. -Finally, one organizational practice that takes some getting used to -is the fact that the best names from a coding perspective -are usually the opposite of those from an English perspective. -For example, for a deidentified household dataset from the baseline round, -you should prefer a name like \texttt{baseline-household-deidentified.dta}, -rather than the opposite way around as occurs in natural language. 
-This ensures that all \texttt{baseline} data stays together, -then all \texttt{baseline-household} data, -and finally provides unique information about this specific file. +to find out what is inside them.\sidenote{\url\{https://dimewiki.worldbank.org/wiki/Naming_Conventions}} + % ---------------------------------------------------------------------------------------------- \subsection{Documenting and organizing code} From 7d5716b51d8e1f46e06779a687bb41f88c94467d Mon Sep 17 00:00:00 2001 From: Maria Date: Thu, 20 Feb 2020 15:10:18 -0500 Subject: [PATCH 747/854] Update planning-data-work.tex Added reference to master .do file template --- chapters/planning-data-work.tex | 40 +++++++++++++++++++++------------ 1 file changed, 26 insertions(+), 14 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index d2cdf11ca..dfbec5d8e 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -446,7 +446,6 @@ \subsection{Organizing files and folder structures} % ---------------------------------------------------------------------------------------------- \subsection{Documenting and organizing code} - Once you start a project's data work, the number of scripts, datasets, and outputs that you have to manage will grow very quickly. This can get out of hand just as quickly, @@ -457,6 +456,10 @@ \subsection{Documenting and organizing code} They all come from the principle that code is an output by itself, not just a means to an end, and should be written thinking of how easy it will be for someone to read it later. +At the end of this section, we include a template for a Master Script in Stata, +to provide a concrete example of the required elements and structure. +Throughout this section, we refer to lines of the example .do file +to give concrete examples of the required code elements, organization and structure. Code documentation is one of the main factors that contribute to readability. Start by adding a code header to every file. @@ -464,19 +467,22 @@ \subsection{Documenting and organizing code} \textbf{Comments:} Code components that have no function, but describe in plain language what the code is supposed to do. } -that details the functionality of the entire script. +that details the functionality of the entire script; +refer to lines 5-10 in the example .do file. This should include simple things such as the purpose of the script and the name of the person who wrote it. If you are using a version control software, the last time a modification was made and the person who made it will be recorded by that software. Otherwise, you should include it in the header. -Finally, use the header to track the inputs and outputs of the script. +You should always track the inputs and outputs of the script, as well as the uniquely identifying variable; +refer to lines 49-51 in the example .do file. When you are trying to track down which code creates which data set, this will be very helpful. While there are other ways to document decisions related to creating code, the information that is relevant to understand the code should always be written in the code file. In the script, alongside the code, are two types of comments that should be included. -The first type of comment describes what is being done. +The first type of comment describes what is being done; +refer to line 35 in the example .do file. 
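For instance (the variable name is hypothetical), a ``what'' comment simply states the task in plain language directly above the line that performs it:

    * Keep only households that completed the endline interview
    keep if endline_complete == 1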
This might be easy to understand from the code itself if you know the language well enough and the code is clear, but often it is still a great deal of work to reverse-engineer the code's intent. @@ -491,7 +497,7 @@ \subsection{Documenting and organizing code} Even you will probably not remember the exact choices that were made in a couple of weeks. Therefore, you must document your precise processes in your code. -Code organization means keeping each piece of code in an easily findable location. +Code organization means keeping each piece of code in an easy-to-find location. \index{code organization} Breaking your code into independently readable ``chunks'' is one good practice on code organization. You should write each functional element as a chunk that can run completely on its own, @@ -499,13 +505,15 @@ \subsection{Documenting and organizing code} created by other code chunks that are not obvious from the immediate context. One way to do this is to create sections where a specific task is completed. So, for example, if you want to find the line in your code where a variable was created, -you can go straight to \texttt{PART 2: Create new variables}, +you can go straight to \texttt{PART 2: Prepare folder paths and define programs}, instead of reading line by line through the entire code. RStudio, for example, makes it very easy to create sections, and it compiles them into an interactive script index for you. In Stata, you can use comments to create section headers, -though they're just there to make the reading easier and don't have functionality. -You should also add an index in the code header by copying and pasting section titles. +though they're just there to make the reading easier and don't have functionality; +refer to line 24 of the example .do file. +You should also add an index in the code header by copying and pasting section titles; +refer to lines 8-10 in the example .do file. You can then add and navigate through them using the \texttt{find} functionality. Since Stata code is harder to navigate, as you will need to scroll through the document, it's particularly important to avoid writing very long scripts. @@ -516,6 +524,7 @@ \subsection{Documenting and organizing code} This is an arbitrary limit, just like the standard restriction of each line to 80 characters: it seems to be ``enough but not too much'' for most purposes. +\subsection{Working with a Master script} To bring all these smaller code files together, you must maintain a master script. \index{master do-file} A master script is the map of all your project's data work @@ -529,8 +538,6 @@ \subsection{Documenting and organizing code} The master script is also where all the settings are established, such as versions, folder paths, functions, and constants used throughout the project. -\codeexample{stata-master-dofile.do}{./code/stata-master-dofile.do} - Try to create the habit of running your code from the master script. Creating ``section switches'' using macros or objects to run only the codes related to a certain task should always be preferred to manually open different scripts to run them in a certain order @@ -543,17 +550,21 @@ \subsection{Documenting and organizing code} and when it does, it may take time for you to understand what's causing an error. The same applies to changes in data sets and results. -To link code, data and outputs, the master script reflects the structure of the \texttt{DataWork} folder in code -through globals (in Stata) or string scalars (in R). 
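A minimal sketch of such section switches in a master do-file (the global names and file paths are hypothetical):

    * Sketch: section switches in a master do-file -- globals and paths are hypothetical
    global runCleaning 1    // set to 0 to skip the cleaning step
    global runAnalysis 1

    if (${runCleaning} == 1) {
        do "${dataWork}/code/cleaning.do"
    }
    if (${runAnalysis} == 1) {
        do "${dataWork}/code/analysis.do"
    }

Running everything through switches like these keeps the order of execution fixed even when only one part of the data work is being updated.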
+To link code, data and outputs, +the master script reflects the structure of the \texttt{DataWork} folder in code +through globals (in Stata) or string scalars (in R); +refer to lines 35-40 of the example .do file. These coding shortcuts can refer to subfolders, so that those folders can be referenced without repeatedly writing out their absolute file paths. Because the \texttt{DataWork} folder is shared by the whole team, its structure is the same in each team member's computer. The only difference between machines should be -the path to the project root folder, i.e. the highest-level shared folder, which in the context of \texttt{iefolder} is the \texttt{DataWork} folder. +the path to the project root folder, i.e. the highest-level shared folder, +which in the context of \texttt{iefolder} is the \texttt{DataWork} folder. This is reflected in the master script in such a way that the only change necessary to run the entire code from a new computer -is to change the path to the project folder to reflect the filesystem and username. +is to change the path to the project folder to reflect the filesystem and username; +refer to lines 27-32 of the example .do file. The code in \texttt{stata-master-dofile.do} shows how folder structure is reflected in a master do-file. Because writing and maintaining a master script can be challenging as a project grows, an important feature of the \texttt{iefolder} is to write master do-files @@ -578,6 +589,7 @@ \subsection{Documenting and organizing code} is the easiest way to be prepared in advance for a smooth project handover or for release of the code to the general public. +\codeexample{stata-master-dofile.do}{./code/stata-master-dofile.do} % ---------------------------------------------------------------------------------------------- \subsection{Managing outputs} From d6aeeed9f3bccad2fc306ffe323695fbe34f8cce Mon Sep 17 00:00:00 2001 From: Maria Date: Thu, 20 Feb 2020 15:13:07 -0500 Subject: [PATCH 748/854] Update planning-data-work.tex small edits --- chapters/planning-data-work.tex | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index dfbec5d8e..bbfae7adb 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -571,7 +571,7 @@ \subsection{Working with a Master script} and add to them whenever new subfolders are created in the \texttt{DataWork} folder.\sidenote{ \url{https://dimewiki.worldbank.org/Master\_Do-files}} -In order to maintain these practices and ensure they are functioning well, +In order to maintain well-documented and organized code, you should agree with your team on a plan to review code as it is written. \index{code review} Reading other people's code is the best way to improve your coding skills. @@ -641,8 +641,9 @@ \subsection{Managing outputs} like documents and presentations using {\LaTeX}\index{{\LaTeX}}.\sidenote{ \url{https://www.latex-project.org} and \url{https://github.com/worldbank/DIME-LaTeX-Templates}.} {\LaTeX} is a document preparation system that can create both text documents and presentations. -The main advantage is that {\LaTeX} uses plaintext for all formatting, + {\LaTeX} uses plaintext for all formatting, and it is necessary to learn its specific markup convention to use it. + The main advantage of using {\LaTeX} is that you can write dynamic documents, that import inputs every time they are compiled. This means you can skip the copying and pasting whenever an output is updated. 
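The folder globals and ``section switches'' described in the patch above are easiest to see in code. The fragment below is an illustrative sketch only; the folder names, usernames, and task name are hypothetical placeholders, and the stata-master-dofile.do template referenced in the chapter remains the authoritative example.

    * Illustrative master-script fragment (Stata)
    * Set the project root depending on who is running the code.
    if c(username) == "jane.doe" global projectfolder "C:/Users/jane.doe/Dropbox/ProjectName"
    if c(username) == "jsmith"   global projectfolder "/Users/jsmith/Dropbox/ProjectName"

    * Every other folder global is defined relative to the root,
    * so only the lines above differ across team members' computers.
    global dataWorkFolder "${projectfolder}/DataWork"
    global baseline       "${dataWorkFolder}/Baseline"
    global outputs        "${dataWorkFolder}/Outputs"

    * Section switch: run only the tasks you need, always from the master script.
    local runCleaning 1
    if (`runCleaning' == 1) {
        do "${baseline}/Dofiles/cleaning.do"
    }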
From 99dd907a4be3d0faf2aa79e2eabff16fa6184de8 Mon Sep 17 00:00:00 2001 From: Maria Date: Thu, 20 Feb 2020 17:05:25 -0500 Subject: [PATCH 749/854] Update chapters/planning-data-work.tex Co-Authored-By: Benjamin Daniels --- chapters/planning-data-work.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index bbfae7adb..ce9b39d20 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -456,7 +456,7 @@ \subsection{Documenting and organizing code} They all come from the principle that code is an output by itself, not just a means to an end, and should be written thinking of how easy it will be for someone to read it later. -At the end of this section, we include a template for a Master Script in Stata, +At the end of this section, we include a template for a master script in Stata, to provide a concrete example of the required elements and structure. Throughout this section, we refer to lines of the example .do file to give concrete examples of the required code elements, organization and structure. From 4a254dc4f2db7ed3956937c30ec9485b1f6579cd Mon Sep 17 00:00:00 2001 From: Maria Date: Thu, 20 Feb 2020 17:05:32 -0500 Subject: [PATCH 750/854] Update chapters/planning-data-work.tex Co-Authored-By: Benjamin Daniels --- chapters/planning-data-work.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index ce9b39d20..0b9ceb6b9 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -458,7 +458,7 @@ \subsection{Documenting and organizing code} and should be written thinking of how easy it will be for someone to read it later. At the end of this section, we include a template for a master script in Stata, to provide a concrete example of the required elements and structure. -Throughout this section, we refer to lines of the example .do file +Throughout this section, we refer to lines of the example do file to give concrete examples of the required code elements, organization and structure. Code documentation is one of the main factors that contribute to readability. From 4150fc238afba56f10734b49c822f52bc659f82e Mon Sep 17 00:00:00 2001 From: Maria Date: Thu, 20 Feb 2020 17:05:39 -0500 Subject: [PATCH 751/854] Update chapters/planning-data-work.tex Co-Authored-By: Benjamin Daniels --- chapters/planning-data-work.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 0b9ceb6b9..83afb6deb 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -468,7 +468,7 @@ \subsection{Documenting and organizing code} but describe in plain language what the code is supposed to do. } that details the functionality of the entire script; -refer to lines 5-10 in the example .do file. +refer to lines 5-10 in the example do file. This should include simple things such as the purpose of the script and the name of the person who wrote it. 
If you are using a version control software, From 4f5f377fc8996aa9fa47069aa7da572c98762369 Mon Sep 17 00:00:00 2001 From: Maria Date: Thu, 20 Feb 2020 17:05:45 -0500 Subject: [PATCH 752/854] Update chapters/planning-data-work.tex Co-Authored-By: Benjamin Daniels --- chapters/planning-data-work.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 83afb6deb..975bb53a2 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -475,7 +475,7 @@ \subsection{Documenting and organizing code} the last time a modification was made and the person who made it will be recorded by that software. Otherwise, you should include it in the header. You should always track the inputs and outputs of the script, as well as the uniquely identifying variable; -refer to lines 49-51 in the example .do file. +refer to lines 49-51 in the example do file. When you are trying to track down which code creates which data set, this will be very helpful. While there are other ways to document decisions related to creating code, the information that is relevant to understand the code should always be written in the code file. From f99f9a95885946969cce7be44a546d8f3fec81ff Mon Sep 17 00:00:00 2001 From: Maria Date: Thu, 20 Feb 2020 17:05:52 -0500 Subject: [PATCH 753/854] Update chapters/planning-data-work.tex Co-Authored-By: Benjamin Daniels --- chapters/planning-data-work.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 975bb53a2..c03a8ee61 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -482,7 +482,7 @@ \subsection{Documenting and organizing code} In the script, alongside the code, are two types of comments that should be included. The first type of comment describes what is being done; -refer to line 35 in the example .do file. +refer to line 35 in the example do file. This might be easy to understand from the code itself if you know the language well enough and the code is clear, but often it is still a great deal of work to reverse-engineer the code's intent. From a2ceef06bb70908835a708fc94ce355b585c406f Mon Sep 17 00:00:00 2001 From: Maria Date: Thu, 20 Feb 2020 17:05:58 -0500 Subject: [PATCH 754/854] Update chapters/planning-data-work.tex Co-Authored-By: Benjamin Daniels --- chapters/planning-data-work.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index c03a8ee61..a693d413c 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -524,7 +524,7 @@ \subsection{Documenting and organizing code} This is an arbitrary limit, just like the standard restriction of each line to 80 characters: it seems to be ``enough but not too much'' for most purposes. -\subsection{Working with a Master script} +\subsection{Working with a master script} To bring all these smaller code files together, you must maintain a master script. 
\index{master do-file} A master script is the map of all your project's data work From 1fff82aad300dbf67692a8ddb5c57edf2af63d2f Mon Sep 17 00:00:00 2001 From: Maria Date: Thu, 20 Feb 2020 17:06:05 -0500 Subject: [PATCH 755/854] Update chapters/planning-data-work.tex Co-Authored-By: Benjamin Daniels --- chapters/planning-data-work.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index a693d413c..c4beda8db 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -553,7 +553,7 @@ \subsection{Working with a master script} To link code, data and outputs, the master script reflects the structure of the \texttt{DataWork} folder in code through globals (in Stata) or string scalars (in R); -refer to lines 35-40 of the example .do file. +refer to lines 35-40 of the example do file. These coding shortcuts can refer to subfolders, so that those folders can be referenced without repeatedly writing out their absolute file paths. Because the \texttt{DataWork} folder is shared by the whole team, From ad2a84fe293402317242c5cbdc1f175049c0d053 Mon Sep 17 00:00:00 2001 From: Maria Date: Thu, 20 Feb 2020 17:06:12 -0500 Subject: [PATCH 756/854] Update chapters/planning-data-work.tex Co-Authored-By: Benjamin Daniels --- chapters/planning-data-work.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index c4beda8db..92fb29fad 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -564,7 +564,7 @@ \subsection{Working with a master script} This is reflected in the master script in such a way that the only change necessary to run the entire code from a new computer is to change the path to the project folder to reflect the filesystem and username; -refer to lines 27-32 of the example .do file. +refer to lines 27-32 of the example do file. The code in \texttt{stata-master-dofile.do} shows how folder structure is reflected in a master do-file. Because writing and maintaining a master script can be challenging as a project grows, an important feature of the \texttt{iefolder} is to write master do-files From 2d472a2deb100c65cf72c27a7d6e6bb2327c935f Mon Sep 17 00:00:00 2001 From: Maria Date: Thu, 20 Feb 2020 17:06:25 -0500 Subject: [PATCH 757/854] Update chapters/planning-data-work.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- chapters/planning-data-work.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 92fb29fad..af15dcce2 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -389,7 +389,7 @@ \subsection{Organizing files and folder structures} The \texttt{DataWork} folder may be created either inside an existing project-based folder structure, or it may be created separately. It is preferable to create the \texttt{DataWork} folder -in a code folder maintained separately from the project management materials +separately from the project management materials (such as contracts, Terms of Reference, briefs and other administrative or management work). This is so the project folder can be maintained in a synced location like Dropbox, while the code folder can be maintained in a version-controlled location like GitHub. 
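To make the header elements discussed in the patches above concrete, a do-file might open with something like the sketch below. The file paths, variable names, and section titles are hypothetical, and the line numbers cited in the chapter refer to the book's own example do-file rather than to this fragment.

    /*******************************************************************************
    * Purpose : Clean the baseline household survey data (illustrative sketch)
    * Author  : A. Researcher
    * Index   : PART 1: Load raw data
    *           PART 2: Create new variables
    * Inputs  : ${baseline}/Raw/household_raw.dta
    * Outputs : ${baseline}/Clean/household_clean.dta
    * ID var  : hhid uniquely identifies observations
    *******************************************************************************/

    * PART 1: Load raw data *******************************************************
    use "${baseline}/Raw/household_raw.dta", clear
    isid hhid                        // confirm the uniquely identifying variable

    * PART 2: Create new variables ************************************************
    * Flag households that report income above the sample median
    egen income_median = median(income)
    gen  high_income   = (income > income_median) if !missing(income)

    save "${baseline}/Clean/household_clean.dta", replace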
From ec4792dcb0f1d184d2ed7c7a495ac4b2eb877ad3 Mon Sep 17 00:00:00 2001 From: Maria Date: Thu, 20 Feb 2020 17:25:16 -0500 Subject: [PATCH 758/854] Update research-design.tex small edits to ch3 --- chapters/research-design.tex | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 92a7f3e16..773ae1b5b 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -33,6 +33,7 @@ in response to an unexpected event. Intuitive knowledge of your project's chosen approach will make you much more effective at the analytical part of your work. + This chapter first covers causal inference methods. Next it discusses how to measure treatment effects and structure data for specific methods, including cross-sectional randomized control trials, difference-in-difference designs, @@ -66,6 +67,7 @@ \section{Causality, inference, and identification} Therefore it is important to understand how exactly your study identifies its estimate of treatment effects, so you can calculate and interpret those estimates appropriately. + All the study designs we discuss here use the potential outcomes framework\cite{athey2017state} to compare a group that received some treatment to another, counterfactual group. Each of these approaches can be used in two types of designs: @@ -87,10 +89,10 @@ \subsection{Estimating treatment effects using control groups} and the \textbf{average treatment effect (ATE)} is the average of all individual differences across the potentially treated population. \index{average treatment effect} -This is the parameter that most research designs attempt to estimate. -Their goal is to establish a \textbf{counterfactual}\sidenote{ +This is the parameter that most research designs attempt to estimate, +by establishing a \textbf{counterfactual}\sidenote{ \textbf{Counterfactual:} A statistical description of what would have happened to specific individuals in an alternative scenario, for example, a different treatment assignment outcome.} -for the treatment group with which outcomes can be directly compared. +for the treatment group against which outcomes can be directly compared. \index{counterfactual} There are several resources that provide more or less mathematically intensive approaches to understanding how various methods do this. @@ -137,7 +139,7 @@ \subsection{Estimating treatment effects using control groups} Typically, causal inference designs are not interested in predictive accuracy, and the estimates and predictions that they produce will not be as good at predicting outcomes or fitting the data as other models. -Additionally, when control variables or other variables are used in estimation, +Second, when control variables or other variables are used in estimation, there is no guarantee that the resulting parameters are marginal effects. They can only be interpreted as correlative averages, unless there are additional sources of identification. @@ -195,7 +197,7 @@ \subsection{Experimental and quasi-experimental research designs} since they require strong assumptions about the randomness or non-randomness of takeup. Therefore a large amount of field time and descriptive work must be dedicated to understanding how these effects played out in a given study, -and often overshadow the effort put into the econometric design itself. +and may overshadow the effort put into the econometric design itself. 
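For reference, the potential-outcomes quantities discussed in the research-design hunk above can be written compactly. This is standard notation added as a sketch alongside the chapter text, not a quotation from it:

    \[ \text{ATE} = E\big[Y_i(1) - Y_i(0)\big], \]

where $Y_i(1)$ and $Y_i(0)$ are the outcomes unit $i$ would experience with and without treatment. Under randomized assignment of a treatment indicator $D_i$, the ATE is identified by the simple difference in group means, $E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0]$, which is why a credible control group can serve as the counterfactual.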
\textbf{Quasi-experimental} research designs,\sidenote{ \url{https://dimewiki.worldbank.org/Quasi-Experimental_Methods}} @@ -209,7 +211,7 @@ \subsection{Experimental and quasi-experimental research designs} of having access to data collected at the right times and places to exploit events that occurred in the past, or having the ability to collect data in a time and place -where an event that produces causal identification occurred. +where an event that produces causal identification occurred or will occur. Therefore, these methods often use either secondary data, or they use primary data in a cross-sectional retrospective method, including administrative data or other new classes of routinely-collected information. From aff0db322c9a83e6e0404736a94a501759116fb9 Mon Sep 17 00:00:00 2001 From: Maria Date: Thu, 20 Feb 2020 17:26:26 -0500 Subject: [PATCH 759/854] Update chapters/planning-data-work.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- chapters/planning-data-work.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index af15dcce2..be5ddc802 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -497,7 +497,7 @@ \subsection{Documenting and organizing code} Even you will probably not remember the exact choices that were made in a couple of weeks. Therefore, you must document your precise processes in your code. -Code organization means keeping each piece of code in an easy-to-find location. +Code organization means keeping each piece of code in an easy-to-find location and naming them in a meaningful way. \index{code organization} Breaking your code into independently readable ``chunks'' is one good practice on code organization. You should write each functional element as a chunk that can run completely on its own, From 7a73864f826d0a9059d2f3240118a1023dfd78c8 Mon Sep 17 00:00:00 2001 From: Maria Date: Thu, 20 Feb 2020 17:29:04 -0500 Subject: [PATCH 760/854] Update planning-data-work.tex changed all ".do file" and "do file" to "do-file" --- chapters/planning-data-work.tex | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index af15dcce2..cf17f915c 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -458,7 +458,7 @@ \subsection{Documenting and organizing code} and should be written thinking of how easy it will be for someone to read it later. At the end of this section, we include a template for a master script in Stata, to provide a concrete example of the required elements and structure. -Throughout this section, we refer to lines of the example do file +Throughout this section, we refer to lines of the example do-file to give concrete examples of the required code elements, organization and structure. Code documentation is one of the main factors that contribute to readability. @@ -468,21 +468,21 @@ \subsection{Documenting and organizing code} but describe in plain language what the code is supposed to do. } that details the functionality of the entire script; -refer to lines 5-10 in the example do file. +refer to lines 5-10 in the example do-file. This should include simple things such as the purpose of the script and the name of the person who wrote it. 
If you are using a version control software, the last time a modification was made and the person who made it will be recorded by that software. Otherwise, you should include it in the header. You should always track the inputs and outputs of the script, as well as the uniquely identifying variable; -refer to lines 49-51 in the example do file. +refer to lines 49-51 in the example do-file. When you are trying to track down which code creates which data set, this will be very helpful. While there are other ways to document decisions related to creating code, the information that is relevant to understand the code should always be written in the code file. In the script, alongside the code, are two types of comments that should be included. The first type of comment describes what is being done; -refer to line 35 in the example do file. +refer to line 35 in the example do-file. This might be easy to understand from the code itself if you know the language well enough and the code is clear, but often it is still a great deal of work to reverse-engineer the code's intent. @@ -511,9 +511,9 @@ \subsection{Documenting and organizing code} and it compiles them into an interactive script index for you. In Stata, you can use comments to create section headers, though they're just there to make the reading easier and don't have functionality; -refer to line 24 of the example .do file. +refer to line 24 of the example do-file. You should also add an index in the code header by copying and pasting section titles; -refer to lines 8-10 in the example .do file. +refer to lines 8-10 in the example do-file. You can then add and navigate through them using the \texttt{find} functionality. Since Stata code is harder to navigate, as you will need to scroll through the document, it's particularly important to avoid writing very long scripts. @@ -553,7 +553,7 @@ \subsection{Working with a master script} To link code, data and outputs, the master script reflects the structure of the \texttt{DataWork} folder in code through globals (in Stata) or string scalars (in R); -refer to lines 35-40 of the example do file. +refer to lines 35-40 of the example do-file. These coding shortcuts can refer to subfolders, so that those folders can be referenced without repeatedly writing out their absolute file paths. Because the \texttt{DataWork} folder is shared by the whole team, @@ -564,7 +564,7 @@ \subsection{Working with a master script} This is reflected in the master script in such a way that the only change necessary to run the entire code from a new computer is to change the path to the project folder to reflect the filesystem and username; -refer to lines 27-32 of the example do file. +refer to lines 27-32 of the example do-file. The code in \texttt{stata-master-dofile.do} shows how folder structure is reflected in a master do-file. 
Because writing and maintaining a master script can be challenging as a project grows, an important feature of the \texttt{iefolder} is to write master do-files From f9686f239a81e4863dddbe4fd753dfa3c461b49e Mon Sep 17 00:00:00 2001 From: Maria Date: Thu, 20 Feb 2020 17:45:43 -0500 Subject: [PATCH 761/854] Update sampling-randomization-power.tex edits to Random Processes section --- chapters/sampling-randomization-power.tex | 37 +++++++++++------------ 1 file changed, 17 insertions(+), 20 deletions(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 62b962a59..24560b5d2 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -41,7 +41,7 @@ %----------------------------------------------------------------------------------------------- -\section{Random processes in Stata} +\section{Random processes} Most experimental designs rely directly on random processes, particularly sampling and randomization, to be executed in code. @@ -50,7 +50,7 @@ \section{Random processes in Stata} and treatment assignment processes are truly random. Therefore, understanding and correctly programming for sampling and randomization is essential to ensuring that planned experiments -are correctly implemented in the field, so that the results +are correctly implemented, so that the results can be interpreted according to the experimental design. There are two distinct random processes referred to here: the conceptual process of assigning units to treatment arms, @@ -58,30 +58,25 @@ \section{Random processes in Stata} which is a part of all tasks that include a random component.\sidenote{ \url{https://blog.stata.com/2016/03/10/how-to-generate-random-numbers-in-stata}} +Any process that includes a random component is a random process, +including sampling, randomization, power calculation simulations, and algorithms like bootstrapping. Randomization is challenging and its mechanics are unintuitive for the human brain. ``True'' randomization is also nearly impossible to achieve for computers, which are inherently deterministic.\sidenote{ \url{https://www.random.org/randomness}} + +\subsection{Implementing random processes reproducibly in Stata} + +Reproducibility in statistical programming means that the outputs of random processes +can be re-obtained at a future time.\cite{orozco2018make} For our purposes, we will focus on what you need to understand in order to produce truly random results for your project using Stata, and how you can make sure you can get those exact results again in the future. This takes a combination of strict rules, solid understanding, and careful programming. -This section introduces the strict rules: these are non-negotiable (but thankfully simple). -The second section provides basic introductions to the tasks of sampling and randomization, -and the third introduces common varieties encountered in the field. -The fourth section discusses more advanced topics that are used -to analyze the random processes directly in order to understand their properties. -However, field realities will inevitably -be more complex than anything we present here, -and you will need to recombine these lessons to match your project's needs. - -\subsection{Ensuring reproducibility in random Stata processes} +This section introduces the strict rules: these are non-negotiable (but thankfully simple). 
+At the end of the section, +we provide a do-file that provides a concrete example of how to implement these principles. -Any process that includes a random component is a random process, -including sampling, randomization, power calculation simulations, and algorithms like bootstrapping. -Reproducibility in statistical programming means that the outputs of random processes -can be re-obtained at a future time. -All random methods must be reproducible.\cite{orozco2018make} Stata, like most statistical software, uses a \textbf{pseudo-random number generator}. Basically, it has a pre-calculated really long ordered list of numbers with the property that knowing the previous one gives you precisely zero information about the next one, i.e. a list of random numbers. @@ -142,14 +137,16 @@ \subsection{Ensuring reproducibility in random Stata processes} so carefully confirm exactly how your code runs before finalizing it.\sidenote{ \url{https://dimewiki.worldbank.org/Randomization_in_Stata}} -\codeexample{replicability.do}{./code/replicability.do} - To confirm that a randomization has worked well before finalizing its results, save the outputs of the process in a temporary location, re-run the code, and use \texttt{cf} or \texttt{datasignature} to ensure nothing has changed. It is also advisable to let someone else reproduce your randomization results on their machine to remove any doubt that your results -are reproducable. +are reproducible. + +\codeexample{replicability.do}{./code/replicability.do} + + %----------------------------------------------------------------------------------------------- From a4d3c81a83cd6dfca2d27f7e593399553b4c7471 Mon Sep 17 00:00:00 2001 From: Maria Date: Thu, 20 Feb 2020 17:57:48 -0500 Subject: [PATCH 762/854] Update sampling-randomization-power.tex edits to the rest of chapter 4 --- chapters/sampling-randomization-power.tex | 32 ++++++++++++++--------- 1 file changed, 19 insertions(+), 13 deletions(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 24560b5d2..7a5e44f19 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -204,7 +204,10 @@ \subsection{Sampling} because it will make the probability of selection very hard to calculate.) There are a number of shortcuts to doing this process, but they all use this method as the starting point, -so you should become familiar with exactly how this method works. +so you should become familiar with exactly how it works. +The do-file below provides an example of how to implement uniform-probability sampling in practice. + +\codeexample{simple-sample.do}{./code/simple-sample.do} Almost all of the relevant considerations for sampling come from two sources: deciding what population, if any, a sample is meant to represent (including subgroups); @@ -218,7 +221,6 @@ \subsection{Sampling} Ex post changes to the study scope using a sample drawn for a different purpose usually involve tedious calculations of probabilities and should be avoided. -\codeexample{simple-sample.do}{./code/simple-sample.do} \subsection{Randomization} @@ -242,15 +244,18 @@ \subsection{Randomization} \url{https://dimewiki.worldbank.org/Randomization_in_Stata}} Sampling typically has only two possible outcomes: observed and unobserved. 
Randomization, by contrast, often involves multiple possible results -which each represent various varieties of treatments to be delivered; +which each represent different varieties of treatments to be delivered; in some cases, multiple treatment assignments are intended to overlap in the same sample. Complexity can therefore grow very quickly in randomization and it is doubly important to fully understand the conceptual process that is described in the experimental design, -and fill in any gaps in the process before implementing it in code. +and fill in any gaps before implementing it in code. +The do-file below provides an example of how to implement a simple random assignment of multiple treatment arms. -Some types of experimental designs necessitate that randomization results be revealed during data collection. -It is possible to do this using survey software or live events. +\codeexample{simple-multi-arm-randomization.do}{./code/simple-multi-arm-randomization.do} + +Some types of experimental designs necessitate that randomization results be revealed in the field. +It is possible to do this using survey software or live events, such as a live lottery. These methods typically do not leave a record of the randomization, so particularly when the experiment is done as part of data collection, it is best to execute the randomization in advance and preload the results. @@ -261,7 +266,6 @@ \subsection{Randomization} Understanding that process will also improve the ability of the team to ensure that the field randomization process is appropriately designed and executed. -\codeexample{simple-multi-arm-randomization.do}{./code/simple-multi-arm-randomization.do} %----------------------------------------------------------------------------------------------- @@ -285,8 +289,8 @@ \subsection{Clustering} Many studies observe data at a different level than the randomization unit.\sidenote{ \url{https://dimewiki.worldbank.org/Unit_of_Observation}} -For example, a policy may only be able to affect an entire village, -but the study is interested in household behavior. +For example, a policy may be implemented at the village level, +but the outcome of interest for the study is behavior changes at the household level. This type of structure is called \textbf{clustering},\sidenote{ \url{https://dimewiki.worldbank.org/Multi-stage_(Cluster)_Sampling}} and the groups in which units are assigned to treatment are called clusters. @@ -325,11 +329,13 @@ \subsection{Stratification} In particular, it is difficult to precisely account for the interaction of strata sizes with multiple treatment arms. Even for a very simple design, the method of randomly ordering the observations -will often create very skewed assignments. +will often create very skewed assignments.\sidenote{\url + {https://blogs.worldbank.org/impactevaluations/tools-of-the-trade-doing-stratified-randomization-with-uneven-numbers-in-some-strata}} This is especially true when a given stratum contains a small number of clusters, and when there are a large number of treatment arms, since the strata will rarely be exactly divisible by the number of arms.\cite{carril2017dealing} -The user-written \texttt{randtreat} command properly implements stratification. +The user-written \texttt{randtreat} command properly implements stratification, +as shown in the do-file below. However, the options and outputs (including messages) from the command should be carefully reviewed so that you understand exactly what has been implemented. 
Notably, it is extremely hard to target precise numbers of observations @@ -399,13 +405,13 @@ \subsection{Power calculations} of your design are located, so you know the relative tradeoffs you will face by changing your randomization scheme for the final design. They also allow realistic interpretations of evidence: -results of low-power studies can be very interesting, +low-power studies can be very interesting, but they have a correspondingly higher likelihood of reporting false positive results. The classic definition of power is the likelihood that a design detects a significant treatment effect, -given that there is a non-zero true effect in reality. +given that there is a non-zero treatment effect in reality. This definition is useful retrospectively, but it can also be re-interpreted to help in experimental design. There are two common and useful practical applications From 9d164721feac254ab3e9842a10de94047e2116bc Mon Sep 17 00:00:00 2001 From: Maria Date: Fri, 21 Feb 2020 09:42:55 -0500 Subject: [PATCH 763/854] Update data-collection.tex Updated intro --- chapters/data-collection.tex | 47 +++++++++++++++++------------------- 1 file changed, 22 insertions(+), 25 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 6a5d8e2ee..744edcc7f 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -3,40 +3,37 @@ \begin{fullwidth} Much of the recent push toward credibility in the social sciences has focused on analytical practices. However, credible development research often depends, first and foremost, on the quality of the raw data. -This is because, when you are collecting the data yourself, -or it is provided only to you through a unique partnership, -there is no way for others to validate that it accurately reflects the reality +When you are using original data - whether collected for the first time through surveys or sensors or acquired through a unique partnership - +there is no way for others to validate that it accurately reflects reality and that the indicators you have based your analysis on are meaningful. This chapter details the necessary components for a high-quality data acquisition process, -no matter whether you are recieving large amounts of unique data from partners -or fielding a small, specialized custom survey. -It begins with a discussion of some key ethical and legal descriptions -to ensure that you have the right to do research using a specific dataset. -Particularly when sensitive data is being collected by you -or shared with you from a program implementer, government, or other partner, -you need to make sure these permissions are correctly granted and documented, -so that the ownership and licensing of all information is established -and the privacy rights of the people it describes are respected. +no matter whether you are receiving large amounts of unique data from partners +or fielding a small, specialized custom survey. By following these guidelines, you will be able to move on to data analysis, +assured that your data has been obtained at high standards of both quality and security. +The chapter begins with a discussion of some key ethical and legal descriptions +to ensure that you have the right to do research using a specific dataset. +Particularly when confidential data is being collected at your behest +or shared with you by a program implementer, government, or other partner, +you need to make sure permissions are correctly granted and documented. 
+Clearly establishing ownership and licensing of all information protects +the privacy rights of the people it describes and your own right to publish. While the principles of data governance and data quality apply to all types of data, there are additional considerations to ensuring data quality if you are collecting data yourself through an instrument like a field survey. This chapter provides detailed guidance on the data generation workflow, from questionnaire design to programming electronic instruments and monitoring data quality. -While surveys remain popular, the rise of electronic data collection instruments -means that there are additional workflow considerations needed -to ensure that your data is accurate and usable in statistical software. -There are many excellent resources on questionnaire design and field supervision, -but few covering the particular challenges and opportunities presented by electronic surveys. -As there are many survey software, and the market is rapidly evolving, -we focus on workflows and primary concepts, rather than software-specific tools. +As there are many excellent resources on questionnaire design and field supervision, +but few covering the particular challenges and opportunities presented by electronic surveys, +we focus on the specific workflow considerations to ensure that +digitally-generated data is accurate and usable in statistical software. +There are many survey software options, and the market is rapidly evolving, +so we focus on workflows and primary concepts rather than software-specific tools. We conclude with a discussion of safe handling, storage, and sharing of data. -Regardless of the type of data you collect, -the secure management of those files is a basic requirement -for satisfying the legal and ethical agreements that have allowed you -to access personal information for research purposes in the first place. -By following these guidelines, you will be able to move on to data analysis, -assured that your data has been obtained at high standards of both quality and security. +Secure file management is a basic requirement +to comply with the legal and ethical agreements that allow + access to personal information for research purposes. + \end{fullwidth} From 6ea21462abedea91c57f08981600d9fe8ed8f1da Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Kristoffer=20Bj=C3=A4rkefur?= Date: Fri, 21 Feb 2020 10:30:15 -0500 Subject: [PATCH 764/854] [ch4] add one word to help reader follow this applies to all numbers, so this is not more correct, but I think this helps the users stay on track --- chapters/sampling-randomization-power.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 7a5e44f19..0cf777dc7 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -108,7 +108,7 @@ \subsection{Implementing random processes reproducibly in Stata} since Stata's \texttt{version} setting expires after each time you run your do-files. \textbf{Sorting} means that the actual data that the random process is run on is fixed. -Because numbers are assigned to each observation in row-by-row starting from +Because random numbers are assigned to each observation in row-by-row starting from the top row, changing their order will change the result of the process. Since the exact order must be unchanged, the underlying data itself must be unchanged as well between runs. 
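The reproducibility rules discussed in this part of the chapter, fixing the Stata version, the sort order, and the random-number seed, are easiest to see in a few lines of code. The fragment below is an illustrative sketch only; the ID variable and the seed value are placeholders, and the replicability.do example included in the chapter remains the authoritative template.

    * Illustrative sketch of a reproducible random process in Stata
    version 13.1                  // fix the version so the random-number generator behaves identically
    isid hhid, sort               // fix the sort order using the uniquely identifying variable
    set seed 287608               // seed taken once from a true random source (placeholder value)

    * Random numbers are now assigned to observations in a stable, reproducible order.
    gen random_draw = runiform()
    gen sample_flag = (random_draw <= 0.30)   // for example, a uniform-probability 30 percent sample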
From cf99b8226eef4ea38d6f547382f4ed9678d3e53d Mon Sep 17 00:00:00 2001 From: Maria Date: Fri, 21 Feb 2020 11:43:45 -0500 Subject: [PATCH 765/854] Update data-collection.tex Updates to data ownership and data licensing subsections --- chapters/data-collection.tex | 76 +++++++++++++++++++++++------------- 1 file changed, 49 insertions(+), 27 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 744edcc7f..4c2583203 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -41,16 +41,20 @@ \section{Acquiring data} High-quality data is essential to most modern development research. -Often, there is simply no source of reliable official statistics -on the inputs or outcomes we are interested in. -Therefore we undertake to create or obtain development data --- including administrative data, secondary data, original records, -field surveys, or other forms of big data -- -typically in partnership with a local agency or organization. -The intention of this mode of data acquisition -is to answer a unique question that cannot be approached in any other way, -so it is important to properly collect and handle that data, -especially when it belongs to or describes people. +Many research questions require the generation of novel data, +because no source of reliable official statistics +or other public data that addresses the inputs or outcomes of interest. +Data generation can take many forms, including: +primary data collection through surveys; +private sector partnerships granting access to new data sources, including administrative and sensor data; +digitization of paper records, including administrative data; +primary data capture by unmanned aerial vehicles or other types of remote sensing; +or novel integration of various types of datasets, e.g. combining survey and sensor data. +Except in the case of primary surveys funded by the research team, +the data is typically not owned by the research team. +Data ownership and licensing agreements are required +for the research team to access the data and publish derivative work. + \subsection{Data ownership} @@ -88,11 +92,12 @@ \subsection{Data licensing agreements} Data licensing is the formal act of giving some data rights to others while retaining ownership of a particular dataset. -Whether or not you are the owner of a dataset you want to analyze, -you can enter into a licensing agreement to access it for research purposes. +If you are not the owner of the dataset you want to analyze, +you should enter into a licensing or terms-of-use agreement to access it for research purposes. Similarly, when you own a dataset, -you may be interested in allowing specific people -or the general public to use it for various reasons. +you must consider whether the data can be made accessible to other researchers, +and what terms-of-use you require. + As a researcher, it is your responsibility to respect the rights of people who own data and people who are described in it; but it is also your responsibility to make sure @@ -102,31 +107,48 @@ \subsection{Data licensing agreements} and to fully inform others of what you are doing. Writing down and agreeing to specific details is a good way of doing that. -When you are licensing someone else's data for research, -keep in mind that they are not likely to be familiar +If the research team requires access to existing data for novel research, +terms of use should be agreed on with the data owner, +typically through a data licensing agreement. 
Keep in mind that the data owner is likely not familiar
 with the research process,
 and therefore may be surprised at some of the things you want to do
 if you are not clear up front.
-You will typically want the right to create and retain
-derivative indicators, and you will want to own that output dataset.
-You will want to store, catalog, or publish, in whole or in part,
-either the original licensed material or the derived dataset.
+You will typically want intellectual property rights to all derivative works developed using the data,
+and a license for all uses of derivative works, including public distribution
+(unless ethical considerations contraindicate this).
+This is important to allow the research team to store, catalog, and publish, in whole or in part,
+either the original licensed dataset or the derived materials.
 Make sure that the license you obtain from the data owner allows these uses,
-and that you check in with them if you have any questions
-about what you are allowed to do with specific portions of their data.
-
-When you are licensing your own data for release,
-whether it is to a particular individual or to a group,
-make sure you take the same considerations.
-Would you be okay with someone else publicly releasing that data in full?
+and that you consult with the owner if you foresee exceptions for specific portions of the data.
+
+If the research team contracts a vendor to collect original data,
+contract terms must clearly stipulate that the research team owns the data
+and maintains full intellectual property rights.
+The contract should also explicitly stipulate that the contracted firm
+is responsible for protecting the confidentiality of the respondents,
+and that the data collection will not be distributed to any third parties
+or used by the firm or subcontractors for any purpose not expressly stated in the contract,
+before, during or after the assignment.
+The contract should also stipulate that the vendor is required to comply with
+ethical standards for social science research,
+and adhere to the specific terms of agreement with the relevant
+Institutional Review Board or applicable local authority.
+
+Research teams that collect their own data must consider the terms
+under which they will release that data to other researchers or to the general public.
+Will you publicly release the data in full (after removing personal identifiers)?
 Would you be okay with it being stored on servers anywhere in the world,
 even ones that are owned by corporations or governments abroad?
+Would you prefer to decide permission on a case-by-case basis, depending on the specific proposed uses?
 Would you expect that users of your data cite you or give you credit,
 or would you require them in turn to release their derivative data
 or publications under similar licenses as yours?
 Whatever your answers are to these questions,
 make sure your license or other agreement
+under which you publish the data specifically details those requirements.
+
 \subsection{Receiving data from development partners}
 
-Data may be received from development partners in various ways.
From ece359f9973cc7b801b683df73ed67600117316c Mon Sep 17 00:00:00 2001 From: Maria Date: Fri, 21 Feb 2020 11:55:13 -0500 Subject: [PATCH 766/854] Update data-collection.tex updates to 'receiving data from development partners' --- chapters/data-collection.tex | 25 ++++++++++++++++--------- 1 file changed, 16 insertions(+), 9 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 4c2583203..2fc60214b 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -151,23 +151,21 @@ \subsection{Data licensing agreements} \subsection{Receiving data from development partners} -Data may be received from development partners in various ways. -You may conduct a first-hand survey either of them or with them -(more on that in the next section). +Research teams granted access to existing data may receive that data in a number of different ways. You may receive access to servers or accounts that already exist. You may receive a one-time transfer of a block of data, or you may be given access to a restricted area to extract information. +In all cases, you must take action to ensure that data is transferred through +secure channels so that confidential data is not compromised. Talk to an information-technology specialist, either at your organization or at the partner organization, to ensure that data is being transferred, received, and stored in a method that conforms to the relevant level of security. -The data owner will determine the appropriate level of security. -Whether or not you are the data owner, you will need to use your judgment -and follow the data protocols that were determined -in the course of your IRB approval to obtain and use the data: -these may be stricter than the requirements of the data provider. +Keep in mind that compliance with ethical standards may +in some cases require a stricter level of security than initially proposed by the partner agency. -Another consideration that is important at this stage is proper documentation and cataloging of data. +Another important consideration at this stage is +proper documentation and cataloging of data and associated metadata. It is not always clear what pieces of information jointly constitute a ``dataset'', and many of the sources you receive data from will not be organized for research. To help you keep organized and to put some structure on the materials you will be receiving, @@ -180,6 +178,15 @@ \subsection{Receiving data from development partners} and it is not possible to keep track of these kinds of information as data over time. Eventually, you will want to make sure that you are creating a collection or object that can be properly submitted to a data catalog and given a reference and citation. +The metadata - documentation about the data - is critical for future use of the data. +Metadata should include documentation of how the data was created, +what they measure, and how they are to be used. +In the case of survey data, this includes the survey instrument and associated manuals; +the sampling protocols and field adherence to those protocols, and any sampling weights; +what variable(s) uniquely identify the dataset(s), and how different datasets can be linked; +and a description of field procedures and quality controls. 
+We use as a standard the Data Documentation Initiative (DDI), which is supported by the
+World Bank's Microdata Catalog.\sidenote{\url{microdata.worldbank.org}}
 
 As soon as the requisite pieces of information are stored together,
 think about which ones are the components of what you would call a dataset.
 
From bdbdd96e39c55ead3daed26257aca487dedd2778 Mon Sep 17 00:00:00 2001
From: Maria 
Date: Fri, 21 Feb 2020 13:15:57 -0500
Subject: [PATCH 767/854] small fixes to compile

---
 chapters/handling-data.tex | 8 ++++----
 chapters/planning-data-work.tex | 2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex
index 782890e62..11615f766 100644
--- a/chapters/handling-data.tex
+++ b/chapters/handling-data.tex
@@ -310,14 +310,14 @@ \subsection{Obtaining ethical approval and consent}
 before it happens, as it is happening, or after it has already happened.
 It also means that they must explicitly and affirmatively consent to the collection, storage, and use of their information for any purpose.
-Therefore, the development of appropriate consent processes is of primary importance.\sidenote{
-  url\https://www.icpsr.umich.edu/icpsrweb/content/datamanagement/confidentiality/conf-language.html}}
+Therefore, the development of appropriate consent processes is of primary importance.\sidenote{\url
+	{https://www.icpsr.umich.edu/icpsrweb/content/datamanagement/confidentiality/conf-language.html}}
 All survey instruments must include a module in which the sampled respondent grants informed consent to participate.
 Research participants must be informed of the purpose of the research,
 what their participation will entail in terms of duration and any procedures,
 any foreseeable benefits or risks,
-and how their identity will be protected.\sidenote{
-  \url{https://www.icpsr.umich.edu/icpsrweb/content/datamanagement/confidentiality/conf-language.html}}
+and how their identity will be protected.\sidenote{\url
+  {https://www.icpsr.umich.edu/icpsrweb/content/datamanagement/confidentiality/conf-language.html}}
 There are special additional protections in place for vulnerable populations,
 such as minors, prisoners, and people with disabilities,
 and these should be confirmed with relevant authorities if your research includes them.
diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex
index 89edf6c5c..510c59149 100644
--- a/chapters/planning-data-work.tex
+++ b/chapters/planning-data-work.tex
@@ -441,7 +441,7 @@ \subsection{Organizing files and folder structures}
 but you should make sure to name them in an orderly fashion that works for your team.
These rules will ensure you can find files within folders
 and reduce the amount of time others will spend opening files
-to find out what is inside them.\sidenote{\url\{https://dimewiki.worldbank.org/wiki/Naming_Conventions}}
+to find out what is inside them.\sidenote{\url{https://dimewiki.worldbank.org/wiki/Naming_Conventions}}
 
 
 % ----------------------------------------------------------------------------------------------
 
From 3b6d44e4fdbbc5a0ad579df3ba3a2e9dce11572f Mon Sep 17 00:00:00 2001
From: Maria 
Date: Fri, 21 Feb 2020 14:08:38 -0500
Subject: [PATCH 768/854] Update publication.tex

Updates to intro
---
 chapters/publication.tex | 57 +++++++++++++++++++---------------------
 1 file changed, 27 insertions(+), 30 deletions(-)

diff --git a/chapters/publication.tex b/chapters/publication.tex
index 32993eb11..24c2ec40a 100644
--- a/chapters/publication.tex
+++ b/chapters/publication.tex
@@ -3,38 +3,32 @@
 \begin{fullwidth}
 For most research projects, completing a manuscript is not the end of the task.
 Academic journals increasingly require submission of a replication package
-which contains the code and materials needed to create the results.
+which contains the code and materials used to create the results.
 These represent an intellectual contribution in their own right,
 because they enable others to learn from your process
 and better understand the results you have obtained.
-Typically, various contributors collaborate on both code and writing,
-manuscripts go through many iterations and revisions,
-and the final package for publication includes not just a manuscript
-but also the code and data used to generate the results.
-Ideally, your team will spend as little time as possible
-fussing with the technical requirements of publication.
+Holding code and data to the same standards as written work
+is a new practice for many researchers.
+Publication typically involves multiple iterations of manuscript,
+code, and data files, with inputs from multiple collaborators.
+This process can quickly become unwieldy.
 It is in nobody's interest for a skilled and busy researcher
 to spend days re-numbering references (and it can take days)
 when a small amount of up-front effort can automate the task.
-In this section we suggest several methods --
-collectively referred to as dynamic documents --
-for managing the process of collaboration on any technical product.
-Holding code and data to the same standards a written work
-is a new practice for many researchers.
-In this chapter, we provide guidelines that will help you
-prepare a functioning and informative replication package.
-Ideally, if you have organized your analytical work
-according to the general principles outlined throughout this book,
-then preparing to release materials will not require
+In this chapter, we suggest tools and workflows for efficiently managing collaboration
+and ensuring reproducible outputs.
+First, we discuss how to use dynamic documents to collaborate on technical writing.
+Second, we provide guidelines for preparing a functioning and informative replication package.
+If you have organized your analytical work
+according to the general principles outlined in earlier chapters,
+preparing to release materials will not require
 substantial reorganization of the work you have already done.
 Hence, this step represents the conclusion of the system of transparent,
 reproducible, and credible research we introduced
 from the very first chapter of this book.
-We start the chapter with a discussion about tools and workflows for collaborating on technical writing. -Next, we turn to publishing data, -noting that the data can itself be a significant contribution in addition to analytical results. -Finally, we provide guidelines that will help you to prepare a functioning and informative replication package. +We include specific guidance on publishing both code and data files, +noting that these can be a significant contribution in addition to analytical results. In all cases, we note that technology is rapidly evolving and that the specific tools noted here may not remain cutting-edge, but the core principles involved in publication and transparency will endure. @@ -44,19 +38,22 @@ \section{Collaborating on technical writing} -It is increasingly rare that a single author will prepare an entire manuscript alone. -More often than not, documents will pass back and forth between several writers +Development economics research is increasingly a collaborative effort. +This reflects changes in the economics discipline overall: +the number of sole-authored papers is decreasing, +and the majority of recent papers in top journals have three or more +authors.\sidenote{\url{https://voxeu.org/article/growth-multi-authored-journal-articles-economics}} +As a consequence, manuscripts typically pass back and forth between several writers before they are ready for publication, -so it is essential to use technology and workflows that avoid conflicts. Just as with the preparation of analytical outputs, -this means adopting tools and practices that enable tasks -such as version control and simultaneous contribution. -Furthermore, it means preparing documents that are \textbf{dynamic} -- -meaning that updates to the analytical outputs that constitute them +effective collaboration requires the adoption of tools and practices +that enable version control and simultaneous contribution. +\textbf{Dynamic documents} are a way to significantly simplify workflows: +updates to the analytical outputs that constitute them can be passed on to the final output with a single process, rather than copy-and-pasted or otherwise handled individually. -Thinking of the writing process in this way -is intended to improve organization and reduce error, +Managing the writing process in this way +improves organization and reduces error, such that there is no risk of materials being compiled with out-of-date results, or of completed work being lost or redundant. From b7cd0da86da37dda64097843ae075684074ffa85 Mon Sep 17 00:00:00 2001 From: Maria Date: Fri, 21 Feb 2020 15:12:36 -0500 Subject: [PATCH 769/854] Update publication.tex updates to dynamic docs and data publication --- chapters/publication.tex | 187 +++++++++++++++++++-------------------- 1 file changed, 92 insertions(+), 95 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 24c2ec40a..9a4aab17a 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -81,41 +81,39 @@ \subsection{Preparing dynamic documents} that a mistake will be made or something will be missed. Therefore this is a broadly unsuitable way to prepare technical documents. -There are a number of tools that can be used for dynamic documents. 
-Some are code-based tools such as R's RMarkdown\sidenote{\url{https://rmarkdown.rstudio.com}} +However, the most widely utilized software +for dynamically managing both text and results is \LaTeX\ (pronounced ``lah-tek'').\sidenote{ + \url{https://github.com/worldbank/DIME-LaTeX-Templates}} +\index{\LaTeX} +\LaTeX\ is a document preparation and typesetting system with a unique syntax. +While this tool has a significant learning curve, +its enormous flexibility in terms of operation, collaboration, output formatting, and styling +makes it the primary choice for most large technical outputs. +In fact, \LaTeX\ operates behind-the-scenes in many other dynamic document tools (discussed below). +Therefore, we recommend that you learn to use \LaTeX\ directly +as soon as you are able to, and provide several resources for doing so in the next section. + +There are also code-based tools that can be used for dynamic documents, +such as R's RMarkdown\sidenote{\url{https://rmarkdown.rstudio.com}} and Stata's \texttt{dyndoc}.\sidenote{\url{https://www.stata.com/manuals/rptdyndoc.pdf}} These tools ``knit'' or ``weave'' text and code together, and are programmed to insert code outputs in pre-specified locations. Documents called ``notebooks'' (such as Jupyter\sidenote{\url{https://jupyter.org}}) work similarly, as they also use the underlying analytical software to create the document. -These types of dynamic documents are usually appropriate for short or informal materials +These tools are usually appropriate for short or informal documents because they tend to offer restricted editability outside the base software -and often have limited abilities to incorporate precision formatting. +and often have limited abilities to incorporate precise formatting. -There are other dynamic document tools -which do not require direct operation of the underlying code or software, +There are also simple tools for dynamic documents +that do not require direct operation of the underlying code or software, simply access to the updated outputs. -These can be useful for working on informal outputs, such as blogposts, -with collaborators who do not code. An example of this is Dropbox Paper, a free online writing tool that allows linkages to files in Dropbox which are automatically updated anytime the file is replaced. +They have limited functionality in terms of version control and formatting, +but can be useful for working on informal outputs, such as blogposts, +with collaborators who do not code. -However, the most widely utilized software -for dynamically managing both text and results is \LaTeX\ (pronounced ``lah-tek'').\sidenote{ - \url{https://github.com/worldbank/DIME-LaTeX-Templates}} - \index{\LaTeX} -Rather than using a coding language that is built for another purpose -or trying to hide the code entirely, -\LaTeX\ is a document preparation and typesetting system with a unique syntax. -While this tool has a significant learning curve, -its enormous flexibility in terms of operation, collaboration, -and output formatting and styling -makes it the primary choice for most large technical outputs today, -and it has proven to have enduring popularity. -In fact, \LaTeX\ operates behind-the-scenes in many of the tools listed before. -Therefore, we recommend that you learn to use \LaTeX\ directly -as soon as you are able to and provide several resources for doing so.
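As an illustration of the ``knit''/``weave'' workflow described above, the sketch below is a minimal Stata dynamic document, compiled with \texttt{dyndoc report.txt, replace}; the file name, example dataset, and displayed statistic are placeholders rather than a recommended template.

\begin{verbatim}
<<dd_do: quietly>>
* Any Stata code can run here; its results feed the text below
sysuse auto, clear
summarize price
<</dd_do>>

The average vehicle price in this example dataset is
<<dd_display: %9.0f r(mean)>> US dollars.
\end{verbatim}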
\subsection{Technical writing with \LaTeX} @@ -166,7 +164,8 @@ \subsection{Technical writing with \LaTeX} With these tools, you can ensure that references are handled in a format you can manage and control.\cite{flom2005latex} -Finally, \LaTeX\ has one more useful trick: + +\LaTeX\ has one more useful trick: using \textbf{\texttt{pandoc}},\sidenote{ \url{https://pandoc.org}} you can translate the raw document into Word @@ -202,9 +201,9 @@ \subsection{Technical writing with \LaTeX} with \LaTeX\ before adopting one of these tools. They can require a lot of troubleshooting at a basic level at first, and non-technical staff may not be willing or able to acquire the required knowledge. -Therefore, to take advantage of the features of \LaTeX, -while making it easy and accessible to the entire writing team, -we need to abstract away from the technical details where possible. +Cloud-based implementations of \LaTeX\, discussed in the next section, +allow teams to take advantage of the features of \LaTeX, +without requiring knowledge of the technical details. \subsection{Getting started with \LaTeX\ in the cloud} but the control it offers over the writing process is invaluable. In order to make it as easy as possible for your team to use \LaTeX\ without all members having to invest in new skills, -we suggest using a web-based implementation as your first foray into \LaTeX\ writing. +we suggest using a cloud-based implementation as your first foray into \LaTeX\ writing. Most such sites offer a subscription feature with useful extensions and various sharing permissions, and some offer free-to-use versions with basic tools that are sufficient for a broad variety of applications, up to and including writing a complete academic paper with coauthors. -Cloud-based implementations of \LaTeX\ are suggested here for several reasons. -Since they are completely hosted online, -they avoids the inevitable troubleshooting of setting up a \LaTeX\ installation -on various personal computers run by the different members of your team. -They also typically maintain a single continuously synced master copy of the document +Cloud-based implementations of \LaTeX\ have several advantageous features. +First, since they are completely hosted online, +they avoid the inevitable troubleshooting required to set up a \LaTeX\ installation +on various personal computers run by the different members of a team. +Second, they typically maintain a single, continuously synced, master copy of the document so that different writers do not create conflicted or out-of-sync copies, or need to deal with Git themselves to maintain that sync. -They typically allow inviting collaborators to edit in a fashion similar to Google Docs, +Third, they typically allow collaborators to edit in a fashion similar to Google Docs, though different services vary the number of collaborators and documents allowed at each tier. -Most importantly, some tools provide a ``rich text'' editor +Fourth, and most usefully, some implementations provide a ``rich text'' editor that behaves pretty similarly to familiar tools like Word, so that collaborators can write text directly into the document without worrying too much about the underlying \LaTeX\ coding. so it is easy to start up a project and see results right away without needing to know a lot of the code that controls document formatting.
-On the downside, there is a small amount of up-front learning required, +Cloud-based implementations of \LaTeX\ also have disadvantages. +There is a small amount of up-front learning required, continous access to the Internet is necessary, and updating figures and tables requires a bulk file upload that is tough to automate. -One of the most common issues you will face using online editors will be special characters +A common problem you will face using online editors is special characters which, because of code functions, need to be handled differently than in Word. Most critically, the ampersand (\texttt{\&}), percent (\texttt{\%}), and underscore (\texttt{\_}) need to be ``escaped'' (interpreted as text and not code) in order to render. @@ -248,6 +248,7 @@ \subsection{Getting started with \LaTeX\ in the cloud} cloud-based implementations are often the easiest way to allow coauthors to write and edit in \LaTeX\, so long as you make sure you are available to troubleshoot minor issues like these. + %---------------------------------------------------- \section{Preparing a complete replication package} @@ -265,33 +266,30 @@ \section{Preparing a complete replication package} if not, paring it down to the ``replication package'' may take some time. A complete replication package should accomplish several core functions. It must provide the exact data and code that is used for a paper, -all necessary de-identified data for the analysis, +all necessary (de-identified) data for the analysis, and all code necessary for the analysis. -The code should exactly reproduce the raw outputs you have used for the paper, -and should not include documentation or data you would not share publicly. +The code and data should exactly reproduce the raw outputs you have used for the paper, +and the replication file should not include any documentation or data you would not share publicly. This usually means removing project-related documentation such as contracts and details of data collection and other field work, and double-checking all datasets for potentially identifying information. \subsection{Publishing data for replication} -Enabling permanent access to the data used in your study -is an important contribution you can make along with the publication of results. -It allows other researchers to validate the mechanical construction of your results, -to investigate what other results might be obtained from the same population, -and test alternative approaches or other questions. -Therefore you should make clear in your study -where and how data are stored, and how and under what circumstances they may be accessed. -You do not always have to publish the data yourself, -and in some cases you are legally not allowed to. +Publicly documenting all original data generated as part of a research project +is an important contribution in its own right. +Your paper should clearly cite the data used, +where and how it is stored, and how and under what circumstances it may be accessed. +You may not be able to publish the data itself, +due to licensing agreements or ethical concerns. Even if you cannot release data immediately or publicly, -there are often options to catalog or archive the data without open publication. +there are often options to catalog or archive the data. These may take the form of metadata catalogs or embargoed releases. 
Such setups allow you to hold an archival version of your data which your publication can reference, -as well as provide information about the contents of the datasets +and provide information about the contents of the datasets and how future users might request permission to access them -(even if you are not the person who can grant that permission). +(even if you are not the person to grant that permission). They can also provide for timed future releases of datasets once the need for exclusive access has ended. @@ -299,16 +297,17 @@ \subsection{Publishing data for replication} releasing the cleaned dataset is a significant contribution that can be made in addition to any publication of analysis results.\sidenote{ \url{https://www.povertyactionlab.org/sites/default/files/resources/J-PAL-guide-to-publishing-research-data.pdf}} -Publishing data can foster collaboration with researchers -interested in the same subjects as your team. -Collaboration can enable your team to fully explore variables and -questions that you may not have time to focus on otherwise, -even though data was collected on them. +It allows other researchers to validate the mechanical construction of your results, +to investigate what other results might be obtained from the same population, +and test alternative approaches or answer other questions. +This fosters collaboration and may enable your team to fully explore variables and +questions that you may not have time to focus on otherwise. There are different options for data publication. The World Bank's Development Data Hub\sidenote{ - \url{https://data.worldbank.org}} + \url{https://datacatalog.worldbank.org}} includes a Microdata Catalog\sidenote{ \url{https://microdata.worldbank.org}} +and a Geospatial Catalog, where researchers can publish data and documentation for their projects.\sidenote{ \url{https://dimewiki.worldbank.org/Microdata\_Catalog} \newline } The Harvard Dataverse\sidenote{ \url{https://dataverse.harvard.edu}} -publishes both data and code, -and also creates a data citation for its entries -- -IPA/J-PAL field experiment repository is especially relevant\sidenote{ - \url{https://www.povertyactionlab.org/blog/9-11-19/new-hub-data-randomized-evaluations}} -for those interested in impact evaluation. +publishes both data and code. +The Datahub for Field Experiments in Economics and Public Policy\sidenote{\url{https://dataverse.harvard.edu/dataverse/DFEEP}} +is especially relevant for impact evaluations. +Both the World Bank Microdata Catalog and the Harvard Dataverse +create data citations for deposited entries. -What matters is for you to be able to cite or otherwise directly reference the data used. When your raw data is owned by someone else, or for any other reason you are not able to publish it, -in many cases you will have the right to release -at least some subset of your constructed data set, -even if it is just the derived indicators you constructed and their documentation.\sidenote{ +in many cases you will still have the right to release derivative datasets, +even if it is just the indicators you constructed and their documentation.\sidenote{ \url{https://guide-for-data-archivists.readthedocs.io}} If you have questions about your rights over original or derived materials, check with the legal team at your organization or at the data provider's.
Make sure you have a clear understanding of the rights associated with the data release and communicate them to any future users of the data. -You must provide a license with any data release.\sidenote{ + +When you do publish data, you decide how it may be used and what license, if any, you will assign to it.\sidenote{ \url{https://iatistandard.org/en/guidance/preparing-organisation/organisation-data-publication/how-to-license-your-data}} -Some common license types are documented at the World Bank Data Catalog\sidenote{ - \url{https://datacatalog.worldbank.org/public-licenses}} -and the World Bank Open Data Policy has futher examples of licenses that are used there.\sidenote{ +Terms of use available in the World Bank Microdata Catalog include, in order of increasing restrictiveness: open access, direct access, and licensed access.\sidenote{ \url{https://microdata.worldbank.org/index.php/terms-of-use}} -This document need not be extremely detailed, -but it should clearly communicate to the reader what they are allowed to do with your data and -how credit should be given and to whom in further work that uses it. +Open Access data is freely available to anyone, and simply requires attribution. +Direct Access data is available to registered users who agree to use the data for statistical and scientific research purposes only, +to cite the data appropriately, and to not attempt to identify respondents or data providers or link to other datasets that could allow for re-identification. +Licensed Access data is restricted to bona fide users, who submit a documented application for how they will use the data and sign an agreement governing data use. +The user must be acting on behalf of an organization, which will be held responsible in the case of any misconduct. Keep in mind that you may or may not own your data, depending on how it was collected, -and the best time to resolve any questions about these rights -is at the time that data collection or transfer agreements are signed. +and the best time to resolve any questions about licensing rights +is at the time that data collection or sharing agreements are signed. -Data publication should release the dataset in a widely recognized format. +Published data should be released in a widely recognized format. While software-specific datasets are acceptable accompaniments to the code (since those precise materials are probably necessary), you should also consider releasing generic datasets such as CSV files with accompanying codebooks, -since these will be re-adaptable by any researcher. +since these can be used by any researcher. Additionally, you should also release the data collection instrument or survey questionnaire so that readers can understand which data components are @@ -365,18 +363,7 @@ \subsection{Publishing data for replication} particularly where definitions may vary, so that others can learn from your work and adapt it as they like. -As in the case of raw primary data, -final analysis data sets that will become public for the purpose of replication -must also be fully de-identified. -In cases where PII data is required for analysis, -we recommend embargoing the sensitive variables when publishing the data. -You should contact an appropriate data catalog -to determine what privacy and licensing options are available. -Access to the embargoed data could be granted for the purposes of study replication, -if approved by an IRB. - -There will almost always be a trade-off between accuracy and privacy.
-For publicly disclosed data, you should favor privacy. +\subsection{De-identifying data for publication} Therefore, before publishing data, you should carefully perform a \textbf{final de-identification}. Its objective is to create a dataset for publication @@ -404,17 +391,27 @@ \subsection{Publishing data for replication} The \texttt{sdcMicro} tool, in particular, has a feature that allows you to assess the uniqueness of your data observations, and simple measures of the identifiability of records from that. -Additional options to protect privacy in data that will become public exist, -and you should expect and intend to release your datasets at some point. -One option is to add noise to data, as the US Census Bureau has proposed,\cite{abowd2018us} -as it makes the trade-off between data accuracy and privacy explicit. -But there are no established norms for such ``differential privacy'' approaches: + +There will almost always be a trade-off between accuracy and privacy. +For publicly disclosed data, you should favor privacy. +Stripping identifying variables from a dataset may not be sufficient to protect respondent privacy, +due to the risk of re-identification. +One potential solution is to add noise to data, as the US Census Bureau has proposed.\cite{abowd2018us} +This makes the trade-off between data accuracy and privacy explicit. +But there are not, as of yet, established norms for such ``differential privacy'' approaches: most approaches fundamentally rely on judging ``how harmful'' information disclosure would be. The fact remains that there is always a balance between information release (and therefore transparency) and privacy protection, and that you should engage with it actively and explicitly. The best thing you can do is make a complete record of the steps that have been taken so that the process can be reviewed, revised, and updated as necessary. +In cases where PII data is required for analysis, +we recommend embargoing the sensitive variables when publishing the data. +Access to the embargoed data could be granted for specific purposes, +such as a computational reproducibility check required for publication, +if done under careful data security protocols and approved by an IRB. + + \subsection{Publishing code for replication} Before publishing your code, you should edit it for content and clarity From e66992739daf503ca8313c42ffff649b8334e310 Mon Sep 17 00:00:00 2001 From: Maria Date: Fri, 21 Feb 2020 15:36:24 -0500 Subject: [PATCH 770/854] Update publication.tex final edits to ch7 --- chapters/publication.tex | 68 ++++++++++++++++++++-------------------- 1 file changed, 34 insertions(+), 34 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 9a4aab17a..bd8c0461f 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -419,34 +419,33 @@ \subsection{Publishing code for replication} The purpose of releasing code is to allow others to understand exactly what you have done in order to obtain your results, as well as to apply similar methods in future projects. -Therefore it should both be functional and readable. -If you've followed the recommendations in this book, -this will be much easier to do. +Therefore it should both be functional and readable +(if you've followed the recommendations in Chapter Six, +this should be easy to do!). Code is often not written this way when it is first prepared, so it is important for you to review the content and organization so that a new reader can figure out what and how your code should do. 
-Therefore, whereas your data should already be very clean at this stage, -your code is much less likely to be so, and this is where you need to make -time investments prior to releasing your replication package. -By contrast, replication code usually has few legal and privacy constraints. -In most cases code will not contain identifying information; -but make sure to check carefully that it does not. +Therefore, whereas your data should already be very clean by publication stage, +your code is much less likely to be so. +This is often where you need to invest time prior to releasing your replication package. + +Unlike data, code usually has few legal and privacy constraints to publication. +The research team owns the code in almost all cases, +and code is unlikely to contain identifying information +(though you must check carefully that it does not). Publishing code also requires assigning a license to it; in a majority of cases, code publishers like GitHub offer extremely permissive licensing options by default. (If you do not provide a license, nobody can use your code!) -Make sure the code functions identically on a fresh install of your chosen software. +Before releasing the code, make sure it functions identically on a fresh install of your chosen software. A new user should have no problem getting the code to execute perfectly. In either a scripts folder or in the root directory, -include a master script (dofile or R script for example). -The master script should allow the reviewer -to change a single line of code: the one setting the directory path. -After that, running the master script should run the entire project -and re-create all the raw outputs exactly as supplied. -Indicate the filename and line to change. -Check that all your code will run completely on a new computer: -Install any required user-written commands in the master script +you should include a master script that allows the reviewer to run the entire project +and re-create all raw outputs by changing only a single line of code: +the one setting the directory path. +To ensure that your code will run completely on a new computer, +you must install any required user-written commands in the master script (for example, in Stata using \texttt{ssc install} or \texttt{net install} and in R include code giving users the option to install packages, including selecting a specific version of the package if necessary). for any user-installed packages that are needed to ensure forward-compatibility. Make sure system settings like \texttt{version}, \texttt{matsize}, and \texttt{varabbrev} are set. -Finally, make sure that the code and its inputs and outputs are clearly identified. -A new user should, for example, be able to easily identify and remove +Finally, make sure that code inputs and outputs are clearly identified. +A new user should, for example, be able to easily find and remove any files created by the code so that they can be recreated quickly. They should also be able to quickly map all the outputs of the code to the locations where they are placed in the associated published material, Documentation in the master script is often used to indicate this information. For example, outputs should clearly correspond by name to an exhibit in the paper, and vice versa. (Supplying a compiling \LaTeX\ document can support this.) -Code and outputs which are not used should be removed.
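A minimal sketch of the kind of master-script header described above, for a Stata-based replication package (the path, package list, and settings are illustrative, not a complete template):

\begin{verbatim}
* master.do -- runs the full replication package
* The only line a reviewer should need to change:
global packagefolder "C:/Users/reviewer/replication-package"

* Reproducibility settings
version 13.1        // interpret all code under a fixed Stata version
set varabbrev off   // disallow ambiguous variable abbreviations
set matsize 800     // ensure matrix limits are sufficient

* Install required user-written commands
ssc install estout,  replace
ssc install winsor2, replace

* Re-create all raw outputs
do "$packagefolder/dofiles/cleaning.do"
do "$packagefolder/dofiles/construction.do"
do "$packagefolder/dofiles/analysis.do"
\end{verbatim}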
+Code and outputs which are not used should be removed before publication. \subsection{Releasing a replication package} -If you are at this stage, +Once your data and code are polished for public release, all you need to do is find a place to publish your materials. This is slightly easier said than done, as there are a few variables to take into consideration over the next few years; the specific solutions we mention here highlight some current approaches as well as their strengths and weaknesses. -GitHub provides one solution. -Making a GitHub repository public is completely free. +One option is GitHub. +Making a public GitHub repository is completely free. It can hold any file types, provide a structured download of your whole project, and allow others to look at alternate versions or histories easily. (However, there is a strict size restriction of 100MB per file and a restriction on the size of the repository as a whole, so larger projects will need alternative solutions.) - However, GitHub is not ideal for other reasons. It is not built to hold data in an efficient way -or to manage licenses or citations for datasets. +or to manage licenses. It does not provide a true archive service -- you can change or remove the contents at any time. -A repository such as the Harvard Dataverse\sidenote{ +It does not assign a permanent digital object identifier (DOI) link for your work. + +Another option is the Harvard Dataverse,\sidenote{ \url{https://dataverse.harvard.edu}} -addresses these issues, as it is designed to be a citable data repository; -the IPA/J-PAL field experiment repository is especially relevant.\sidenote{ - \url{https://www.povertyactionlab.org/blog/9-11-19/new-hub-data-randomized-evaluations}} +which is designed to be a citable data repository. The Open Science Framework\sidenote{ \url{https://osf.io}} -can also hold both code and data, -as can ResearchGate.\sidenote{ \url{https://https://www.researchgate.net}} -Some of these will also assign a permanent digital object identifier (DOI) link for your work. +and ResearchGate.\sidenote{ \url{https://www.researchgate.net}} +can also hold both code and data. + Any of these locations is acceptable -- the main requirement is that the system can handle the structured directory that you are submitting, and report exactly what, if any, modifications you have made since initial publication. You can even combine more than one tool if you prefer, as long as they clearly point to each other. +For example, one could publish code on GitHub that points to data published on the World Bank Microdata catalog. + Emerging technologies such as the ``containerization'' approach of Docker or CodeOcean\sidenote{ \url{https://codeocean.com}} offer to store both code and data, sharing the preprint through. Do not use Dropbox or Google Drive for this purpose: many organizations do not allow access to these tools, -and that includes blocking staff from accessing your material. +and staff may be blocked from accessing your material.
From 792a0618943031a8552378950efe74472207bc3d Mon Sep 17 00:00:00 2001 From: Maria Date: Mon, 24 Feb 2020 10:19:54 -0500 Subject: [PATCH 771/854] Update data-analysis.tex Edits to intro, data management, de-identification --- chapters/data-analysis.tex | 141 +++++++++++++++++++------------------ 1 file changed, 71 insertions(+), 70 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index d868afaee..0e8e47602 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -45,36 +45,34 @@ \section{Managing data effectively} The goal of data management is to organize the components of data work so the complete process can be traced, understood, and revised without massive effort. -In our experience, there are four key elements to good data management: +We focus on four key elements to good data management: folder structure, task breakdown, master scripts, and version control. -A good folder structure organizes files so that any material can be found when needed. -It reflects a task breakdown into steps with well-defined inputs, tasks, and outputs. -A master script connects folder structure and code. +A good \textbf{folder structure} organizes files so that any material can be found when needed. +It reflects a \textbf{task breakdown} into steps with well-defined inputs, tasks, and outputs. +A \textbf{master script} connects folder structure and code. It is a one-file summary of your whole project. -Finally, version histories and backups enable the team -to edit files without fear of losing information. -Smart use of version control allows you to track -how each edit affects other files in the project. +Finally, \textbf{version control} gives you clear file histories and backups, +which enable the team to edit files without fear of losing information +and track how each edit affects other files in the project. \subsection{Organizing your folder structure} There are many ways to organize research data. -Our preferred scheme reflects the task breakdown that will be outlined in this chapter. \index{data organization} Our team at DIME Analytics developed the \texttt{iefolder}\sidenote{ \url{https://dimewiki.worldbank.org/iefolder}} command (part of \texttt{ietoolkit}\sidenote{ \url{https://dimewiki.worldbank.org/ietoolkit}}) -to automatize the creation of a folder following this scheme and +to automatize the creation of folders following our preferred scheme and to standardize folder structures across teams and projects. -Standardizing folders greatly reduces the costs that PIs and RAs face when switching between projects, -because they are all organized in exactly the same way +A standardized structure greatly reduces the costs that PIs and RAs face when switching between projects, +because folders are organized in exactly the same way and use the same filepaths, shortcuts, and macro references.\sidenote{ \url{https://dimewiki.worldbank.org/DataWork\_Folder}} We created \texttt{iefolder} based on our experience with primary data, -but it can be used for different types of data, -and adapted to fit different needs. -No matter what your team's preferences in terms of folder organization are, +but it can be used for other types of data. +Other teams may prefer a different scheme, but the principle of creating a single unified standard remains.
At the top level of the structure created by \texttt{iefolder} are what we call ``round'' folders.\sidenote{ \url{https://dimewiki.worldbank.org/DataWork\_Survey\_Round}} \subsection{Breaking down tasks} -We divide the data work process that starts from the raw data -and builds on it to create final analysis outputs into four stages: +We divide the process of transforming raw datasets to analysis-ready datasets into four steps: de-identification, data cleaning, variable construction, and data analysis. -Though they are frequently implemented at the same time, -we find that creating separate scripts and data sets prevents mistakes. +Though they are frequently implemented concurrently, +creating separate scripts and data sets prevents mistakes. It will be easier to understand this division as we discuss what each stage comprises. What you should know for now is that each of these stages has well-defined inputs and outputs. This makes it easier to track tasks across scripts, So, for example, a script called \texttt{section-1-cleaning} would create a data set called \texttt{section-1-clean}. -The division of a project in stages also helps the review workflow inside your team. -The code, data and outputs of each of these stages should go through at least one round of code review. -During the code review process, team members should read and run each other's codes. -Doing this at the end of each stage helps prevent the amount of work to be reviewed to become too overwhelming. +The division of a project in stages also facilitates a review workflow inside your team. +The code, data and outputs of each of these stages should go through at least one round of code review, +in which team members read and run each other's codes. +Reviewing code at each stage, rather than waiting until the end of a project, +is preferable as the amount of code to review is more manageable and +it allows you to correct errors in real-time (e.g. correcting errors in variable construction before analysis begins). Code review is a common quality assurance practice among data scientists. It helps to keep the quality of the outputs high, and is also a great way to learn and improve your own code. \subsection{Writing master scripts} Master scripts allow users to execute all the project code from a single file. -They briefly describe what each code does, -and map the files they require and create. -They also connect code and folder structure through macros or objects. +As discussed in Chapter 2, the master script should briefly describe what each +section of the code does, and map the files they require and create. +The master script also connects code and folder structure through macros or objects. In short, a master script is a human-readable map of the tasks, files, and folder structure that comprise a project. Having a master script eliminates the need for complex instructions to replicate results. \subsection{Implementing version control} \section{De-identifying research data} -The starting point for all tasks described in this chapter is the raw data -which should contain only information that is received directly from the field. +The starting point for all tasks described in this chapter is the raw dataset, +which should contain the exact data received, with no changes or additions. The raw data will invariably come in a variety of file formats and these files should be saved in the raw data folder \textit{exactly as they were -received}.
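For example, a project master script might encode this folder structure and task breakdown along the following lines (folder and file names are illustrative and only loosely follow the round-folder scheme, not the exact \texttt{iefolder} output):

\begin{verbatim}
* project-master.do -- one-file map of the project
global dropbox  "C:/Users/username/Dropbox/project-name"
global datawork "$dropbox/DataWork"
global baseline "$datawork/Baseline"

* Each stage has well-defined inputs and outputs
do "$baseline/Dofiles/deidentification.do"  // strips direct identifiers
do "$baseline/Dofiles/cleaning.do"          // creates the cleaned data set
do "$baseline/Dofiles/construction.do"      // creates analysis indicators
do "$baseline/Dofiles/analysis.do"          // creates tables and figures
\end{verbatim}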
Be mindful of how and where they are stored as they can not be +received}. Be mindful of how and where they are stored as they cannot be re-created and nearly always contain confidential data such as -personally-identifying information\index{personally-identifying information}. +\textbf{personally-identifying information}\index{personally-identifying information}. As described in the previous chapter, confidential data must always be encrypted\sidenote{\url{https://dimewiki.worldbank.org/Encryption}} and be properly backed up since every other data file you will use is created from the @@ -174,17 +173,19 @@ \section{De-identifying research data} You will only keep working from the fixed copy, but you keep both copies in case you later realize that the manual fix was done incorrectly. -Loading encrypted data frequently can be disruptive to the workflow. -To facilitate the handling of the data, remove any personally identifiable information from the data set. -This will create a de-identified data set, that can be saved in a non-encrypted folder. -De-identification,\sidenote{\url{https://dimewiki.worldbank.org/De-identification}} -at this stage, means stripping the data set of direct identifiers.\sidenote{\url{ -https://www.povertyactionlab.org/sites/default/files/resources/J-PAL-guide-to-deidentifying-data.pdf}} -To be able to do so, you will need to go through your data set and -find all the variables that contain identifying information. -Flagging all potentially identifying variables in the questionnaire design stage +The first step in the transformation of raw data to an analysis-ready dataset is de-identification. +This simplifies workflows, as once you create a de-identified version of the dataset, +you no longer need to interact directly with the encrypted raw data. De-identification, at this stage, means stripping the data set of personally identifying information.\sidenote{ \url{https://dimewiki.worldbank.org/De-identification}} +To do so, you will need to identify all variables that contain +identifying information.\sidenote{\url{ https://www.povertyactionlab.org/sites/default/files/resources/J-PAL-guide-to-deidentifying-data.pdf}} +For primary data collection, where the research team designs the survey instrument, +flagging all potentially identifying variables in the questionnaire design stage simplifies the initial de-identification process. -If you did not do that or you received the raw data from someone else, that are a few tools that can help you with it. +If you did not do that, or you received original data by another means, +there are a few tools to help flag variables with personally-identifying data. JPAL's \texttt{PII scan}, as indicated by its name, scans variable names and labels for common string patterns associated with identifying information.\sidenote{ \url{https://github.com/J-PAL/PII-Scan}} \section{Cleaning data for analysis} -Data cleaning is the second stage in the transformation of data you received into data that you can analyze.\sidenote{\url{ -https://dimewiki.worldbank.org/Data\_Cleaning}} -The cleaning process involves (1) making the data set easily usable and understandable, +Data cleaning is the second stage in the transformation of raw data into data that you can analyze. +The cleaning process involves (1) making the data set easy to use and understand, and (2) documenting individual data points and patterns that may bias the analysis. The underlying data structure does not change.
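A minimal sketch of this initial de-identification step (the variable names and folder globals are hypothetical; the list of identifiers should come from your own questionnaire review or a PII scan):

\begin{verbatim}
* Load the encrypted raw data
use "$encrypted/baseline-raw.dta", clear

* Drop direct identifiers flagged during questionnaire design or a PII scan
drop respondent_name phone_number gps_latitude gps_longitude

* Save a de-identified copy that can live in a non-encrypted folder
save "$baseline/DataSets/Deidentified/baseline-deidentified.dta", replace
\end{verbatim}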
The cleaned data set should contain only the variables collected in the field. No modifications to data points are made at this stage, except for corrections of mistaken entries. Cleaning is probably the most time-consuming of the stages discussed in this chapter. -This is the time when you obtain an extensive understanding of the contents and structure of the data that was collected. -Explore your data set using tabulations, summaries, and descriptive plots. -You should use this time to understand the types of responses collected, both within each survey question and across respondents. +You need to acquire an extensive understanding of the contents and structure of the raw data. +Explore the data set using tabulations, summaries, and descriptive plots. Knowing your data set well will make it possible to do analysis. -\subsection{Correcting data entry errors} +\subsection{Identifying the identifier} -There are two main cases when the raw data will be modified during data cleaning. -The first one is when there are duplicated entries in the data. +The first step in the cleaning process is to understand the level of observation in the data (what makes a row), +and what variable or set of variables uniquely identifies each observation. Ensuring that observations are uniquely and fully identified\sidenote{\url{https://dimewiki.worldbank.org/ID\_Variable\_Properties}} is possibly the most important step in data cleaning. -Modern survey tools create unique observation identifiers. -That, however, is not the same as having a unique ID variable for each individual in the sample. +It may be the case that the variable expected to be the unique identifier in fact is either incomplete or contains duplicates. +This could be due to duplicate observations or errors in data entry. +It could also be the case that there is no identifying variable, or the identifier is a long string, such as a name. +In this case cleaning begins by carefully creating a numeric variable that uniquely identifies the data. +As discussed in the previous chapter, +checking for duplicated entries is usually part of data quality monitoring, +and is ideally addressed as soon as data is received. + +Note that while modern survey tools create unique observation identifiers, -That, however, is not the same as having a unique ID variable for each individual in the sample. +that is not the same as having a unique ID variable for each individual in the sample. You want to make sure the data set has a unique ID variable that can be cross-referenced with other records, such as the master data set\sidenote{\url{https://dimewiki.worldbank.org/Master\_Data\_Set}} and other rounds of data collection. @@ -262,30 +269,23 @@ \subsection{Correcting data entry errors} create an automated workflow to identify, correct and document occurrences of duplicate entries. -As discussed in the previous chapter, -looking for duplicated entries is usually part of data quality monitoring, -and is typically addressed as part of that process. -So, in practice, you will start writing data cleaning code during data collection. -The other only other case when changes to the original data points are made during cleaning -is also directly connected to data quality monitoring: -it's when you need to correct mistakes in data entry. -During data quality monitoring, you will inevitably encounter data entry mistakes, +\subsection{Preparing a clean dataset} +The main output of data cleaning is the cleaned data set. +It should contain the same information as the raw data set, +with identifying variables and data entry mistakes removed.
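A small sketch of these uniqueness checks in Stata (\texttt{hhid} is a hypothetical identifier; \texttt{isid} halts the script if the variable does not uniquely identify the data):

\begin{verbatim}
* Inspect duplicate values of the intended identifier
duplicates report hhid
duplicates tag hhid, generate(dup_hhid)
list hhid if dup_hhid > 0   // review and resolve these cases

* Once duplicates are resolved, enforce uniqueness
isid hhid
\end{verbatim}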
+When reviewing raw data, you will inevitably encounter data entry mistakes, such as typos and inconsistent values. These mistakes should be fixed in the cleaned data set, and you should keep a careful record of how they were identified, -and how the correct value was obtained. +and how the correct value was obtained.\sidenote{\url{ + https://dimewiki.worldbank.org/Data\_Cleaning}} -\subsection{Labeling, annotating, and finalizing clean data} - -The main output of data cleaning is the cleaned data set. -It should contain the same information as the raw data set, -with no changes to data points. -It should also be easily traced back to the survey instrument, -and be accompanied by a dictionary or codebook. +The clean dataset should always be accompanied by a dictionary or codebook. +Survey data should be easily traced back to the survey instrument. Typically, one cleaned data set will be created for each data source -or survey instrument. -Each row in the cleaned data set represents one survey entry or unit of -observation.\cite{tidy-data} +or survey instrument; and each row in the cleaned data set represents one +respondent or unit of observation.\cite{tidy-data} + If the raw data set is very large, or the survey instrument is very complex, you may want to break the data cleaning into sub-steps, and create intermediate cleaned data sets @@ -307,6 +307,7 @@ \subsection{Labeling, annotating, and finalizing clean data} This will help you organize your files and create a back up of the data, and some donors require that the data be filed as an intermediate step of the project. +\subsection{Labeling, annotating, and finalizing clean data} On average, making corrections to primary data is more time-consuming than when using secondary data. But you should always check for possible issues in any data you are about to use. The last step of data cleaning, however, will most likely be necessary no matter what type of data is involved. It consists of labeling and annotating the data, so that its users have all the information needed to interact with it. From 310b53a31c64e6552191a6fa3147a79113df1892 Mon Sep 17 00:00:00 2001 From: Maria Date: Mon, 24 Feb 2020 10:36:49 -0500 Subject: [PATCH 772/854] Update data-analysis.tex edits and re-structuring of data cleaning section, trying to reframe to apply to non-survey data --- chapters/data-analysis.tex | 97 ++++++++++++++++++-------------------- 1 file changed, 47 insertions(+), 50 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 0e8e47602..a889343e9 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -269,10 +269,40 @@ \subsection{Identifying the identifier} create an automated workflow to identify, correct and document occurrences of duplicate entries. +\subsection{Labeling, annotating, and finalizing clean data} + +The last step of data cleaning is to label and annotate the data, +so that all users have the information needed to interact with it. +There are three key steps: renaming, labeling and recoding. +This step is key to making the data easy to use, but it can be quite repetitive. +The \texttt{iecodebook} command suite, also part of \texttt{iefieldkit}, +is designed to make some of the most tedious components of this process easier.\sidenote{ + \url{https://dimewiki.worldbank.org/iecodebook}} +\index{iecodebook} + +First, \textbf{renaming}: for data with an accompanying survey instrument, +it is useful to keep the same variable names in the cleaned dataset as in the survey instrument. +That way it's straightforward to link variables to the relevant survey question.
+Second, \textbf{labeling}: applying labels makes it easier to understand your data as you explore it, +and thus reduces the risk of small errors making their way through into the analysis stage. +Variable and value labels should be accurate and concise.\sidenote{ + \url{https://dimewiki.worldbank.org/Data\_Cleaning\#Applying\_Labels}} + +Third, \textbf{recoding}: codes for ``Don't know'', ``Refused to answer'', and +other non-responses should be recoded into extended missing values.\sidenote{ + \url{https://dimewiki.worldbank.org/Data\_Cleaning\#Survey\_Codes\_and\_Missing\_Values}} +String variables that correspond to categorical variables should be encoded. +Open-ended responses stored as strings usually have a high risk of being identifiers, +so they should be encoded into categories as much as possible and raw data points dropped. +You can use the encrypted data as an input to a construction script +that categorizes these responses and merges them to the rest of the dataset. + \subsection{Preparing a clean dataset} The main output of data cleaning is the cleaned data set. It should contain the same information as the raw data set, with identifying variables and data entry mistakes removed. +Although primary data typically requires more extensive data cleaning than secondary data, +you should carefully explore possible issues in any data you are about to use. When reviewing raw data, you will inevitably encounter data entry mistakes, such as typos and inconsistent values. These mistakes should be fixed in the cleaned data set, @@ -295,71 +325,39 @@ \subsection{Preparing a clean dataset} This will make the cleaning faster and the data easier to handle during construction. But having a single cleaned data set will help you with sharing and publishing the data. +Finally, any additional information collected only for quality monitoring purposes, +such as notes and duration fields, can also be dropped. To make sure the cleaned data set file doesn't get too big to be handled, use commands such as \texttt{compress} in Stata to make sure the data is always stored in the most efficient format. + Once you have a cleaned, de-identified data set and the documentation to support it, -you have created the first data output of your project: -a publishable data set. +you have created the first data output of your project: a publishable data set. The next chapter will get into the details of data publication. -For now, all you need to know is that your team should consider submitting the data set for publication at this point, +For now, all you need to know is that your team should consider submitting this data set for publication, even if it will remain embargoed for some time. -This will help you organize your files and create a back up of the data, +This will help you organize your files and create a backup of the data, and some donors require that the data be filed as an intermediate step of the project. -\subsection{Labeling, annotating, and finalizing clean data} -On average, making corrections to primary data is more time-consuming than when using secondary data. -But you should always check for possible issues in any data you are about to use. -The last step of data cleaning, however, -will most likely be necessary no matter what type of data is involved. -It consists of labeling and annotating the data, -so that its users have all the information needed to interact with it. -The last step of data cleaning, however, will most likely still be necessary. 
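A short sketch of the renaming, labeling, and recoding steps described above (variable names and non-response codes are hypothetical; \texttt{iecodebook} can apply the same kinds of changes from a spreadsheet):

\begin{verbatim}
* Label a variable and its values
label variable inc_salary "Monthly salary income (local currency)"
label define yesno 0 "No" 1 "Yes"
label values school_enrolled yesno

* Recode survey non-response codes to extended missing values
replace inc_salary = .d if inc_salary == -999   // "Don't know"
replace inc_salary = .r if inc_salary == -888   // "Refused to answer"

* Encode a categorical string variable
encode occupation_str, generate(occupation)
\end{verbatim}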
It consists of labeling and annotating the data, so that its users have all the -information needed to interact with it. This is a key step to making the data easy to use, but it can be quite repetitive. The \texttt{iecodebook} command suite, also part of \texttt{iefieldkit}, -is designed to make some of the most tedious components of this process, -such as renaming, relabeling, and value labeling, much easier.\sidenote{ - \url{https://dimewiki.worldbank.org/iecodebook}} -\index{iecodebook} - -We have a few recommendations on how to use this command, -and how to approach data cleaning in general. -First, we suggest keeping the same variable names in the cleaned data set as in the survey instrument, -so it's straightforward to link data points for a variable to the question that originated them. -Second, don't skip the labeling. -Applying labels makes it easier to understand what the data mean as you explore it, -and thus reduces the risk of small errors making their way through into the analysis stage. -Variable and value labels should be accurate and concise.\sidenote{ - \url{https://dimewiki.worldbank.org/Data\_Cleaning\#Applying\_Labels}} -Applying labels makes it easier to understand what the data is showing while exploring the data. -This minimizes the risk of small errors making their way through into the analysis stage. -Variable and value labels should be accurate and concise.\sidenote{\url{https://dimewiki.worldbank.org/Data\_Cleaning\#Applying\_Labels}} -Third, recodes should be used to turn codes for ``Don't know'', ``Refused to answer'', and -other non-responses into extended missing values.\sidenote{\url{https://dimewiki.worldbank.org/Data\_Cleaning\#Survey\_Codes\_and\_Missing\_Values}} -String variables that correspond to categorical variables need to be encoded. -Open-ended responses stored as strings usually have a high risk of being identifiers, -so they should be dropped at this point. -You can use the encrypted data as an input to a construction script -that categorizes these responses and merges them to the rest of the dataset. -Finally, any additional information collected only for quality monitoring purposes, -such as notes and duration fields, can also be dropped. \subsection{Documenting data cleaning} -Throughout the data cleaning process, you will need inputs from the field, -including enumerator manuals, survey instruments, +Throughout the data cleaning process, +you will often need extensive inputs from the people responsible for data collection. +(This could be a survey team, the government ministry responsible for administrative data systems, +the technology firm that generated remote sensing data, etc.) +You should acquire and organize all documentation of how the data was generated, such as -data collection reports, field plans, data collection manuals, survey instruments, +reports from the data provider, field protocols, data collection manuals, survey instruments, supervisor notes, and data quality monitoring reports. These materials are essential for data documentation.\sidenote{ \url{https://dimewiki.worldbank.org/Data\_Documentation}} \index{Documentation} They should be stored in the corresponding \texttt{Documentation} folder for easy access, as you will probably need them during analysis, -and they must be made available for publication. +and should be published along with the data. + Include in the \texttt{Documentation} folder records of any corrections made to the data, including to duplicated entries, -as well as communications from the field where theses issues are reported. +as well as communications where these issues are reported.
+as well as communications where theses issues are reported. Be very careful not to include sensitive information in documentation that is not securely stored, or that you intend to release as part of a replication package or data publication. @@ -368,9 +366,8 @@ \subsection{Documenting data cleaning} Use tabulations, summary statistics, histograms and density plots to understand the structure of data, and look for potentially problematic patterns such as outliers, missing values and distributions that may be caused by data entry errors. -Don't spend time trying to correct data points that were not flagged during data quality monitoring. -Instead, create a record of what you observe, -then use it as a basis to discuss with your team how to address potential issues during data construction. +Create a record of what you observe, +then use it as a basis for discussions of how to address data issues during variable construction. This material will also be valuable during exploratory data analysis. \section{Constructing final indicators} From 952b3d1917a14b2711f19990c0f330563fb19a01 Mon Sep 17 00:00:00 2001 From: Maria Date: Mon, 24 Feb 2020 11:01:33 -0500 Subject: [PATCH 773/854] Update data-analysis.tex updates to the rest of chapter 6 --- chapters/data-analysis.tex | 104 +++++++++++++++---------------------- 1 file changed, 42 insertions(+), 62 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index a889343e9..9ec1dc470 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -346,7 +346,7 @@ \subsection{Documenting data cleaning} (This could be a survey team, the government ministry responsible for administrative data systems, the technology firm that generated remote sensing data, etc.) You should acquire and organize all documentation of how the data was generated, such as -data collection reports, field plans, data collection manuals, survey instruments, +reports from the data provider, field protocols, data collection manuals, survey instruments, supervisor notes, and data quality monitoring reports. These materials are essential for data documentation.\sidenote{ \url{https://dimewiki.worldbank.org/Data\_Documentation}} @@ -373,28 +373,25 @@ \subsection{Documenting data cleaning} \section{Constructing final indicators} % What is construction ------------------------------------- -The third stage in the creation of analysis data is construction. -Constructing variables means processing the data points as provided in the raw data to make them suitable for analysis. +The third stage is construction of the variables of interest for analysis. It is at this stage that the raw data is transformed into analysis data. This is done by creating derived variables (dummies, indices, and interactions, to name a few), as planned during research design\index{Research design}, and using the pre-analysis plan as a guide.\index{Pre-analysis plan} To understand why construction is necessary, -let's take the example of a household survey's consumption module. +let's take the example of a consumption module from a household survey. For each item in a context-specific bundle, -this module will ask whether the household consumed any of it over a certain period of time. -If they did, it will then ask about quantities, units and expenditure for each item. -However, it is difficult to run a meaningful regression +the respondent is asked whether the household consumed each item over a certain period of time. 
+If they did, the respondent will be asked about the quantity consumed and the cost of the relevant unit. It would be difficult to run a meaningful regression on the number of cups of milk and handfuls of beans that a household consumed over a week. You need to manipulate them into something that has \textit{economic} meaning, such as caloric input or food expenditure per adult equivalent. During this process, the data points will typically be reshaped and aggregated so that level of the data set goes from the unit of observation -(one item in the bundle) in the survey to the unit of analysis (the household),\sidenote{ \url{https://dimewiki.worldbank.org/Unit\_of\_Observation}} -so that the level of the data set goes from the unit of observation (one item in the bundle) -in the survey to the unit of analysis (the household).\sidenote{ -\url{https://dimewiki.worldbank.org/Unit\_of\_Observation}} +(one item in the bundle) in the survey to the unit of analysis (the household).\sidenote{ \url{https://dimewiki.worldbank.org/Unit\_of\_Observation}} + A constructed data set is built to answer an analysis question. Since different pieces of analysis may require different samples, % From cleaning Construction is done separately from data cleaning for two reasons. -The first one is to clearly differentiate the data originally collected -from the result of data processing decisions. -The second is to ensure that variable definition is consistent across data sources. +The first one is to clearly differentiate correction of data entry errors +(necessary for all interactions with the data) +from creation of analysis indicators (necessary only for the analysis at hand). +It is also important to differentiate the two stages +to ensure that variable definition is consistent across data sources. Unlike cleaning, construction can create many outputs from many inputs. Let's take the example of a project that has a baseline and an endline survey. Unless the two instruments are exactly the same, and therefore will be done separately. However, you still want the constructed variables to be calculated in the same way, so they are comparable. To do this, you will require at least two cleaning scripts, -and a single one for construction -- -we will discuss how to do this in practice in a bit. +and a single one for construction. % From analysis -Ideally, indicator construction should be done right after data cleaning, +Ideally, indicator construction should be done right after data cleaning and before data analysis starts, according to the pre-analysis plan.\index{Pre-analysis plan} -In practice, however, following this principle is not always easy. -As you analyze the data, different constructed variables will become necessary, +In practice, however, as you analyze the data, +different constructed variables will become necessary, as well as subsets and other alterations to the data. -Constructing variables in a separate script from the analysis -will help you ensure consistency across different outputs. +Even if construction and analysis are done concurrently, +you should always do the two in separate scripts. If every script that creates a table starts by loading a data set, subsetting it, and manipulating variables, any edits to construction need to be replicated in all scripts. This increases the chances that at least one of them will have a different sample or variable definition.
-Therefore, even if construction ends up coming before analysis only in the order the code is run,
-it's important to think of them as different steps.
+Doing all variable construction in a single, separate script helps
+avoid this and ensure consistency across different outputs.
 
 \subsection{Constructing analytical variables}
-
-The first thing that comes to mind when we talk about variable construction is, of course, creating new variables.
-Do this by adding new variables to the data set instead of overwriting the original information,
-and assign functional names to them.
-During cleaning, you want to keep all variables consistent with the survey instrument.
-But constructed variables were not present in the survey to start with,
-so making their names consistent with the survey form is not as crucial.
-Of course, whenever possible, having variable names that are both intuitive
-\textit{and} can be linked to the survey is ideal,
-but if you need to choose, prioritize functionality.
+New variables created during the construction stage should be added to the data set instead of overwriting the original information.
+They should be assigned functional names.
 Ordering the data set so that related variables are together,
-and adding notes to each of them as necessary will also make your data set more user-friendly.
-
-The most simple case of new variables to be created are aggregate indicators.
-For example, you may want to add a household's income from different sources into a single total income variable,
-or create a dummy for having at least one child in school.
-Jumping to the step where you actually create this variables seems intuitive,
-but it can also cause you a lot of problems,
-as overlooking details may affect your results.
-It is important to check and double-check the value-assignments of questions,
-as well as their scales, before constructing new variables based on them.
-This is when you will use the knowledge of the data you acquired and the documentation you created during the cleaning step the most.
-It is often useful to start looking at comparisons and other documentation outside the code editor.
-
-Make sure there is consistency across constructed variables.
-It's possible that your questionnaire asked respondents to report some answers as percentages and others as proportions,
-or that in one variable \texttt{0} means ``no'' and \texttt{1} means ``yes'',
+and adding notes to each of them as necessary will make your data set more user-friendly.
+
+Before constructing new variables,
+you must check and double-check the value-assignments of questions,
+as well as the units and scales.
+This is when you will use the knowledge of the data and the documentation you acquired during cleaning.
+For example, it's possible that the survey instrument asked respondents
+to report some answers as percentages and others as proportions,
+or that in one question \texttt{0} means ``no'' and \texttt{1} means ``yes'',
 while in another one the same answers were coded as \texttt{1} and \texttt{2}.
 We recommend coding yes/no questions as either \texttt{1} and \texttt{0} or \texttt{TRUE} and \texttt{FALSE},
 so they can be used numerically as frequencies in means and as dummies in regressions.
@@ -478,17 +461,16 @@ \subsection{Constructing analytical variables}
 Check that non-binary categorical variables have the same value assignment,
 i.e., that labels and levels have the same correspondence across variables that use the same options.
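As an illustration of the value-assignment checks recommended above, here is a short Stata sketch (the variable names and numeric codes are hypothetical; check the actual instrument for the real coding):

    * Tabulate with and without labels to confirm how codes were assigned
    tab attends_school
    tab attends_school, nolabel

    * Recode a question stored as 1 = yes / 2 = no into 1 = yes / 0 = no
    recode attends_school (2 = 0)
    label define yesno 0 "No" 1 "Yes"
    label values attends_school yesno

    * Harmonize a share reported in percent with others reported as proportions
    replace land_share = land_share / 100

Once recoded this way, the yes/no variable can be averaged to get a frequency and entered directly as a dummy in a regression.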
 Finally, make sure that any numeric variables you are comparing are converted to the same scale or unit of measure.
-You cannot add one hectare and two acres into a meaningful number.
+You cannot add one hectare and two acres and get a meaningful number.
 During construction, you will also need to address some of the issues you identified in the data set
 as you were cleaning it.
 The most common of them is the presence of outliers.
-How to treat outliers is a research question,
-but make sure to note what was the decision made by the research team,
-and how you came to it.
+How to treat outliers is a question for the research team (as there are multiple possible approaches),
+but make sure to note what decision was made and why.
 Results can be sensitive to the treatment of outliers,
 so keeping the original variable in the data set will allow you to test how much it affects the estimates.
-All these points also apply to imputation of missing values and other distributional patterns.
+These points also apply to imputation of missing values and other distributional patterns.
 
 The more complex construction tasks involve changing the structure of the data:
 adding new observations or variables by merging data sets,
@@ -542,17 +524,14 @@ \subsection{Documenting variable construction}
 
 \section{Writing data analysis code}
 % Intro --------------------------------------------------------------
-Data analysis is the stage when research outputs are created.
+When the data is cleaned and indicators are constructed, you are ready to generate analytical outputs.
 \index{data analysis}
-Many introductions to common code skills and analytical frameworks exist, such as
+There are many existing resources for data analysis, such as
 \textit{R for Data Science};\sidenote{\url{https://r4ds.had.co.nz}}
 \textit{A Practical Introduction to Stata};\sidenote{\url{https://scholar.harvard.edu/files/mcgovern/files/practical\_introduction\_to\_stata.pdf}}
 \textit{Mostly Harmless Econometrics};\sidenote{\url{https://www.researchgate.net/publication/51992844\_Mostly\_Harmless\_Econometrics\_An\_Empiricist's\_Companion}}
 and \textit{Causal Inference: The Mixtape}.\sidenote{\url{https://scunning.com/mixtape.html}}
-This section will not include instructions on how to conduct specific analyses.
-That is a research question, and requires expertise beyond the scope of this book.
-Instead, we will outline the structure of writing analysis code,
-assuming you have completed the process of data cleaning and variable construction.
+We focus on how to \textit{code} data analysis, rather than how to conduct specific analyses.
 
 \subsection{Organizing analysis code}
 
@@ -630,7 +609,7 @@ \subsection{Visualizing data}
 \texttt{r2d3},\sidenote{\url{https://rstudio.github.io/r2d3}}
 \texttt{leaflet},\sidenote{\url{https://rstudio.github.io/leaflet}}
 and \texttt{plotly},\sidenote{\url{https://plot.ly/r}} to name a few.
-We have no intention of creating an exhaustive list, and this one is certainly missing very good references; but it is a good place to start.
+We have no intention of creating an exhaustive list, but this is a good place to start.
 
 We attribute some of the difficulty of creating good data visualization
 to writing code to create them.
@@ -693,7 +672,8 @@ \subsection{Exporting analysis outputs}
 Exporting tables to \texttt{.tex} should be preferred.
 Excel \texttt{.xlsx} and \texttt{.csv} are also commonly used,
 but require the extra step of copying the tables into the final output.
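A regression table can instead be written straight to a \texttt{.tex} file that the paper inputs, so no copying is needed. A minimal sketch using the user-written \texttt{estout} package (assuming it is installed; model, variable, and file names are hypothetical):

    * ssc install estout   // provides eststo and esttab (assumed available)
    eststo clear
    eststo: regress consumption treatment, vce(robust)
    eststo: regress consumption treatment baseline_assets, vce(robust)
    esttab using "${output}/consumption_results.tex", replace se label booktabs

    * In the paper, \input{consumption_results.tex} pulls in the latest version

Re-running the do-file refreshes the file on disk, and the document picks up the updated table the next time it is compiled.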
-The amount of work needed in a copy-paste workflow increases rapidly with the number of tables and figures included in a research output, +The amount of work needed in a copy-paste workflow increases +rapidly with the number of tables and figures included in a research output, and so do the chances of having the wrong version a result in your paper or report. If you need to create a table with a very particular format From 52032298d9788511ace3af7cc31b5ae033b13939 Mon Sep 17 00:00:00 2001 From: Maria Date: Mon, 24 Feb 2020 11:40:50 -0500 Subject: [PATCH 774/854] Update chapters/data-collection.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- chapters/data-collection.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 2fc60214b..45012cc24 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -4,7 +4,7 @@ Much of the recent push toward credibility in the social sciences has focused on analytical practices. However, credible development research often depends, first and foremost, on the quality of the raw data. When you are using original data - whether collected for the first time through surveys or sensors or acquired through a unique partnership - -there is no way for others to validate that it accurately reflects reality +there is no way for anyone else in the research community to validate that it accurately reflects reality and that the indicators you have based your analysis on are meaningful. This chapter details the necessary components for a high-quality data acquisition process, no matter whether you are receiving large amounts of unique data from partners From eda521c5e8e9b6854bfd7fbd456890476a71df88 Mon Sep 17 00:00:00 2001 From: Maria Date: Mon, 24 Feb 2020 11:41:39 -0500 Subject: [PATCH 775/854] Update chapters/data-collection.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- chapters/data-collection.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 45012cc24..43a854f8c 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -126,7 +126,7 @@ \subsection{Data licensing agreements} and maintains full intellectual property rights. The contract should also explicitly stipulate that the contracted firm is responsible for protecting the confidentiality of the respondents, -and that the data collection will not be distributed to any third parties +and that the data collection will not be delegated to any third parties or used by the firm or subcontractors for any purpose not expressly stated in the contract, before, during or after the assignment. 
The contract should also stipulate that the vendor is required to comply with From 78055b42609d534a2db48e898891c36c44f43f55 Mon Sep 17 00:00:00 2001 From: Maria Date: Mon, 24 Feb 2020 11:41:50 -0500 Subject: [PATCH 776/854] Update chapters/data-collection.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- chapters/data-collection.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 43a854f8c..3f6c1e36f 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -136,7 +136,7 @@ \subsection{Data licensing agreements} Research teams that collect their own data must consider the terms under which they will release that data to other researchers or to the general public. -Will you publicly releasing the data in full (removing personal identifiers))? +Will you publicly releasing the data in full (removing personal identifiers)? Would you be okay with it being stored on servers anywhere in the world, even ones that are owned by corporations or governments abroad? Would you prefer to decided permission on a case-by-case basis, dependent on specific proposed uses? From 71ef0f46df3fb77a065f302a3573a6af6d2fc632 Mon Sep 17 00:00:00 2001 From: Maria Date: Mon, 24 Feb 2020 11:42:26 -0500 Subject: [PATCH 777/854] Update chapters/data-collection.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- chapters/data-collection.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 3f6c1e36f..54a216a1d 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -138,7 +138,7 @@ \subsection{Data licensing agreements} under which they will release that data to other researchers or to the general public. Will you publicly releasing the data in full (removing personal identifiers)? Would you be okay with it being stored on servers anywhere in the world, -even ones that are owned by corporations or governments abroad? +even ones that are owned by corporations or governments in countries other than your own? Would you prefer to decided permission on a case-by-case basis, dependent on specific proposed uses? Would you expect that users of your data cite you or give you credit, or would you require them in turn to release From 5235cdcc7d2ed9e26e4f808c289aac50647ade75 Mon Sep 17 00:00:00 2001 From: Maria Date: Mon, 24 Feb 2020 11:42:45 -0500 Subject: [PATCH 778/854] Update chapters/data-collection.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- chapters/data-collection.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 54a216a1d..9e4aaaf32 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -139,7 +139,7 @@ \subsection{Data licensing agreements} Will you publicly releasing the data in full (removing personal identifiers)? Would you be okay with it being stored on servers anywhere in the world, even ones that are owned by corporations or governments in countries other than your own? -Would you prefer to decided permission on a case-by-case basis, dependent on specific proposed uses? 
+Would you prefer to decide permission on a case-by-case basis, dependent on specific proposed uses? Would you expect that users of your data cite you or give you credit, or would you require them in turn to release their derivative data or publications under similar licenses as yours? From b3a9cec0f646e94bc2df347d63e43929237b185b Mon Sep 17 00:00:00 2001 From: Maria Date: Mon, 24 Feb 2020 12:56:15 -0500 Subject: [PATCH 779/854] Update chapters/data-collection.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- chapters/data-collection.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 9e4aaaf32..931d16d1d 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -186,7 +186,7 @@ \subsection{Receiving data from development partners} what variable(s) uniquely identify the dataset(s), and how different datasets can be linked; and a description of field procedures and quality controls. We use as a standard the Data Documentation Initiative (DDI), which is supported by the -World Bank's Microdata Catalog.\sidenote{\url{microdata.worldbank.org}} +World Bank's Microdata Catalog.\sidenote{\url{https://microdata.worldbank.org}} As soon as the requisite pieces of information are stored together, think about which ones are the components of what you would call a dataset. From 36b8d3a2e5d10ab3a197fff5c6ce0d1887590346 Mon Sep 17 00:00:00 2001 From: Maria Date: Mon, 24 Feb 2020 12:57:10 -0500 Subject: [PATCH 780/854] Update chapters/publication.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- chapters/publication.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index bd8c0461f..82f99419c 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -101,7 +101,7 @@ \subsection{Preparing dynamic documents} Documents called ``notebooks'' (such as Jupyter\sidenote{\url{https://jupyter.org}}) work similarly, as they also use the underlying analytical software to create the document. These tools are usually appropriate for short or informal documents -because they tend to offer restricted editability outside the base software +because it tends to be difficult to edit the content unless using the tool and often have limited abilities to incorporate precise formatting. There are also simple tools for dynamic documents From 4148d77b0d41a378223f4395c2fb9a1b0d593bbe Mon Sep 17 00:00:00 2001 From: Maria Date: Mon, 24 Feb 2020 12:57:44 -0500 Subject: [PATCH 781/854] Update chapters/publication.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- chapters/publication.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 82f99419c..297693dbd 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -102,7 +102,7 @@ \subsection{Preparing dynamic documents} as they also use the underlying analytical software to create the document. These tools are usually appropriate for short or informal documents because it tends to be difficult to edit the content unless using the tool -and often have limited abilities to incorporate precise formatting. 
+and often do not have as extensive formatting options as, for example, Word.
 
 There are also simple tools for dynamic documents
 that do not require direct operation of the underlying code or software,
 
From b5881fe8ee0baeff22f171ddb6dc4634d6e33837 Mon Sep 17 00:00:00 2001
From: Maria
Date: Mon, 24 Feb 2020 12:58:27 -0500
Subject: [PATCH 782/854] Update chapters/publication.tex
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-Authored-By: Kristoffer Bjärkefur
---
 chapters/publication.tex | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/publication.tex b/chapters/publication.tex
index 297693dbd..72529f1f0 100644
--- a/chapters/publication.tex
+++ b/chapters/publication.tex
@@ -475,7 +475,7 @@ \subsection{Releasing a replication package}
 over the next few years;
 the specific solutions we mention here highlight some current approaches
 as well as their strengths and weaknesses.
-On option is GitHub.
+One option is GitHub.
 Making a public GitHub repository is completely free.
 It can hold any file types,
 provide a structured download of your whole project,
From 3c77040a9378d1d1963108568cd8da346c7a4d50 Mon Sep 17 00:00:00 2001
From: Maria
Date: Mon, 24 Feb 2020 12:59:23 -0500
Subject: [PATCH 783/854] Update chapters/sampling-randomization-power.tex

Co-Authored-By: Luiza Andrade
---
 chapters/sampling-randomization-power.tex | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex
index 0cf777dc7..3d2c4e4b8 100644
--- a/chapters/sampling-randomization-power.tex
+++ b/chapters/sampling-randomization-power.tex
@@ -67,7 +67,7 @@ \section{Random processes}
 
 \subsection{Implementing random processes reproducibly in Stata}
 
-Reproducibility in statistical programming means that the outputs of random processes
+For statistical programming to be considered reproducible, it must be the case that the outputs of random processes
 can be re-obtained at a future time.\cite{orozco2018make}
 For our purposes, we will focus on what you need to understand
 in order to produce truly random results for your project using Stata,
From d16ca547bd419f54561115bd6e545b2b02366456 Mon Sep 17 00:00:00 2001
From: Luiza
Date: Mon, 24 Feb 2020 13:41:08 -0500
Subject: [PATCH 784/854] small fixes to links

---
 appendix/stata-guide.tex   |  2 +-
 chapters/data-analysis.tex | 34 ++++++++++-----------
 chapters/handling-data.tex | 60 +++++++++++++++++------------------
 chapters/publication.tex   |  6 ++--
 4 files changed, 51 insertions(+), 51 deletions(-)

diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex
index f6f3bf733..5b3b41904 100644
--- a/appendix/stata-guide.tex
+++ b/appendix/stata-guide.tex
@@ -425,7 +425,7 @@ \subsection{Saving data}
 If there is a unique ID variable or a set of ID variables,
 the code should test that they are uniqueally and fully identifying
 the data set.\sidenote{
-	\url{https://dimewiki.worldbank.org/ID\_Variable\_Properties}}
+	\url{https://dimewiki.worldbank.org/ID_Variable_Properties}}
 ID variables are also perfect variables to sort on,
 and to \texttt{order} first in the data set.
 
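The two Stata passages touched above -- reproducible random processes and the `Saving data' guidance on ID variables -- can be illustrated together in one short, hypothetical sketch (the ID variable, seed value, and file path are placeholders):

    version 13.1                  // pin the version so the random-number generator never changes
    set seed 287608               // a fixed seed makes the random draws re-obtainable later
    isid hhid                     // stop with an error if hhid does not uniquely and fully identify observations
    sort hhid                     // a stable sort order is needed for reproducible random assignment
    gen random_draw = runiform()  // this draw is now identical on every run
    order hhid, first             // keep the ID variable at the front of the data set
    compress
    save "${data}/cleaned/household_clean.dta", replace

Version, seed, and sort order are the three ingredients that make randomization in Stata reproducible; the ID check before saving is what makes the sorted order stable in the first place.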
diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index d868afaee..e1b634b91 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -70,7 +70,7 @@ \subsection{Organizing your folder structure} Standardizing folders greatly reduces the costs that PIs and RAs face when switching between projects, because they are all organized in exactly the same way and use the same filepaths, shortcuts, and macro references.\sidenote{ - \url{https://dimewiki.worldbank.org/DataWork\_Folder}} + \url{https://dimewiki.worldbank.org/DataWork_Folder}} We created \texttt{iefolder} based on our experience with primary data, but it can be used for different types of data, and adapted to fit different needs. @@ -78,7 +78,7 @@ \subsection{Organizing your folder structure} the principle of creating a single unified standard remains. At the top level of the structure created by \texttt{iefolder} are what we call ``round'' folders.\sidenote{ - \url{https://dimewiki.worldbank.org/DataWork\_Survey\_Round}} + \url{https://dimewiki.worldbank.org/DataWork_Survey_Round}} You can think of a ``round'' as a single source of data, which will all be cleaned using a single script. Inside each round folder, there are dedicated folders for: @@ -87,7 +87,7 @@ \subsection{Organizing your folder structure} The folders that hold code are organized in parallel to these, so that the progression through the whole project can be followed by anyone new to the team. Additionally, \texttt{iefolder} creates \textbf{master do-files}\sidenote{ - \url{https://dimewiki.worldbank.org/Master\_Do-files}} + \url{https://dimewiki.worldbank.org/Master_Do-files}} so the structure of all project code is reflected in a top-level script. \subsection{Breaking down tasks} @@ -232,7 +232,7 @@ \section{De-identifying research data} \section{Cleaning data for analysis} Data cleaning is the second stage in the transformation of data you received into data that you can analyze.\sidenote{\url{ -https://dimewiki.worldbank.org/Data\_Cleaning}} +https://dimewiki.worldbank.org/Data_Cleaning}} The cleaning process involves (1) making the data set easily usable and understandable, and (2) documenting individual data points and patterns that may bias the analysis. The underlying data structure does not change. @@ -249,12 +249,12 @@ \subsection{Correcting data entry errors} There are two main cases when the raw data will be modified during data cleaning. The first one is when there are duplicated entries in the data. -Ensuring that observations are uniquely and fully identified\sidenote{\url{https://dimewiki.worldbank.org/ID\_Variable\_Properties}} +Ensuring that observations are uniquely and fully identified\sidenote{\url{https://dimewiki.worldbank.org/ID_Variable_Properties}} is possibly the most important step in data cleaning. Modern survey tools create unique observation identifiers. That, however, is not the same as having a unique ID variable for each individual in the sample. You want to make sure the data set has a unique ID variable -that can be cross-referenced with other records, such as the master data set\sidenote{\url{https://dimewiki.worldbank.org/Master\_Data\_Set}} +that can be cross-referenced with other records, such as the master data set\sidenote{\url{https://dimewiki.worldbank.org/Master_Data_Set}} and other rounds of data collection. 
\texttt{ieduplicates} and \texttt{iecompdup}, two Stata commands included in the \texttt{iefieldkit} @@ -331,12 +331,12 @@ \subsection{Labeling, annotating, and finalizing clean data} Applying labels makes it easier to understand what the data mean as you explore it, and thus reduces the risk of small errors making their way through into the analysis stage. Variable and value labels should be accurate and concise.\sidenote{ - \url{https://dimewiki.worldbank.org/Data\_Cleaning\#Applying\_Labels}} + \url{https://dimewiki.worldbank.org/Data_Cleaning\#Applying_Labels}} Applying labels makes it easier to understand what the data is showing while exploring the data. This minimizes the risk of small errors making their way through into the analysis stage. -Variable and value labels should be accurate and concise.\sidenote{\url{https://dimewiki.worldbank.org/Data\_Cleaning\#Applying\_Labels}} +Variable and value labels should be accurate and concise.\sidenote{\url{https://dimewiki.worldbank.org/Data_Cleaning\#Applying_Labels}} Third, recodes should be used to turn codes for ``Don't know'', ``Refused to answer'', and -other non-responses into extended missing values.\sidenote{\url{https://dimewiki.worldbank.org/Data\_Cleaning\#Survey\_Codes\_and\_Missing\_Values}} +other non-responses into extended missing values.\sidenote{\url{https://dimewiki.worldbank.org/Data_Cleaning\#Survey_Codes_and_Missing_Values}} String variables that correspond to categorical variables need to be encoded. Open-ended responses stored as strings usually have a high risk of being identifiers, so they should be dropped at this point. @@ -351,7 +351,7 @@ \subsection{Documenting data cleaning} including enumerator manuals, survey instruments, supervisor notes, and data quality monitoring reports. These materials are essential for data documentation.\sidenote{ - \url{https://dimewiki.worldbank.org/Data\_Documentation}} + \url{https://dimewiki.worldbank.org/Data_Documentation}} \index{Documentation} They should be stored in the corresponding \texttt{Documentation} folder for easy access, as you will probably need them during analysis, @@ -393,10 +393,10 @@ \section{Constructing final indicators} During this process, the data points will typically be reshaped and aggregated so that level of the data set goes from the unit of observation (one item in the bundle) in the survey to the unit of analysis (the household),\sidenote{ - \url{https://dimewiki.worldbank.org/Unit\_of\_Observation}} + \url{https://dimewiki.worldbank.org/Unit_of_Observation}} so that the level of the data set goes from the unit of observation (one item in the bundle) in the survey to the unit of analysis (the household).\sidenote{ -\url{https://dimewiki.worldbank.org/Unit\_of\_Observation}} +\url{https://dimewiki.worldbank.org/Unit_of_Observation}} A constructed data set is built to answer an analysis question. 
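The labeling and non-response recoding steps covered above can be sketched in a few lines of Stata (the variable names and numeric non-response codes are hypothetical placeholders; surveys use different conventions):

    * Turn survey codes for non-response into extended missing values
    recode income_amount (-999 = .d) (-888 = .r)   // .d "don't know", .r "refused"

    * Label the variable and its values so the clean data set is self-documenting
    label variable income_amount "Household income last month (local currency)"
    label define inclab .d "Don't know" .r "Refused to answer"
    label values income_amount inclab

    * Encode a categorical string variable into a labeled numeric variable
    encode crop_name, generate(crop_code)

The numeric codes stay usable in analysis, while the labels and extended missing values keep the meaning of each response visible in the clean data set.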
Since different pieces of analysis may require different samples, @@ -548,8 +548,8 @@ \section{Writing data analysis code} \index{data analysis} Many introductions to common code skills and analytical frameworks exist, such as \textit{R for Data Science};\sidenote{\url{https://r4ds.had.co.nz}} -\textit{A Practical Introduction to Stata};\sidenote{\url{https://scholar.harvard.edu/files/mcgovern/files/practical\_introduction\_to\_stata.pdf}} -\textit{Mostly Harmless Econometrics};\sidenote{\url{https://www.researchgate.net/publication/51992844\_Mostly\_Harmless\_Econometrics\_An\_Empiricist's\_Companion}} +\textit{A Practical Introduction to Stata};\sidenote{\url{https://scholar.harvard.edu/files/mcgovern/files/practical_introduction_to_stata.pdf}} +\textit{Mostly Harmless Econometrics};\sidenote{\url{https://www.researchgate.net/publication/51992844_Mostly_Harmless_Econometrics_An_Empiricist's_Companion}} and \textit{Causal Inference: The Mixtape}.\sidenote{\url{https://scunning.com/mixtape.html}} This section will not include instructions on how to conduct specific analyses. That is a research question, and requires expertise beyond the scope of this book. @@ -607,7 +607,7 @@ \subsection{Organizing analysis code} \subsection{Visualizing data} -\textbf{Data visualization}\sidenote{\url{https://dimewiki.worldbank.org/Data\_visualization}} \index{data visualization} +\textbf{Data visualization}\sidenote{\url{https://dimewiki.worldbank.org/Data_visualization}} \index{data visualization} is increasingly popular, and is becoming a field in its own right.\cite{healy2018data,wilke2019fundamentals} Whole books have been written on how to create good data visualizations, so we will not attempt to give you advice on it. @@ -710,8 +710,8 @@ \subsection{Exporting analysis outputs} This means it should be easy to read and understand them with only the information they contain. Make sure labels and notes cover all relevant information, such as sample, unit of observation, unit of measurement and variable definition.\sidenote{ - \url{https://dimewiki.worldbank.org/Checklist:\_Reviewing\_Graphs} \\ - \url{https://dimewiki.worldbank.org/Checklist:\_Submit\_Table}} + \url{https://dimewiki.worldbank.org/Checklist:_Reviewing_Graphs} \\ + \url{https://dimewiki.worldbank.org/Checklist:_Submit_Table}} If you follow the steps outlined in this chapter, most of the data work involved in the last step of the research process diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 11615f766..3d18543f9 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -43,16 +43,16 @@ \section{Protecting confidence in development research} Development researchers should take these concerns seriously. Many development research projects are purpose-built to address specific questions, and often use unique data or small samples. -As a result, it is often the case that the data +As a result, it is often the case that the data researchers use for such studies has never been reviewed by anyone else, -so it is hard for others to verify that it was +so it is hard for others to verify that it was collected, handled, and analyzed appropriately. -Reproducible and transparent methods are key to maintaining credibility -and avoiding serious errors. -This is particularly true for research that relies on original or novel data sources, -from innovative big data sources to surveys. -The field is slowly moving in the direction of requiring greater transparency. 
+Reproducible and transparent methods are key to maintaining credibility +and avoiding serious errors. +This is particularly true for research that relies on original or novel data sources, +from innovative big data sources to surveys. +The field is slowly moving in the direction of requiring greater transparency. Major publishers and funders, most notably the American Economic Association, have taken steps to require that code and data are accurately reported, cited, and preserved as outputs in themselves.\sidenote{ @@ -61,20 +61,20 @@ \section{Protecting confidence in development research} \subsection{Research reproducibility} -Can another researcher reuse the same code on the same data +Can another researcher reuse the same code on the same data and get the exact same results as in your published paper?\sidenote{ \url{https://blogs.worldbank.org/impactevaluations/what-development-economists-talk-about-when-they-talk-about-reproducibility}} -This is a standard known as \textbf{computational reproducibility}, +This is a standard known as \textbf{computational reproducibility}, and it is an increasingly common requirement for publication.\sidenote{ \url{https://www.nap.edu/resource/25303/R&R.pdf}}) -It is best practice to verify computational reproducibility before submitting a paper before publication. -This should be done by someone who is not on your research team, on a different computer, -using exactly the package of code and data files you plan to submit with your paper. -Code that is well-organized into a master script, and written to be easily run by others, -makes this task simpler. -The next chapter discusses organization of data work in detail. - -For research to be reproducible, +It is best practice to verify computational reproducibility before submitting a paper before publication. +This should be done by someone who is not on your research team, on a different computer, +using exactly the package of code and data files you plan to submit with your paper. +Code that is well-organized into a master script, and written to be easily run by others, +makes this task simpler. +The next chapter discusses organization of data work in detail. + +For research to be reproducible, all code files for data cleaning, construction and analysis should be public, unless they contain identifying information. Nobody should have to guess what exactly comprises a given index, @@ -169,7 +169,7 @@ \subsection{Research transparency} to record the decision process leading to changes and additions, track and register discussions, and manage tasks. These are flexible tools that can be adapted to different team and project dynamics. -Services that log your research process can show things like modifications made in response to referee comments, +Services that log your research process can show things like modifications made in response to referee comments, by having tagged version histories at each major revision. 
They also allow you to use issue trackers to document the research paths and questions you may have tried to answer @@ -247,7 +247,7 @@ \section{Ensuring privacy and security in research data} \index{data collection} This includes names, addresses, and geolocations, and extends to personal information such as email addresses, phone numbers, and financial information.\index{geodata}\index{de-identification} -It is important to keep in mind data privacy principles not only for the respondent +It is important to keep in mind data privacy principles not only for the respondent but also the PII data of their household members or other individuals who are covered under the survey. \index{privacy} In some contexts this list may be more extensive -- @@ -310,19 +310,19 @@ \subsection{Obtaining ethical approval and consent} before it happens, as it is happening, or after it has already happened. It also means that they must explicitly and affirmatively consent to the collection, storage, and use of their information for any purpose. -Therefore, the development of appropriate consent processes is of primary importance.\sidenote{url\ - {https://www.icpsr.umich.edu/icpsrweb/content/datamanagement/confidentiality/conf-language.html}} +Therefore, the development of appropriate consent processes is of primary importance.\sidenote{ + \url{https://www.icpsr.umich.edu/icpsrweb/content/datamanagement/confidentiality/conf-language.html}} All survey instruments must include a module in which the sampled respondent grants informed consent to participate. -Research participants must be informed of the purpose of the research, +Research participants must be informed of the purpose of the research, what their participation will entail in terms of duration and any procedures, -any foreseeable benefits or risks, +any foreseeable benefits or risks, and how their identity will be protected.\sidenote{\url {https://www.icpsr.umich.edu/icpsrweb/content/datamanagement/confidentiality/conf-language.html}} There are special additional protections in place for vulnerable populations, such as minors, prisoners, and people with disabilities, and these should be confirmed with relevant authorities if your research includes them. -IRB approval should be obtained well before any data is acquired. +IRB approval should be obtained well before any data is acquired. IRBs may have infrequent meeting schedules or require several rounds of review for an application to be approved. If there are any deviations from an approved plan or expected adjustments, @@ -369,7 +369,7 @@ \subsection{Transmitting and storing data securely} enough to rely service providers' on-the-fly encryption as they need to keep a copy of the decryption key to make it automatic. When confidential data is stored on a local computer it must always remain encrypted, and confidential data may never be sent unencrypted -over email, WhatsApp, or other chat services. +over email, WhatsApp, or other chat services. The easiest way to reduce the risk of leaking personal information is to use it as rarely as possible. It is often very simple to conduct planning and analytical work @@ -418,10 +418,10 @@ \subsection{De-identifying data} You can take simple steps to avoid risks by minimizing the handling of PII. First, only collect information that is strictly needed for the research. Second, avoid the proliferation of copies of identified data. 
-There should never be more than one copy of the raw identified dataset in the project folder, +There should never be more than one copy of the raw identified dataset in the project folder, and it must always be encrypted. -Even within the research team, -access to PII data should be limited to team members who require it for specific analysis +Even within the research team, +access to PII data should be limited to team members who require it for specific analysis (most analysis will not depend on PII). Analysis that requires PII data is rare and can be avoided by properly linking identifiers to research information @@ -438,11 +438,11 @@ \subsection{De-identifying data} -- even if that data has had all directly identifying information removed -- by using some other data that becomes identifying when analyzed together. For this reason, we recommend de-identification in two stages. -The \textbf{initial de-identification} process strips the data of direct identifiers +The \textbf{initial de-identification} process strips the data of direct identifiers as early in the process as possible, to create a working de-identified dataset that can be shared \textit{within the research team} without the need for encryption. -This simplifies workflows. +This simplifies workflows. The \textbf{final de-identification} process involves making a decision about the trade-off between risk of disclosure and utility of the data diff --git a/chapters/publication.tex b/chapters/publication.tex index 32993eb11..c1bc94308 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -313,9 +313,9 @@ \subsection{Publishing data for replication} includes a Microdata Catalog\sidenote{ \url{https://microdata.worldbank.org}} where researchers can publish data and documentation for their projects.\sidenote{ -\url{https://dimewiki.worldbank.org/Microdata\_Catalog} +\url{https://dimewiki.worldbank.org/Microdata_Catalog} \newline -\url{https://dimewiki.worldbank.org/Checklist:\_Microdata\_Catalog\_submission} +\url{https://dimewiki.worldbank.org/Checklist:_Microdata_Catalog_submission} } The Harvard Dataverse\sidenote{ \url{https://dataverse.harvard.edu}} @@ -395,7 +395,7 @@ \subsection{Publishing data for replication} There are a number of tools developed to help researchers de-identify data and which you should use as appropriate at that stage of data collection. 
These include \texttt{PII\_detection}\sidenote{ - \url{https://github.com/PovertyAction/PII\_detection}} + \url{https://github.com/PovertyAction/PII_detection}} from IPA, \texttt{PII-scan}\sidenote{ \url{https://github.com/J-PAL/PII-Scan}} From 77af7d261a191dab0567458562ea8a9f4dacfce1 Mon Sep 17 00:00:00 2001 From: Luiza Date: Mon, 24 Feb 2020 14:19:43 -0500 Subject: [PATCH 785/854] small link fixes --- chapters/data-collection.tex | 14 +++++++------- chapters/sampling-randomization-power.tex | 2 +- 2 files changed, 8 insertions(+), 8 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 6a5d8e2ee..04d1ef8eb 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -204,7 +204,7 @@ \subsection{Developing a data collection instrument} such as from the World Bank's Living Standards Measurement Survey.\cite{glewwe2000designing} The focus of this section is the design of electronic field surveys, often referred to as Computer Assisted Personal Interviews (CAPI).\sidenote{ - \url{https://dimewiki.worldbank.org/Computer-Assisted\_Personal\_Interviews\_(CAPI)}} + \url{https://dimewiki.worldbank.org/Computer-Assisted_Personal_Interviews_(CAPI)}} Although most surveys are now collected electronically, by tablet, mobile phone or web browser, \textbf{questionnaire design}\sidenote{ \url{https://dimewiki.worldbank.org/Questionnaire_Design}} @@ -563,11 +563,11 @@ \section{Collecting and sharing data securely} All sensitive data must be handled in a way where there is no risk that anyone who is not approved by an Institutional Review Board (IRB)\sidenote{ - \url{https://dimewiki.worldbank.org/IRB\_Approval}} + \url{https://dimewiki.worldbank.org/IRB_Approval}} for the specific project has the ability to access the data. Data can be sensitive for multiple reasons, but the two most common reasons are that it contains personally identifiable information (PII)\sidenote{ - \url{https://dimewiki.worldbank.org/Personally\_Identifiable\_Information\_(PII)}} + \url{https://dimewiki.worldbank.org/Personally_Identifiable_Information_(PII)}} or that the partner providing the data does not want it to be released. Central to data security is \index{encryption}\textbf{data encryption}, which is a group @@ -614,7 +614,7 @@ \subsection{Collecting data securely} In field surveys, most common data collection software will automatically encrypt all data in transit (i.e., upload from field or download from server).\sidenote{ - \url{https://dimewiki.worldbank.org/Encryption\#Encryption\_in\_Transit}} + \url{https://dimewiki.worldbank.org/Encryption\#Encryption_in_Transit}} If this is implemented by the software you are using, then your data will be encrypted from the time it leaves the device (in tablet-assisted data collection) or browser (in web data collection), @@ -628,7 +628,7 @@ \subsection{Collecting data securely} Even though your data is therefore usually safe while it is being transmitted, it is not automatically secure when it is being stored. \textbf{Encryption at rest}\sidenote{ - \url{https://dimewiki.worldbank.org/Encryption\#Encryption\_at\_Rest}} + \url{https://dimewiki.worldbank.org/Encryption\#Encryption_at_Rest}} is the only way to ensure that PII data remains private when it is stored on a server on the internet. You must keep your data encrypted on the data collection server whenever PII data is collected. 
@@ -652,7 +652,7 @@ \subsection{Collecting data securely} never pass through the hands of a third party, including the data storage application. Most survey software implement \textbf{asymmetric encryption}\sidenote{ - \url{https://dimewiki.worldbank.org/Encryption\#Asymmetric\_Encryption}} + \url{https://dimewiki.worldbank.org/Encryption\#Asymmetric_Encryption}} where there are two keys in a public/private key pair. Only the private key can be used to decrypt the encrypted data, and the public key can only be used to encrypt the data. @@ -679,7 +679,7 @@ \subsection{Storing data securely} from the data collection device to the data collection server, it is not practical once you start interacting with the data. Instead, we use \textbf{symmetric encryption}\sidenote{ - \url{https://dimewiki.worldbank.org/Encryption\#Symmetric\_Encryption}} + \url{https://dimewiki.worldbank.org/Encryption\#Symmetric_Encryption}} where we create a secure encrypted folder, using, for example, VeraCrypt.\sidenote{\url{https://www.veracrypt.fr}} Here, a single key is used to both encrypt and decrypt the information. diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 3d2c4e4b8..2deab72fc 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -511,7 +511,7 @@ \subsection{Randomization inference} and it is interpretable as the probability that a program with no effect would have given you a result like the one actually observed. These randomization inference\sidenote{ - \url{https://dimewiki.worldbank.org/Randomization\_Inference}} + \url{https://dimewiki.worldbank.org/Randomization_Inference}} significance levels may be very different than those given by asymptotic confidence intervals, particularly in small samples (up to several hundred clusters). From eb4d950d6aa799a0fb446ecee3ef6d1e17f9fc6a Mon Sep 17 00:00:00 2001 From: Luiza Date: Mon, 24 Feb 2020 14:19:43 -0500 Subject: [PATCH 786/854] Revert "small link fixes" This reverts commit 77af7d261a191dab0567458562ea8a9f4dacfce1. --- chapters/data-collection.tex | 14 +++++++------- chapters/sampling-randomization-power.tex | 2 +- 2 files changed, 8 insertions(+), 8 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 04d1ef8eb..6a5d8e2ee 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -204,7 +204,7 @@ \subsection{Developing a data collection instrument} such as from the World Bank's Living Standards Measurement Survey.\cite{glewwe2000designing} The focus of this section is the design of electronic field surveys, often referred to as Computer Assisted Personal Interviews (CAPI).\sidenote{ - \url{https://dimewiki.worldbank.org/Computer-Assisted_Personal_Interviews_(CAPI)}} + \url{https://dimewiki.worldbank.org/Computer-Assisted\_Personal\_Interviews\_(CAPI)}} Although most surveys are now collected electronically, by tablet, mobile phone or web browser, \textbf{questionnaire design}\sidenote{ \url{https://dimewiki.worldbank.org/Questionnaire_Design}} @@ -563,11 +563,11 @@ \section{Collecting and sharing data securely} All sensitive data must be handled in a way where there is no risk that anyone who is not approved by an Institutional Review Board (IRB)\sidenote{ - \url{https://dimewiki.worldbank.org/IRB_Approval}} + \url{https://dimewiki.worldbank.org/IRB\_Approval}} for the specific project has the ability to access the data. 
Data can be sensitive for multiple reasons, but the two most common reasons are that it contains personally identifiable information (PII)\sidenote{ - \url{https://dimewiki.worldbank.org/Personally_Identifiable_Information_(PII)}} + \url{https://dimewiki.worldbank.org/Personally\_Identifiable\_Information\_(PII)}} or that the partner providing the data does not want it to be released. Central to data security is \index{encryption}\textbf{data encryption}, which is a group @@ -614,7 +614,7 @@ \subsection{Collecting data securely} In field surveys, most common data collection software will automatically encrypt all data in transit (i.e., upload from field or download from server).\sidenote{ - \url{https://dimewiki.worldbank.org/Encryption\#Encryption_in_Transit}} + \url{https://dimewiki.worldbank.org/Encryption\#Encryption\_in\_Transit}} If this is implemented by the software you are using, then your data will be encrypted from the time it leaves the device (in tablet-assisted data collection) or browser (in web data collection), @@ -628,7 +628,7 @@ \subsection{Collecting data securely} Even though your data is therefore usually safe while it is being transmitted, it is not automatically secure when it is being stored. \textbf{Encryption at rest}\sidenote{ - \url{https://dimewiki.worldbank.org/Encryption\#Encryption_at_Rest}} + \url{https://dimewiki.worldbank.org/Encryption\#Encryption\_at\_Rest}} is the only way to ensure that PII data remains private when it is stored on a server on the internet. You must keep your data encrypted on the data collection server whenever PII data is collected. @@ -652,7 +652,7 @@ \subsection{Collecting data securely} never pass through the hands of a third party, including the data storage application. Most survey software implement \textbf{asymmetric encryption}\sidenote{ - \url{https://dimewiki.worldbank.org/Encryption\#Asymmetric_Encryption}} + \url{https://dimewiki.worldbank.org/Encryption\#Asymmetric\_Encryption}} where there are two keys in a public/private key pair. Only the private key can be used to decrypt the encrypted data, and the public key can only be used to encrypt the data. @@ -679,7 +679,7 @@ \subsection{Storing data securely} from the data collection device to the data collection server, it is not practical once you start interacting with the data. Instead, we use \textbf{symmetric encryption}\sidenote{ - \url{https://dimewiki.worldbank.org/Encryption\#Symmetric_Encryption}} + \url{https://dimewiki.worldbank.org/Encryption\#Symmetric\_Encryption}} where we create a secure encrypted folder, using, for example, VeraCrypt.\sidenote{\url{https://www.veracrypt.fr}} Here, a single key is used to both encrypt and decrypt the information. diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 2deab72fc..3d2c4e4b8 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -511,7 +511,7 @@ \subsection{Randomization inference} and it is interpretable as the probability that a program with no effect would have given you a result like the one actually observed. These randomization inference\sidenote{ - \url{https://dimewiki.worldbank.org/Randomization_Inference}} + \url{https://dimewiki.worldbank.org/Randomization\_Inference}} significance levels may be very different than those given by asymptotic confidence intervals, particularly in small samples (up to several hundred clusters). 
From 0927e953228cdf8b6f9dce5865f7f838684e0aad Mon Sep 17 00:00:00 2001 From: Luiza Date: Mon, 24 Feb 2020 13:41:08 -0500 Subject: [PATCH 787/854] Revert "small fixes to links" This reverts commit d16ca547bd419f54561115bd6e545b2b02366456. --- appendix/stata-guide.tex | 2 +- chapters/data-analysis.tex | 34 ++++++++++----------- chapters/handling-data.tex | 60 +++++++++++++++++++------------------- chapters/publication.tex | 6 ++-- 4 files changed, 51 insertions(+), 51 deletions(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index 5b3b41904..f6f3bf733 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -425,7 +425,7 @@ \subsection{Saving data} If there is a unique ID variable or a set of ID variables, the code should test that they are uniqueally and fully identifying the data set.\sidenote{ - \url{https://dimewiki.worldbank.org/ID_Variable_Properties}} + \url{https://dimewiki.worldbank.org/ID\_Variable\_Properties}} ID variables are also perfect variables to sort on, and to \texttt{order} first in the data set. diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index e1b634b91..d868afaee 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -70,7 +70,7 @@ \subsection{Organizing your folder structure} Standardizing folders greatly reduces the costs that PIs and RAs face when switching between projects, because they are all organized in exactly the same way and use the same filepaths, shortcuts, and macro references.\sidenote{ - \url{https://dimewiki.worldbank.org/DataWork_Folder}} + \url{https://dimewiki.worldbank.org/DataWork\_Folder}} We created \texttt{iefolder} based on our experience with primary data, but it can be used for different types of data, and adapted to fit different needs. @@ -78,7 +78,7 @@ \subsection{Organizing your folder structure} the principle of creating a single unified standard remains. At the top level of the structure created by \texttt{iefolder} are what we call ``round'' folders.\sidenote{ - \url{https://dimewiki.worldbank.org/DataWork_Survey_Round}} + \url{https://dimewiki.worldbank.org/DataWork\_Survey\_Round}} You can think of a ``round'' as a single source of data, which will all be cleaned using a single script. Inside each round folder, there are dedicated folders for: @@ -87,7 +87,7 @@ \subsection{Organizing your folder structure} The folders that hold code are organized in parallel to these, so that the progression through the whole project can be followed by anyone new to the team. Additionally, \texttt{iefolder} creates \textbf{master do-files}\sidenote{ - \url{https://dimewiki.worldbank.org/Master_Do-files}} + \url{https://dimewiki.worldbank.org/Master\_Do-files}} so the structure of all project code is reflected in a top-level script. \subsection{Breaking down tasks} @@ -232,7 +232,7 @@ \section{De-identifying research data} \section{Cleaning data for analysis} Data cleaning is the second stage in the transformation of data you received into data that you can analyze.\sidenote{\url{ -https://dimewiki.worldbank.org/Data_Cleaning}} +https://dimewiki.worldbank.org/Data\_Cleaning}} The cleaning process involves (1) making the data set easily usable and understandable, and (2) documenting individual data points and patterns that may bias the analysis. The underlying data structure does not change. @@ -249,12 +249,12 @@ \subsection{Correcting data entry errors} There are two main cases when the raw data will be modified during data cleaning. 
The first one is when there are duplicated entries in the data. -Ensuring that observations are uniquely and fully identified\sidenote{\url{https://dimewiki.worldbank.org/ID_Variable_Properties}} +Ensuring that observations are uniquely and fully identified\sidenote{\url{https://dimewiki.worldbank.org/ID\_Variable\_Properties}} is possibly the most important step in data cleaning. Modern survey tools create unique observation identifiers. That, however, is not the same as having a unique ID variable for each individual in the sample. You want to make sure the data set has a unique ID variable -that can be cross-referenced with other records, such as the master data set\sidenote{\url{https://dimewiki.worldbank.org/Master_Data_Set}} +that can be cross-referenced with other records, such as the master data set\sidenote{\url{https://dimewiki.worldbank.org/Master\_Data\_Set}} and other rounds of data collection. \texttt{ieduplicates} and \texttt{iecompdup}, two Stata commands included in the \texttt{iefieldkit} @@ -331,12 +331,12 @@ \subsection{Labeling, annotating, and finalizing clean data} Applying labels makes it easier to understand what the data mean as you explore it, and thus reduces the risk of small errors making their way through into the analysis stage. Variable and value labels should be accurate and concise.\sidenote{ - \url{https://dimewiki.worldbank.org/Data_Cleaning\#Applying_Labels}} + \url{https://dimewiki.worldbank.org/Data\_Cleaning\#Applying\_Labels}} Applying labels makes it easier to understand what the data is showing while exploring the data. This minimizes the risk of small errors making their way through into the analysis stage. -Variable and value labels should be accurate and concise.\sidenote{\url{https://dimewiki.worldbank.org/Data_Cleaning\#Applying_Labels}} +Variable and value labels should be accurate and concise.\sidenote{\url{https://dimewiki.worldbank.org/Data\_Cleaning\#Applying\_Labels}} Third, recodes should be used to turn codes for ``Don't know'', ``Refused to answer'', and -other non-responses into extended missing values.\sidenote{\url{https://dimewiki.worldbank.org/Data_Cleaning\#Survey_Codes_and_Missing_Values}} +other non-responses into extended missing values.\sidenote{\url{https://dimewiki.worldbank.org/Data\_Cleaning\#Survey\_Codes\_and\_Missing\_Values}} String variables that correspond to categorical variables need to be encoded. Open-ended responses stored as strings usually have a high risk of being identifiers, so they should be dropped at this point. @@ -351,7 +351,7 @@ \subsection{Documenting data cleaning} including enumerator manuals, survey instruments, supervisor notes, and data quality monitoring reports. 
These materials are essential for data documentation.\sidenote{ - \url{https://dimewiki.worldbank.org/Data_Documentation}} + \url{https://dimewiki.worldbank.org/Data\_Documentation}} \index{Documentation} They should be stored in the corresponding \texttt{Documentation} folder for easy access, as you will probably need them during analysis, @@ -393,10 +393,10 @@ \section{Constructing final indicators} During this process, the data points will typically be reshaped and aggregated so that level of the data set goes from the unit of observation (one item in the bundle) in the survey to the unit of analysis (the household),\sidenote{ - \url{https://dimewiki.worldbank.org/Unit_of_Observation}} + \url{https://dimewiki.worldbank.org/Unit\_of\_Observation}} so that the level of the data set goes from the unit of observation (one item in the bundle) in the survey to the unit of analysis (the household).\sidenote{ -\url{https://dimewiki.worldbank.org/Unit_of_Observation}} +\url{https://dimewiki.worldbank.org/Unit\_of\_Observation}} A constructed data set is built to answer an analysis question. Since different pieces of analysis may require different samples, @@ -548,8 +548,8 @@ \section{Writing data analysis code} \index{data analysis} Many introductions to common code skills and analytical frameworks exist, such as \textit{R for Data Science};\sidenote{\url{https://r4ds.had.co.nz}} -\textit{A Practical Introduction to Stata};\sidenote{\url{https://scholar.harvard.edu/files/mcgovern/files/practical_introduction_to_stata.pdf}} -\textit{Mostly Harmless Econometrics};\sidenote{\url{https://www.researchgate.net/publication/51992844_Mostly_Harmless_Econometrics_An_Empiricist's_Companion}} +\textit{A Practical Introduction to Stata};\sidenote{\url{https://scholar.harvard.edu/files/mcgovern/files/practical\_introduction\_to\_stata.pdf}} +\textit{Mostly Harmless Econometrics};\sidenote{\url{https://www.researchgate.net/publication/51992844\_Mostly\_Harmless\_Econometrics\_An\_Empiricist's\_Companion}} and \textit{Causal Inference: The Mixtape}.\sidenote{\url{https://scunning.com/mixtape.html}} This section will not include instructions on how to conduct specific analyses. That is a research question, and requires expertise beyond the scope of this book. @@ -607,7 +607,7 @@ \subsection{Organizing analysis code} \subsection{Visualizing data} -\textbf{Data visualization}\sidenote{\url{https://dimewiki.worldbank.org/Data_visualization}} \index{data visualization} +\textbf{Data visualization}\sidenote{\url{https://dimewiki.worldbank.org/Data\_visualization}} \index{data visualization} is increasingly popular, and is becoming a field in its own right.\cite{healy2018data,wilke2019fundamentals} Whole books have been written on how to create good data visualizations, so we will not attempt to give you advice on it. @@ -710,8 +710,8 @@ \subsection{Exporting analysis outputs} This means it should be easy to read and understand them with only the information they contain. 
Make sure labels and notes cover all relevant information, such as sample, unit of observation, unit of measurement and variable definition.\sidenote{ - \url{https://dimewiki.worldbank.org/Checklist:_Reviewing_Graphs} \\ - \url{https://dimewiki.worldbank.org/Checklist:_Submit_Table}} + \url{https://dimewiki.worldbank.org/Checklist:\_Reviewing\_Graphs} \\ + \url{https://dimewiki.worldbank.org/Checklist:\_Submit\_Table}} If you follow the steps outlined in this chapter, most of the data work involved in the last step of the research process diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 3d18543f9..11615f766 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -43,16 +43,16 @@ \section{Protecting confidence in development research} Development researchers should take these concerns seriously. Many development research projects are purpose-built to address specific questions, and often use unique data or small samples. -As a result, it is often the case that the data +As a result, it is often the case that the data researchers use for such studies has never been reviewed by anyone else, -so it is hard for others to verify that it was +so it is hard for others to verify that it was collected, handled, and analyzed appropriately. -Reproducible and transparent methods are key to maintaining credibility -and avoiding serious errors. -This is particularly true for research that relies on original or novel data sources, -from innovative big data sources to surveys. -The field is slowly moving in the direction of requiring greater transparency. +Reproducible and transparent methods are key to maintaining credibility +and avoiding serious errors. +This is particularly true for research that relies on original or novel data sources, +from innovative big data sources to surveys. +The field is slowly moving in the direction of requiring greater transparency. Major publishers and funders, most notably the American Economic Association, have taken steps to require that code and data are accurately reported, cited, and preserved as outputs in themselves.\sidenote{ @@ -61,20 +61,20 @@ \section{Protecting confidence in development research} \subsection{Research reproducibility} -Can another researcher reuse the same code on the same data +Can another researcher reuse the same code on the same data and get the exact same results as in your published paper?\sidenote{ \url{https://blogs.worldbank.org/impactevaluations/what-development-economists-talk-about-when-they-talk-about-reproducibility}} -This is a standard known as \textbf{computational reproducibility}, +This is a standard known as \textbf{computational reproducibility}, and it is an increasingly common requirement for publication.\sidenote{ \url{https://www.nap.edu/resource/25303/R&R.pdf}}) -It is best practice to verify computational reproducibility before submitting a paper before publication. -This should be done by someone who is not on your research team, on a different computer, -using exactly the package of code and data files you plan to submit with your paper. -Code that is well-organized into a master script, and written to be easily run by others, -makes this task simpler. -The next chapter discusses organization of data work in detail. - -For research to be reproducible, +It is best practice to verify computational reproducibility before submitting a paper before publication. 
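A hedged sketch of what such a verification run might look like in Stata is below; the folder path and script names are hypothetical and should mirror your own master script.

    * Minimal reproducibility-check sketch (paths and file names are hypothetical)
    clear all
    set more off
    version 15.1        // fix the Stata version used for the original results
    set seed 650248     // fix the seed if any step involves randomness

    * Point to the replication package on the reviewer's machine
    global package "C:/replication-package"

    * Run the full workflow, from raw data to final outputs, in a single pass
    do "${package}/code/1-cleaning.do"
    do "${package}/code/2-construction.do"
    do "${package}/code/3-analysis.do"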
+This should be done by someone who is not on your research team, on a different computer, +using exactly the package of code and data files you plan to submit with your paper. +Code that is well-organized into a master script, and written to be easily run by others, +makes this task simpler. +The next chapter discusses organization of data work in detail. + +For research to be reproducible, all code files for data cleaning, construction and analysis should be public, unless they contain identifying information. Nobody should have to guess what exactly comprises a given index, @@ -169,7 +169,7 @@ \subsection{Research transparency} to record the decision process leading to changes and additions, track and register discussions, and manage tasks. These are flexible tools that can be adapted to different team and project dynamics. -Services that log your research process can show things like modifications made in response to referee comments, +Services that log your research process can show things like modifications made in response to referee comments, by having tagged version histories at each major revision. They also allow you to use issue trackers to document the research paths and questions you may have tried to answer @@ -247,7 +247,7 @@ \section{Ensuring privacy and security in research data} \index{data collection} This includes names, addresses, and geolocations, and extends to personal information such as email addresses, phone numbers, and financial information.\index{geodata}\index{de-identification} -It is important to keep in mind data privacy principles not only for the respondent +It is important to keep in mind data privacy principles not only for the respondent but also the PII data of their household members or other individuals who are covered under the survey. \index{privacy} In some contexts this list may be more extensive -- @@ -310,19 +310,19 @@ \subsection{Obtaining ethical approval and consent} before it happens, as it is happening, or after it has already happened. It also means that they must explicitly and affirmatively consent to the collection, storage, and use of their information for any purpose. -Therefore, the development of appropriate consent processes is of primary importance.\sidenote{ - \url{https://www.icpsr.umich.edu/icpsrweb/content/datamanagement/confidentiality/conf-language.html}} +Therefore, the development of appropriate consent processes is of primary importance.\sidenote{url\ + {https://www.icpsr.umich.edu/icpsrweb/content/datamanagement/confidentiality/conf-language.html}} All survey instruments must include a module in which the sampled respondent grants informed consent to participate. -Research participants must be informed of the purpose of the research, +Research participants must be informed of the purpose of the research, what their participation will entail in terms of duration and any procedures, -any foreseeable benefits or risks, +any foreseeable benefits or risks, and how their identity will be protected.\sidenote{\url {https://www.icpsr.umich.edu/icpsrweb/content/datamanagement/confidentiality/conf-language.html}} There are special additional protections in place for vulnerable populations, such as minors, prisoners, and people with disabilities, and these should be confirmed with relevant authorities if your research includes them. -IRB approval should be obtained well before any data is acquired. +IRB approval should be obtained well before any data is acquired. 
IRBs may have infrequent meeting schedules or require several rounds of review for an application to be approved. If there are any deviations from an approved plan or expected adjustments, @@ -369,7 +369,7 @@ \subsection{Transmitting and storing data securely} enough to rely service providers' on-the-fly encryption as they need to keep a copy of the decryption key to make it automatic. When confidential data is stored on a local computer it must always remain encrypted, and confidential data may never be sent unencrypted -over email, WhatsApp, or other chat services. +over email, WhatsApp, or other chat services. The easiest way to reduce the risk of leaking personal information is to use it as rarely as possible. It is often very simple to conduct planning and analytical work @@ -418,10 +418,10 @@ \subsection{De-identifying data} You can take simple steps to avoid risks by minimizing the handling of PII. First, only collect information that is strictly needed for the research. Second, avoid the proliferation of copies of identified data. -There should never be more than one copy of the raw identified dataset in the project folder, +There should never be more than one copy of the raw identified dataset in the project folder, and it must always be encrypted. -Even within the research team, -access to PII data should be limited to team members who require it for specific analysis +Even within the research team, +access to PII data should be limited to team members who require it for specific analysis (most analysis will not depend on PII). Analysis that requires PII data is rare and can be avoided by properly linking identifiers to research information @@ -438,11 +438,11 @@ \subsection{De-identifying data} -- even if that data has had all directly identifying information removed -- by using some other data that becomes identifying when analyzed together. For this reason, we recommend de-identification in two stages. -The \textbf{initial de-identification} process strips the data of direct identifiers +The \textbf{initial de-identification} process strips the data of direct identifiers as early in the process as possible, to create a working de-identified dataset that can be shared \textit{within the research team} without the need for encryption. -This simplifies workflows. +This simplifies workflows. The \textbf{final de-identification} process involves making a decision about the trade-off between risk of disclosure and utility of the data diff --git a/chapters/publication.tex b/chapters/publication.tex index c1bc94308..32993eb11 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -313,9 +313,9 @@ \subsection{Publishing data for replication} includes a Microdata Catalog\sidenote{ \url{https://microdata.worldbank.org}} where researchers can publish data and documentation for their projects.\sidenote{ -\url{https://dimewiki.worldbank.org/Microdata_Catalog} +\url{https://dimewiki.worldbank.org/Microdata\_Catalog} \newline -\url{https://dimewiki.worldbank.org/Checklist:_Microdata_Catalog_submission} +\url{https://dimewiki.worldbank.org/Checklist:\_Microdata\_Catalog\_submission} } The Harvard Dataverse\sidenote{ \url{https://dataverse.harvard.edu}} @@ -395,7 +395,7 @@ \subsection{Publishing data for replication} There are a number of tools developed to help researchers de-identify data and which you should use as appropriate at that stage of data collection. 
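Much of the initial de-identification described above can also be scripted directly; a minimal Stata sketch, with hypothetical variable and folder names, might be:

    * Minimal initial de-identification sketch (variable and folder names are hypothetical)
    use "${EncryptedRaw}/survey_raw.dta", clear

    * Confirm the research ID still uniquely identifies the data
    isid hhid

    * Drop direct identifiers that were needed only for fieldwork
    drop respondent_name phone_number gps_latitude gps_longitude

    save "${IntermediateData}/survey_deidentified.dta", replace

Dedicated scanning tools can then help flag any identifying variables a script like this misses.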
These include \texttt{PII\_detection}\sidenote{ - \url{https://github.com/PovertyAction/PII_detection}} + \url{https://github.com/PovertyAction/PII\_detection}} from IPA, \texttt{PII-scan}\sidenote{ \url{https://github.com/J-PAL/PII-Scan}} From 64db37f38b71890fa64c9b4503f87e0b72cf1407 Mon Sep 17 00:00:00 2001 From: Maria Date: Mon, 24 Feb 2020 16:38:40 -0500 Subject: [PATCH 788/854] Update chapters/data-analysis.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- chapters/data-analysis.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 9ec1dc470..b340d4c78 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -253,7 +253,7 @@ \subsection{Identifying the identifier} It may be the case that the variable expected to be the unique identifier in fact is either incomplete or contains duplicates. This could be due to duplicate observations or errors in data entry. It could also be the case that there is no identifying variable, or the identifier is a long string, such as a name. -In this case cleaning begins by carefully creating a numeric variable that uniquely identifies the data. +In this case cleaning begins by carefully creating a variable that uniquely identifies the data. As discussed in the previous chapter, checking for duplicated entries is usually part of data quality monitoring, and is ideally addressed as soon as data is received From a000aea0ba25a7fcdcb7786621a57bb67dd271f8 Mon Sep 17 00:00:00 2001 From: Maria Date: Mon, 24 Feb 2020 16:38:51 -0500 Subject: [PATCH 789/854] Update chapters/data-analysis.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- chapters/data-analysis.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index b340d4c78..584f669a5 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -258,7 +258,7 @@ \subsection{Identifying the identifier} checking for duplicated entries is usually part of data quality monitoring, and is ideally addressed as soon as data is received -Note that while modern survey tools create unique observation identifiers, +Note that while modern survey tools create unique identifiers for each submitted data record, that is not the same as having a unique ID variable for each individual in the sample. You want to make sure the data set has a unique ID variable that can be cross-referenced with other records, such as the master data set\sidenote{\url{https://dimewiki.worldbank.org/Master\_Data\_Set}} From 9161988bc5e5968e8a2cbd10438dc0d4b17251fe Mon Sep 17 00:00:00 2001 From: Maria Date: Mon, 24 Feb 2020 16:39:14 -0500 Subject: [PATCH 790/854] Update chapters/data-analysis.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- chapters/data-analysis.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 584f669a5..400a3a04c 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -448,7 +448,7 @@ \subsection{Constructing analytical variables} Before constructing new variables, you must check and double-check the value-assignments of questions, -as well as the units and scales +as well as the units and scales. 
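A minimal sketch of such checks in Stata, with hypothetical variable names and units, might be:

    * Minimal pre-construction checks (variable names and units are hypothetical)
    * Confirm the codes of a categorical variable match the questionnaire
    label list yesno
    assert inlist(q103_employed, 0, 1) | missing(q103_employed)

    * Check that reported plot areas are plausible before converting acres to hectares
    summarize plot_area_acres, detail
    assert plot_area_acres >= 0 & plot_area_acres < 1000 if !missing(plot_area_acres)
    generate plot_area_ha = plot_area_acres * 0.4047
    label variable plot_area_ha "Plot area (hectares)"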
This is when you will use the knowledge of the data and the documentation you acquired during cleaning. For example, it's possible that the survey instrument asked respondents to report some answers as percentages and others as proportions, From 066166c8168b5b1f3e46932474f654a15814590f Mon Sep 17 00:00:00 2001 From: Maria Date: Mon, 24 Feb 2020 16:41:19 -0500 Subject: [PATCH 791/854] Update chapters/data-collection.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- chapters/data-collection.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 931d16d1d..4698189ca 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -113,7 +113,7 @@ \subsection{Data licensing agreements} Keep in mind that the data owner is likely not familiar with the research process, and therefore may be surprised at some of the things you want to do if you are not clear up front. -You will typically want intellectual property rights to all derivative works developed used the data, +You will typically want intellectual property rights to all research outputs developed used the data, a license for all uses of derivative works, including public distribution (unless ethical considerations contraindicate this). This is important to allow the research team to store, catalog, and publish, in whole or in part, From 72ad9807fbcfdc8bdce069bbc0ad14b9c22f1260 Mon Sep 17 00:00:00 2001 From: Maria Date: Mon, 24 Feb 2020 16:43:45 -0500 Subject: [PATCH 792/854] Update chapters/data-collection.tex --- chapters/data-collection.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 4698189ca..e447d6eb9 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -117,7 +117,7 @@ \subsection{Data licensing agreements} a license for all uses of derivative works, including public distribution (unless ethical considerations contraindicate this). This is important to allow the research team to store, catalog, and publish, in whole or in part, -either the original licensed dataset or the derived materials. +either the original licensed dataset or datasets derived from the original. Make sure that the license you obtain from the data owner allows these uses, and that you consult with the owner if you foresee exceptions with specific portions of the data. From c9da7ceb780e5372b7d863fb49a85ade08ca2209 Mon Sep 17 00:00:00 2001 From: Maria Date: Mon, 24 Feb 2020 16:44:37 -0500 Subject: [PATCH 793/854] Update chapters/publication.tex --- chapters/publication.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 72529f1f0..98af09d33 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -81,7 +81,7 @@ \subsection{Preparing dynamic documents} that a mistake will be made or something will be missed. Therefore this is a broadly unsuitable way to prepare technical documents. 
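In practice, the dynamic workflow simply means that the analysis code always writes its outputs to the same files, and the document loads them from there. A hedged Stata sketch, with hypothetical file names and assuming the user-written \texttt{estout} package for the table, is:

    * Minimal dynamic-output sketch (paths and file names are hypothetical)
    * Every run overwrites the same files, so the document always shows the latest results
    regress outcome treatment
    esttab using "${Outputs}/table1.tex", replace label booktabs

    twoway (scatter outcome covariate), title("Outcome and covariate")
    graph export "${Outputs}/figure1.pdf", replace

The document then only needs to point to \texttt{table1.tex} and \texttt{figure1.pdf}, so re-running the analysis updates the paper with no copying and pasting.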
-However, the most widely utilized software +The most widely utilized software for dynamically managing both text and results is \LaTeX\ (pronounced ``lah-tek'').\sidenote{ \url{https://github.com/worldbank/DIME-LaTeX-Templates}} \index{\LaTeX} From cf38569cc6de39717cdff38305265e3073768ccd Mon Sep 17 00:00:00 2001 From: Maria Date: Mon, 24 Feb 2020 16:44:57 -0500 Subject: [PATCH 794/854] Update chapters/publication.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- chapters/publication.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 98af09d33..728e0795c 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -93,7 +93,7 @@ \subsection{Preparing dynamic documents} Therefore, we recommend that you learn to use \LaTeX\ directly as soon as you are able to and provide several resources for doing so in the next section. -There are also code-based tools that can be used for dynamic documents, +There are tools that can generate dynamic documents from within your scripts, such as R's RMarkdown\sidenote{\url{https://rmarkdown.rstudio.com}} and Stata's \texttt{dyndoc}.\sidenote{\url{https://www.stata.com/manuals/rptdyndoc.pdf}} These tools ``knit'' or ``weave'' text and code together, From e00c43376b8d731d0a1c572b99ef58abdb54578e Mon Sep 17 00:00:00 2001 From: Maria Date: Mon, 24 Feb 2020 16:45:39 -0500 Subject: [PATCH 795/854] Update chapters/publication.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- chapters/publication.tex | 1 + 1 file changed, 1 insertion(+) diff --git a/chapters/publication.tex b/chapters/publication.tex index 728e0795c..e38225d7e 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -111,6 +111,7 @@ \subsection{Preparing dynamic documents} a free online writing tool that allows linkages to files in Dropbox which are automatically updated anytime the file is replaced. They have limited functionality in terms of version control and formatting, +and may never include any references to confidential data, but can be useful for working on informal outputs, such as blogposts, with collaborators who do not code. From 23a5d06778fa9c9168994fb5bd86e02afc3ab17f Mon Sep 17 00:00:00 2001 From: Maria Date: Mon, 24 Feb 2020 16:47:23 -0500 Subject: [PATCH 796/854] Update chapters/publication.tex --- chapters/publication.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index e38225d7e..8e08d9508 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -421,7 +421,7 @@ \subsection{Publishing code for replication} exactly what you have done in order to obtain your results, as well as to apply similar methods in future projects. Therefore it should both be functional and readable -(if you've followed the recommendations in Chapter Six, +(if you've followed the recommendations in this book this should be easy to do!). 
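As a rough illustration, a header like the following, with all details hypothetical, goes a long way toward making a released script readable on its own:

    * Hypothetical header for a released analysis script
    *----------------------------------------------------------------
    * Purpose  : Produces Tables 1-3 and Figure 2 of the paper
    * Inputs   : constructed/analysis_data.dta
    * Outputs  : outputs/table1.tex ... outputs/figure2.pdf
    * Requires : estout (ssc install estout)
    * ID var   : hhid uniquely identifies observations
    *----------------------------------------------------------------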
Code is often not written this way when it is first prepared, so it is important for you to review the content and organization From afb41dd30d82bee887ad57b3f7cc0f123fb306af Mon Sep 17 00:00:00 2001 From: Maria Date: Mon, 24 Feb 2020 17:46:06 -0500 Subject: [PATCH 797/854] Update chapters/data-collection.tex Co-Authored-By: Luiza Andrade --- chapters/data-collection.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index e447d6eb9..291011138 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -136,7 +136,7 @@ \subsection{Data licensing agreements} Research teams that collect their own data must consider the terms under which they will release that data to other researchers or to the general public. -Will you publicly releasing the data in full (removing personal identifiers)? +Will you publicly release the data in full (removing personal identifiers)? Would you be okay with it being stored on servers anywhere in the world, even ones that are owned by corporations or governments in countries other than your own? Would you prefer to decide permission on a case-by-case basis, dependent on specific proposed uses? From 1b460e471d2a758ed5b854c8a8bf3f2853c2bae1 Mon Sep 17 00:00:00 2001 From: Maria Date: Mon, 24 Feb 2020 17:46:21 -0500 Subject: [PATCH 798/854] Update chapters/publication.tex Co-Authored-By: Luiza Andrade --- chapters/publication.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 8e08d9508..dc697e17d 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -49,7 +49,7 @@ \section{Collaborating on technical writing} effective collaboration requires the adoption of tools and practices that enable version control and simultaneous contribution. \textbf{Dynamic documents} are a way to significantly simplify workflows: -updates to the analytical outputs that constitute them +updates to the analytical outputs that appear in these documents, such as tables and figures, can be passed on to the final output with a single process, rather than copy-and-pasted or otherwise handled individually. Managing the writing process in this way From c31ff29f26ad33ac1a209b27c5263db2d2899076 Mon Sep 17 00:00:00 2001 From: Maria Date: Mon, 24 Feb 2020 17:47:46 -0500 Subject: [PATCH 799/854] Update chapters/publication.tex Co-Authored-By: Luiza Andrade --- chapters/publication.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index dc697e17d..ac9d07fde 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -201,7 +201,7 @@ \subsection{Technical writing with \LaTeX} the entire group of writers needs to be comfortable with \LaTeX\ before adopting one of these tools. They can require a lot of troubleshooting at a basic level at first, -and non-technical staff may not be willing or able to acquire the required knowledge. +and staff not used to programming may not be willing or able to acquire the necessary knowledge. Cloud-based implementations of \LaTex\, discussed in the next section, allow teams to take advantage of the features of \LaTeX, without requiring knowledge of the technical details. 
From d04e4f2c77ee70f53306d39e402fad0c441ba8ec Mon Sep 17 00:00:00 2001 From: Maria Date: Mon, 24 Feb 2020 17:48:14 -0500 Subject: [PATCH 800/854] Update chapters/publication.tex Co-Authored-By: Luiza Andrade --- chapters/publication.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index ac9d07fde..1fff3023f 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -225,7 +225,7 @@ \subsection{Getting started with \LaTeX\ in the cloud} Second, they typically maintain a single, continuously synced, master copy of the document so that different writers do not create conflicted or out-of-sync copies, or need to deal with Git themselves to maintain that sync. -Third, they typically allow collaborators to edit in a fashion similar to Google Docs, +Third, they typically allow collaborators to edit documents simultaneously, though different services vary the number of collaborators and documents allowed at each tier. Fourth, and most usefully, some implementations provide a ``rich text'' editor that behaves pretty similarly to familiar tools like Word, From 7b6322ba88a1e6f508553ccedd3c08326faaa84e Mon Sep 17 00:00:00 2001 From: Maria Date: Mon, 24 Feb 2020 17:48:49 -0500 Subject: [PATCH 801/854] Update chapters/publication.tex Co-Authored-By: Luiza Andrade --- chapters/publication.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 1fff3023f..6424c2c59 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -365,7 +365,7 @@ \subsection{Publishing data for replication} so that others can learn from your work and adapt it as they like. \subsection{De-identifying data for publication} -Therefore, before publishing data, +Before publishing data, you should carefully perform a \textbf{final de-identification}. Its objective is to create a dataset for publication that cannot be manipulated or linked to identify any individual research participant. From 0396c7f2763a3dcd4e162600964e24b8bc3928e6 Mon Sep 17 00:00:00 2001 From: Maria Date: Mon, 24 Feb 2020 17:49:10 -0500 Subject: [PATCH 802/854] Update chapters/publication.tex Co-Authored-By: Luiza Andrade --- chapters/publication.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 6424c2c59..0fbcb5fbf 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -483,7 +483,7 @@ \subsection{Releasing a replication package} and allow others to look at alternate versions or histories easily. It is straightforward to simply upload a fixed directory to GitHub apply a sharing license, and obtain a URL for the whole package. -(However, there is a strict size restriction of 100MB per file and +(There is a strict size restriction of 100MB per file and a restriction on the size of the repository as a whole, so larger projects will need alternative solutions.) However, GitHub is not ideal for other reasons. 
From 2c2d7cd720e7e320829bef3c948a62de340d5626 Mon Sep 17 00:00:00 2001 From: Maria Date: Mon, 24 Feb 2020 17:49:20 -0500 Subject: [PATCH 803/854] Update chapters/publication.tex Co-Authored-By: Luiza Andrade --- chapters/publication.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 0fbcb5fbf..7916187bd 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -367,7 +367,7 @@ \subsection{Publishing data for replication} \subsection{De-identifying data for publication} Before publishing data, you should carefully perform a \textbf{final de-identification}. -Its objective is to create a dataset for publication +Its objective is to reduce the risk of disclosing confidential information in the published data set. that cannot be manipulated or linked to identify any individual research participant. If you are following the steps outlined in this book, you have already removed any direct identifiers after collecting the data. From 9b67d499f07a250ae6fa0a1df88d928e7db428bf Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 25 Feb 2020 10:50:57 -0500 Subject: [PATCH 804/854] Update chapters/data-collection.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- chapters/data-collection.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 291011138..6f3b93dbc 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -13,7 +13,7 @@ The chapter begins with a discussion of some key ethical and legal descriptions to ensure that you have the right to do research using a specific dataset. -Particularly when confidential data is being collected at your behest +Particularly when confidential data is collected by you and your team or shared with you by a program implementer, government, or other partner, you need to make sure permissions are correctly granted and documented. Clearly establishing ownership and licensing of all information protects From 4aa5ec3186f697a9b1ed4dcb9bb037e9e7477c9f Mon Sep 17 00:00:00 2001 From: Maria Date: Tue, 25 Feb 2020 11:58:48 -0500 Subject: [PATCH 805/854] Update publication.tex disadvantages of cloud-based latex --- chapters/publication.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 7916187bd..3b80d3c9a 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -236,8 +236,8 @@ \subsection{Getting started with \LaTeX\ in the cloud} without needing to know a lot of the code that controls document formatting. Cloud-based implementations of \LaTeX\ also have disadvantages. -There is a small amount of up-front learning required, -continous access to the Internet is necessary, +There is still some up-front learning required, unless you're using the rich text editor. +Continuous access to the Internet is necessary, and updating figures and tables requires a bulk file upload that is tough to automate. A common problem you will face using online editors is special characters which, because of code functions, need to be handled differently than in Word. 
From cc13a138fc94fd7df3dc6a9af8047e88ec7a3a8b Mon Sep 17 00:00:00 2001 From: Maria Date: Tue, 25 Feb 2020 12:01:29 -0500 Subject: [PATCH 806/854] Update publication.tex removed latex special characters text --- chapters/publication.tex | 6 ------ 1 file changed, 6 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 3b80d3c9a..b77334b49 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -239,12 +239,6 @@ \subsection{Getting started with \LaTeX\ in the cloud} There is still some up-front learning required, unless you're using the rich text editor. Continuous access to the Internet is necessary, and updating figures and tables requires a bulk file upload that is tough to automate. -A common problem you will face using online editors is special characters -which, because of code functions, need to be handled differently than in Word. -Most critically, the ampersand (\texttt{\&}), percent (\texttt{\%}), and underscore (\texttt{\_}) -need to be ``escaped'' (interpreted as text and not code) in order to render. -This is done by by writing a backslash (\texttt{\textbackslash}) before them, -such as writing \texttt{40\textbackslash\%} for the percent sign to appear in text. Despite this, we believe that with minimal learning and workflow adjustments, cloud-based implementations are often the easiest way to allow coauthors to write and edit in \LaTeX\, so long as you make sure you are available to troubleshoot minor issues like these. From 90a7e953f8ab0a71754e1ad75188bee681ab764f Mon Sep 17 00:00:00 2001 From: Maria Date: Tue, 25 Feb 2020 12:04:09 -0500 Subject: [PATCH 807/854] Update publication.tex added dime collection on microdata --- chapters/publication.tex | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index b77334b49..086427e3a 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -314,7 +314,10 @@ \subsection{Publishing data for replication} The Datahub for Field Experiments in Economics and Public Policy\sidenote{\url{https://dataverse.harvard.edu/dataverse/DFEEP}} is especially relevant for impact evaluations. Both the World Bank Microdata Catalog and the Harvard Dataverse -create data citations for deposited entries. +create data citations for deposited entries. +DIME has its own collection of data sets in the Microdata Catalog, +where data from our projects is published.\sidenote{\url{ + https://microdata.worldbank.org/catalog/dime}} When your raw data is owned by someone else, or for any other reason you are not able to publish it, From 4a75b156edd45d4059ada18a5ce30972f7256a7f Mon Sep 17 00:00:00 2001 From: Maria Date: Tue, 25 Feb 2020 12:10:33 -0500 Subject: [PATCH 808/854] Update publication.tex revisions to data publication --- chapters/publication.tex | 18 +++++++----------- 1 file changed, 7 insertions(+), 11 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 086427e3a..ff5bdd1e6 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -273,11 +273,11 @@ \subsection{Publishing data for replication} Publicly documenting all original data generated as part of a research project is an important contribution in its own right. -Your paper should clearly cite the data used, -where and how it is stored, and how and under what circumstances it may be accessed. -You may not be able to publish the data itself, -due to licensing agreements or ethical concerns. 
-Even if you cannot release data immediately or publicly, +Publishing original datasets is a significant contribution that can be made +in addition to any publication of analysis results.\sidenote{ + \url{https://www.povertyactionlab.org/sites/default/files/resources/J-PAL-guide-to-publishing-research-data.pdf}} +If you are not able to publish the data itself, +due to licensing agreements or ethical concerns, there are often options to catalog or archive the data. These may take the form of metadata catalogs or embargoed releases. Such setups allow you to hold an archival version of your data @@ -288,12 +288,8 @@ \subsection{Publishing data for replication} They can also provide for timed future releases of datasets once the need for exclusive access has ended. -If your project collected original data, -releasing the cleaned dataset is a significant contribution that can be made -in addition to any publication of analysis results.\sidenote{ - \url{https://www.povertyactionlab.org/sites/default/files/resources/J-PAL-guide-to-publishing-research-data.pdf}} -It allows other researchers to validate the mechanical construction of your results, -to investigate what other results might be obtained from the same population, +Publishing data allows other researchers to validate the mechanical construction of your results, +investigate what other results might be obtained from the same population, and test alternative approaches or answer other questions. This fosters collaboration and may enable your team to fully explore variables and questions that you may not have time to focus on otherwise. From 09b05010f5d0473fbcec302d63493cdf38c52987 Mon Sep 17 00:00:00 2001 From: Maria Date: Tue, 25 Feb 2020 12:11:45 -0500 Subject: [PATCH 809/854] Update publication.tex why not publish replication packages on github --- chapters/publication.tex | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index ff5bdd1e6..f3c02e1dd 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -479,12 +479,12 @@ \subsection{Releasing a replication package} (There is a strict size restriction of 100MB per file and a restriction on the size of the repository as a whole, so larger projects will need alternative solutions.) -However, GitHub is not ideal for other reasons. -It is not built to hold data in an efficient way -or to manage licenses. -It does not provide a true archive service -- -you can change or remove the contents at any time. -It does not assign a permanent digital object identifier (DOI) link for your work. +However, GitHub is not the ideal platform to release reproducibility packages. +It is built to version control code, and to facilitate collaboration on it. +Features to look for in a platform to release such packages and that are not offered by GitHub, include: +the possibility to store data and documentation as well as code, +the creation of a static copy of its content, that cannot be changed or removed, +and the assignment of a permanent digital object identifier (DOI) link. 
Another option is the Harvard Dataverse,\sidenote{ \url{https://dataverse.harvard.edu}} From 946f782b993ca5188eba496b26f743f952497a90 Mon Sep 17 00:00:00 2001 From: Maria Date: Tue, 25 Feb 2020 12:15:00 -0500 Subject: [PATCH 810/854] Update data-analysis.tex revised text on extended missing values --- chapters/data-analysis.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 400a3a04c..f9e662e80 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -288,8 +288,8 @@ \subsection{Labeling, annotating, and finalizing clean data} Variable and value labels should be accurate and concise.\sidenote{ \url{https://dimewiki.worldbank.org/Data\_Cleaning\#Applying\_Labels}} -Third, \textbf{recoding}: codes for ``Don't know'', ``Refused to answer'', and -other non-responses should be recoded into extended missing values.\sidenote{ +Third, \textbf{recoding}: survey codes for ``Don't know'', ``Refused to answer'', and +other non-responses must be removed but records of them should still be kept. In Stata that can elegantly be done using extended missing values.\sidenote{ \url{https://dimewiki.worldbank.org/Data\_Cleaning\#Survey\_Codes\_and\_Missing\_Values}} String variables that correspond to categorical variables should be encoded. Open-ended responses stored as strings usually have a high risk of being identifiers, From 39697c72b610892c6b3a01ac0708b8f19d985ef1 Mon Sep 17 00:00:00 2001 From: Maria Date: Tue, 25 Feb 2020 12:18:07 -0500 Subject: [PATCH 811/854] Update chapters/introduction.tex Co-Authored-By: Luiza Andrade --- chapters/introduction.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index 132317cc0..c22d2711b 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -163,7 +163,7 @@ \section{Writing reproducible code in a collaborative environment} Elements like spacing, indentation, and naming (or lack thereof) can make your code much more (or much less) accessible to someone who is reading it for the first time and needs to understand it quickly and correctly. -/subsection{Code examples} +\subsection{Code examples} For some implementation portions where precise code is particularly important, we will provide minimal code examples either in the book or on the DIME Wiki. All code guidance is software-agnostic, but code examples are provided in Stata. From a6c7de20eed59c5d6b545284c9dd292adce73d4a Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 25 Feb 2020 13:23:28 -0500 Subject: [PATCH 812/854] Update chapters/introduction.tex Co-Authored-By: Luiza Andrade --- chapters/introduction.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index c22d2711b..ad9b99c8e 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -18,7 +18,7 @@ This book is targeted to everyone who interacts with development data: graduate students, research assistants, policymakers, and empirical researchers. It covers data workflows at all stages of the research process, from design to data acquisition and analysis. -This book is not sector-specific; it will not teach you econometrics, or how to design an impact evaluation. +Its content is not sector-specific; it will not teach you econometrics, or how to design an impact evaluation. There are many excellent existing resources on those topics. 
Instead, this book will teach you how to think about all aspects of your research from a data perspective, how to structure research projects to maximize data quality, From 7debf2059bb9053a0beb6aca3038451cf23a4aee Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 25 Feb 2020 13:26:57 -0500 Subject: [PATCH 813/854] As you gain experience in coding --- chapters/introduction.tex | 20 +++++++++++++++++--- 1 file changed, 17 insertions(+), 3 deletions(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index ad9b99c8e..e1ba5c5cb 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -125,11 +125,11 @@ \section{Writing reproducible code in a collaborative environment} As this is fundamental to the remainder of the book's content, we provide here a brief introduction to \textbf{``good'' code} and \textbf{process standardization}. -``Good'' code has two elements: (1) it is correct, i.e. it doesn't produce any errors, -and (2) it is useful and comprehensible to someone who hasn't seen it before +``Good'' code has two elements: (1) it is correct, i.e. it doesn't produce any errors, +and (2) it is useful and comprehensible to someone who hasn't seen it before (or even yourself a few weeks, months or years later). Many researchers have been trained to code correctly. -However, when your code runs on your computer and you get the correct results, +However, when your code runs on your computer and you get the correct results, you are only half-done writing \textit{good} code. Good code is easy to read and replicate, making it easier to spot mistakes. Good code reduces sampling, randomization, and cleaning errors. @@ -163,6 +163,20 @@ \section{Writing reproducible code in a collaborative environment} Elements like spacing, indentation, and naming (or lack thereof) can make your code much more (or much less) accessible to someone who is reading it for the first time and needs to understand it quickly and correctly. +As you gain experience in coding +and get more confident with the way you implement these suggestions, +you will feel more empowered to apply critical thinking to the way you handle data. +For example, you will be able to predict which section +of your script are more likely to create errors. +This may happen intuitively, but you will improve much faster as a coder +if you do it purposefully. +Ask yourself, as you write code and explore results: +Do I believe this number? +What can go wrong in my code? +How will missing values be treated in this command? +What would happen if more observations would be added to the data set? +Can my code be made more efficient or easier to understand? + \subsection{Code examples} For some implementation portions where precise code is particularly important, we will provide minimal code examples either in the book or on the DIME Wiki. 
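For instance, the question above about missing values matters because of details like the following (variable names hypothetical):

    * Hypothetical example: Stata treats missing values as larger than any number,
    * so this line would wrongly flag respondents with missing income as high earners
    replace high_earner = 1 if income > 100000
    * Making the handling of missing values explicit avoids the mistake
    replace high_earner = 1 if income > 100000 & !missing(income)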
From 50ca90d3cd1c2839d4bc5a0d9f26a110459bc812 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 25 Feb 2020 13:30:22 -0500 Subject: [PATCH 814/854] Data work through code --- chapters/introduction.tex | 22 ++++++++++++++++++++-- chapters/planning-data-work.tex | 24 +++++++++++------------- 2 files changed, 31 insertions(+), 15 deletions(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index e1ba5c5cb..d04a4f3ce 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -78,10 +78,28 @@ \section{Doing credible research at scale} \section{Adopting reproducible tools} + +We assume througout all of this book +that you are going to do nearly all of your data work though code. +It may be possible to perform all relevant tasks +through the user interface in some statistical software, +or even through less field-specific software such as Excel. +However, we strongly advise against it. +The reason for that are the transparency, reproducibility and credibility principles +discussed in Chapter 1. +Writing code creates a record of every task you performed. +It also prevents direct interaction +with the data files that could lead to non-reproducible processes. +Think of the code as a recipe to create your results: +other people can follow it, reproduce it, +and even disagree with your the amount of spices you added +(or some of your coding decisions). +For these reasons, code is now considered an essential component of a research output. We will provide free, open-source, and platform-agnostic tools wherever possible, and point to more detailed instructions when relevant. -Stata is the notable exception here due to its current popularity in development economics.\sidenote{ -\url{https://aeadataeditor.github.io/presentation-20191211/\#9}} +Stata is the notable exception here +due to its current popularity in development economics.\sidenote{ + \url{https://aeadataeditor.github.io/presentation-20191211/\#9}} Most tools have a learning and adaptation process, meaning you will become most comfortable with each tool only by using it in real-world work. diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index df41e7a92..f59da9660 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -279,8 +279,6 @@ \section{Organizing code and data} without writing any code, we strongly advise against it. Writing code creates a record of every task you performed. It also prevents direct interaction with the data files that could lead to non-reproducible steps. -Good code, like a good recipe, allows other people to read and replicate it, -and this functionality is now considered an essential component of any research output. You may do some exploratory tasks by point-and-click or typing directly into the console, but anything that is included in a research output must be coded up in an organized fashion so that you can release @@ -360,7 +358,7 @@ \subsection{Organizing files and folder structures} The \texttt{DataWork} folder may be created either inside an existing project-based folder structure, or it may be created separately. It is preferable to create the \texttt{DataWork} folder -separately from the project management materials +separately from the project management materials (such as contracts, Terms of Reference, briefs and other administrative or management work). 
This is so the project folder can be maintained in a synced location like Dropbox, while the code folder can be maintained in a version-controlled location like GitHub. @@ -427,10 +425,10 @@ \subsection{Documenting and organizing code} They all come from the principle that code is an output by itself, not just a means to an end, and should be written thinking of how easy it will be for someone to read it later. -At the end of this section, we include a template for a master script in Stata, +At the end of this section, we include a template for a master script in Stata, to provide a concrete example of the required elements and structure. -Throughout this section, we refer to lines of the example do-file -to give concrete examples of the required code elements, organization and structure. +Throughout this section, we refer to lines of the example do-file +to give concrete examples of the required code elements, organization and structure. Code documentation is one of the main factors that contribute to readability. Start by adding a code header to every file. @@ -439,14 +437,14 @@ \subsection{Documenting and organizing code} but describe in plain language what the code is supposed to do. } that details the functionality of the entire script; -refer to lines 5-10 in the example do-file. +refer to lines 5-10 in the example do-file. This should include simple things such as the purpose of the script and the name of the person who wrote it. If you are using a version control software, the last time a modification was made and the person who made it will be recorded by that software. Otherwise, you should include it in the header. You should always track the inputs and outputs of the script, as well as the uniquely identifying variable; -refer to lines 49-51 in the example do-file. +refer to lines 49-51 in the example do-file. When you are trying to track down which code creates which data set, this will be very helpful. While there are other ways to document decisions related to creating code, the information that is relevant to understand the code should always be written in the code file. @@ -521,20 +519,20 @@ \subsection{Working with a master script} and when it does, it may take time for you to understand what's causing an error. The same applies to changes in data sets and results. -To link code, data and outputs, +To link code, data and outputs, the master script reflects the structure of the \texttt{DataWork} folder in code -through globals (in Stata) or string scalars (in R); -refer to lines 35-40 of the example do-file. +through globals (in Stata) or string scalars (in R); +refer to lines 35-40 of the example do-file. These coding shortcuts can refer to subfolders, so that those folders can be referenced without repeatedly writing out their absolute file paths. Because the \texttt{DataWork} folder is shared by the whole team, its structure is the same in each team member's computer. The only difference between machines should be -the path to the project root folder, i.e. the highest-level shared folder, +the path to the project root folder, i.e. the highest-level shared folder, which in the context of \texttt{iefolder} is the \texttt{DataWork} folder. 
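A hedged sketch of what this looks like in practice, with hypothetical user names and paths, is:

    * Hypothetical root-path setup in a master do-file
    if c(username) == "researcher1" {
        global projectfolder "C:/Users/researcher1/Dropbox/ProjectA"
    }
    if c(username) == "researcher2" {
        global projectfolder "/Users/researcher2/Dropbox/ProjectA"
    }

    * Subfolder globals are defined relative to the root, so they are identical on every machine
    global dataWorkFolder "${projectfolder}/DataWork"
    global rawData        "${dataWorkFolder}/Raw"
    global outputs        "${dataWorkFolder}/Outputs"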
This is reflected in the master script in such a way that the only change necessary to run the entire code from a new computer -is to change the path to the project folder to reflect the filesystem and username; +is to change the path to the project folder to reflect the filesystem and username; refer to lines 27-32 of the example do-file. The code in \texttt{stata-master-dofile.do} shows how folder structure is reflected in a master do-file. Because writing and maintaining a master script can be challenging as a project grows, From e8e02059e79df64f9e1c0ba8f731c9c3b2ac2edc Mon Sep 17 00:00:00 2001 From: Luiza Andrade Date: Tue, 25 Feb 2020 13:31:59 -0500 Subject: [PATCH 815/854] Update chapters/publication.tex --- chapters/publication.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index f3c02e1dd..9c178f1cf 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -95,7 +95,7 @@ \subsection{Preparing dynamic documents} There are tools that can generate dynamic documents from within your scripts, such as R's RMarkdown\sidenote{\url{https://rmarkdown.rstudio.com}} -and Stata's \texttt{dyndoc}.\sidenote{\url{https://www.stata.com/manuals/rptdyndoc.pdf}} +Stata offers a built-in package for dynamic documents, \texttt{dyndoc}\sidenote{\url{https://www.stata.com/manuals/rptdyndoc.pdf}}, and user-written commands such \texttt{texdoc}\sidenote{\url{http://repec.sowi.unibe.ch/stata/texdoc}} and \texttt{markstat}\sidenote{\url{https://data.princeton.edu/stata/markdown}} allow for additional functionalities. These tools ``knit'' or ``weave'' text and code together, and are programmed to insert code outputs in pre-specified locations. Documents called ``notebooks'' (such as Jupyter\sidenote{\url{https://jupyter.org}}) work similarly, From 2e25f2ca1bb6b522023a56f58f99d55a57c2b181 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 25 Feb 2020 13:49:23 -0500 Subject: [PATCH 816/854] Stata and reproducible tools --- chapters/introduction.tex | 45 ++++++++++++++------------------------- 1 file changed, 16 insertions(+), 29 deletions(-) diff --git a/chapters/introduction.tex b/chapters/introduction.tex index d04a4f3ce..a8841dacb 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -94,37 +94,24 @@ \section{Adopting reproducible tools} other people can follow it, reproduce it, and even disagree with your the amount of spices you added (or some of your coding decisions). -For these reasons, code is now considered an essential component of a research output. -We will provide free, open-source, and platform-agnostic tools wherever possible, -and point to more detailed instructions when relevant. -Stata is the notable exception here -due to its current popularity in development economics.\sidenote{ - \url{https://aeadataeditor.github.io/presentation-20191211/\#9}} +Many development researchers come from economics and statistics backgrounds +and often understand code to be a means to an end rather than an output itself. +We believe that this must change somewhat: +in particular, we think that development practitioners +must begin to think about their code and programming workflows +just as methodologically as they think about their research workflows. + Most tools have a learning and adaptation process, meaning you will become most comfortable with each tool only by using it in real-world work. 
-Get to know them well early on, -so that you do not spend a lot of time learning through trial and error. - -Stata is the notable exception here due to its current popularity in development economics. -We focus on Stata-specific tools and instructions in this book. -Hence, we will use the terms ``script'' and ``do-file'' -interchangeably to refer to Stata code throughout. -Stata is primarily a scripting language for statistics and data, -meaning that its users often come from economics and statistics backgrounds -and understand Stata to be encoding a set of tasks as a record for the future. -We believe that this must change somewhat: -in particular, we think that practitioners of Stata -must begin to think about their code and programming workflows -just as methodologically as they think about their research workflows, -and that people who adopt this approach will be dramatically -more capable in their analytical ability. -This means that they will be more productive when managing teams, -and more able to focus on the challenges of experimental design -and econometric analysis, rather than spending excessive time -re-solving problems on the computer. -To support this goal, this book also includes -an introductory Stata Style Guide +To support your process of learning reproducible tools and workflows, +will reference free and open-source tools wherever possible, +and point to more detailed instructions when relevant. +Stata, as a proprietary software, is the notable exception here +due to its current popularity in development economics.\sidenote{ + \url{https://aeadataeditor.github.io/presentation-20191211/\#9}} +This book also includes +the DIME Analytics Stata Style Guide that we use in our work, which provides some new standards for coding so that code styles can be harmonized across teams for easier understanding and reuse of code. @@ -215,7 +202,7 @@ \subsection{Code examples} whenever you do not understand the command that is being used. We hope that these snippets will provide a foundation for your code style. Providing some standardization to Stata code style is also a goal of this team; -we provide our guidance on this in the DIME Analytics Stata Style Guide in the Appendix. +we provide our guidance on this in the Stata Style Guide in the Appendix. \section{Outline of this book} From 44ee2f4efc2ea0ee9c6378d264cd6367b8a8fc4e Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 25 Feb 2020 14:24:15 -0500 Subject: [PATCH 817/854] Update planning-data-work.tex --- chapters/planning-data-work.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index f59da9660..34d79c3f4 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -490,7 +490,7 @@ \subsection{Documenting and organizing code} into separate do-files, since there is no limit on how many you can have, how detailed their names can be, and no advantage to writing longer files. One reasonable rule of thumb is to not write do-files that have more than 200 lines. -This is an arbitrary limit, just like the standard restriction of each line to 80 characters: +This is an arbitrary limit, just like the common practice of limiting code lines to 80 characters: it seems to be ``enough but not too much'' for most purposes. 
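In practice, this means a long stage of data work is broken into short, task-specific scripts that are run from a single entry point, for example (file names hypothetical):

    * Hypothetical example of splitting a cleaning stage into short do-files
    do "${dataWorkFolder}/code/cleaning/1-import-raw.do"
    do "${dataWorkFolder}/code/cleaning/2-deduplicate.do"
    do "${dataWorkFolder}/code/cleaning/3-label-and-recode.do"

The master script described next is what ties such calls together.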
\subsection{Working with a master script} From 2e8e9155373adfd54c4d496d94a9e45ac438f27f Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Tue, 25 Feb 2020 15:06:15 -0500 Subject: [PATCH 818/854] [ch 6] tex typo --- chapters/data-analysis.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index f9e662e80..4f0082fa4 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -280,7 +280,7 @@ \subsection{Labeling, annotating, and finalizing clean data} \url{https://dimewiki.worldbank.org/iecodebook}} \index{iecodebook} -First, \textbr{renaming}: for data with an accompanying survey instrument, +First, \textbf{renaming}: for data with an accompanying survey instrument, it is useful to keep the same variable names in the cleaned dataset as in the survey instrument. That way it's straightforward to link variables to the relevant survey question. Second, \textbf{labeling}: applying labels makes it easier to understand your data as you explore it, From cafe24bf2dcf8e97ebfe8a80d173c64f35953f8d Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Tue, 25 Feb 2020 15:07:15 -0500 Subject: [PATCH 819/854] [ch 7] tex typo : \LaTeX instead of \LaTex or \LaTeX\ --- chapters/publication.tex | 50 ++++++++++++++++++++-------------------- 1 file changed, 25 insertions(+), 25 deletions(-) diff --git a/chapters/publication.tex b/chapters/publication.tex index 9c178f1cf..0cb2afd01 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -82,15 +82,15 @@ \subsection{Preparing dynamic documents} Therefore this is a broadly unsuitable way to prepare technical documents. The most widely utilized software -for dynamically managing both text and results is \LaTeX\ (pronounced ``lah-tek'').\sidenote{ +for dynamically managing both text and results is \LaTeX (pronounced ``lah-tek'').\sidenote{ \url{https://github.com/worldbank/DIME-LaTeX-Templates}} \index{\LaTeX} -\LaTeX\ is a document preparation and typesetting system with a unique syntax. +\LaTeX is a document preparation and typesetting system with a unique syntax. While this tool has a significant learning curve, its enormous flexibility in terms of operation, collaboration, output formatting, and styling make it the primary choice for most large technical outputs. -In fact, \LaTeX\ operates behind-the-scenes in many other dynamic document tools (discussed below). -Therefore, we recommend that you learn to use \LaTeX\ directly +In fact, \LaTeX operates behind-the-scenes in many other dynamic document tools (discussed below). +Therefore, we recommend that you learn to use \LaTeX directly as soon as you are able to and provide several resources for doing so in the next section. There are tools that can generate dynamic documents from within your scripts, @@ -118,7 +118,7 @@ \subsection{Preparing dynamic documents} \subsection{Technical writing with \LaTeX} -\LaTeX\ is billed as a ``document preparation system''. +\LaTeX is billed as a ``document preparation system''. What this means is worth unpacking. In {\LaTeX}, instead of writing in a ``what-you-see-is-what-you-get'' mode as you do in Word or the equivalent, @@ -127,7 +127,7 @@ \subsection{Technical writing with \LaTeX} Because it is written in a plain text file format, \texttt{.tex} can be version-controlled using Git. This is why it has become the dominant ``document preparation system'' in technical writing. 
-\LaTeX\ enables automatically-organized documents, +\LaTeX enables automatically-organized documents, manages tables and figures dynamically, and includes commands for simple markup like font styles, paragraph formatting, section headers and the like. @@ -136,7 +136,7 @@ \subsection{Technical writing with \LaTeX} It also allows publishers to apply global styles and templates to already-written material, allowing them to reformat entire documents in house styles with only a few keystrokes. -One of the most important tools available in \LaTeX\ +One of the most important tools available in \LaTeX is the BibTeX citation and bibliography manager.\sidenote{ \url{https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.365.3194&rep=rep1&type=pdf}} BibTeX keeps all the references you might use in an auxiliary file, @@ -144,8 +144,8 @@ \subsection{Technical writing with \LaTeX} The same principles that apply to figures and tables are therefore applied here: You can make changes to the references in one place (the \texttt{.bib} file), and then everywhere they are used they are updated correctly with one process. -Specifically, \LaTeX\ inserts references in text using the \texttt{\textbackslash cite\{\}} command. -Once this is written, \LaTeX\ automatically pulls all the citations into text +Specifically, \LaTeX inserts references in text using the \texttt{\textbackslash cite\{\}} command. +Once this is written, \LaTeX automatically pulls all the citations into text and creates a complete bibliography based on the citations you used whenever you compile the document. The system allows you to specify exactly how references should be displayed in text (such as superscripts, inline references, etc.) @@ -166,7 +166,7 @@ \subsection{Technical writing with \LaTeX} With these tools, you can ensure that references are handled in a format you can manage and control.\cite{flom2005latex} -\LaTeX\ has one more useful trick: +\LaTeX has one more useful trick: using \textbf{\texttt{pandoc}},\sidenote{ \url{https://pandoc.org}} you can translate the raw document into Word @@ -185,10 +185,10 @@ \subsection{Technical writing with \LaTeX} and use external tools like Word's compare feature to generate integrated tracked versions when needed. -Unfortunately, despite these advantages, \LaTeX\ can be a challenge to set up and use at first, +Unfortunately, despite these advantages, \LaTeX can be a challenge to set up and use at first, particularly if you are new to working with plain text code and file management. It is also unfortunately weak with spelling and grammar checking. -This is because \LaTeX\ requires that all formatting be done in its special code language, +This is because \LaTeX requires that all formatting be done in its special code language, and it is not particularly informative when you do something wrong. This can be off-putting very quickly for people who simply want to get to writing, like senior researchers. @@ -196,31 +196,31 @@ \subsection{Technical writing with \LaTeX} \url{https://www.texstudio.org}} and \texttt{atom-latex}\sidenote{ \url{https://atom.io/packages/atom-latex}} -offer the most flexibility to work with \LaTeX\ on your computer, +offer the most flexibility to work with \LaTeX on your computer, such as advanced integration with Git, the entire group of writers needs to be comfortable -with \LaTeX\ before adopting one of these tools. +with \LaTeX before adopting one of these tools. 
They can require a lot of troubleshooting at a basic level at first, and staff not used to programming may not be willing or able to acquire the necessary knowledge. -Cloud-based implementations of \LaTex\, discussed in the next section, +Cloud-based implementations of \LaTeX, discussed in the next section, allow teams to take advantage of the features of \LaTeX, without requiring knowledge of the technical details. -\subsection{Getting started with \LaTeX\ in the cloud} +\subsection{Getting started with \LaTeX in the cloud} -\LaTeX\ is a challenging tool to get started using, +\LaTeX is a challenging tool to get started using, but the control it offers over the writing process is invaluable. In order to make it as easy as possible for your team -to use \LaTeX\ without all members having to invest in new skills, -we suggest using a cloud-based implementation as your first foray into \LaTeX\ writing. +to use \LaTeX without all members having to invest in new skills, +we suggest using a cloud-based implementation as your first foray into \LaTeX writing. Most such sites offer a subscription feature with useful extensions and various sharing permissions, and some offer free-to-use versions with basic tools that are sufficient for a broad variety of applications, up to and including writing a complete academic paper with coauthors. -Cloud-based implementations of \LaTeX\ have several advantageous features. +Cloud-based implementations of \LaTeX have several advantageous features. First, since they are completely hosted online, -they avoid the inevitable troubleshooting required to set up a \LaTeX\ installation +they avoid the inevitable troubleshooting required to set up a \LaTeX installation on various personal computers run by the different members of a team. Second, they typically maintain a single, continuously synced, master copy of the document so that different writers do not create conflicted or out-of-sync copies, @@ -230,17 +230,17 @@ \subsection{Getting started with \LaTeX\ in the cloud} Fourth, and most usefully, some implementations provide a ``rich text'' editor that behaves pretty similarly to familiar tools like Word, so that collaborators can write text directly into the document without worrying too much -about the underlying \LaTeX\ coding. +about the underlying \LaTeX coding. Cloud services also usually offer a convenient selection of templates so it is easy to start up a project and see results right away without needing to know a lot of the code that controls document formatting. -Cloud-based implementations of \LaTeX\ also have disadvantages. +Cloud-based implementations of \LaTeX also have disadvantages. There is still some up-front learning required, unless you're using the rich text editor. Continuous access to the Internet is necessary, and updating figures and tables requires a bulk file upload that is tough to automate. Despite this, we believe that with minimal learning and workflow adjustments, -cloud-based implementations are often the easiest way to allow coauthors to write and edit in \LaTeX\, +cloud-based implementations are often the easiest way to allow coauthors to write and edit in \LaTeX, so long as you make sure you are available to troubleshoot minor issues like these. @@ -455,7 +455,7 @@ \subsection{Publishing code for replication} so ensure that the raw components of figures or tables are clearly identified. Documentation in the master script is often used to indicate this information. 
For example, outputs should clearly correspond by name to an exhibit in the paper, and vice versa. -(Supplying a compiling \LaTeX\ document can support this.) +(Supplying a compiling \LaTeX document can support this.) Code and outputs which are not used should be removed before publication. \subsection{Releasing a replication package} From c004559601b11c8c3f8dff51a4193c7242080eba Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Tue, 25 Feb 2020 15:10:25 -0500 Subject: [PATCH 820/854] [ch 6] tex typo --- chapters/data-analysis.tex | 58 +++++++++++++++++++------------------- 1 file changed, 29 insertions(+), 29 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 4f0082fa4..e077e4e3d 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -65,7 +65,7 @@ \subsection{Organizing your folder structure} \url{https://dimewiki.worldbank.org/ietoolkit}}) to automatize the creation of folders following our preferred scheme and to standardize folder structures across teams and projects. -A standardized structure greatly reduces the costs that PIs and RAs +A standardized structure greatly reduces the costs that PIs and RAs face when switching between projects, because folders are organized in exactly the same way and use the same filepaths, shortcuts, and macro references.\sidenote{ @@ -108,15 +108,15 @@ \subsection{Breaking down tasks} The code, data and outputs of each of these stages should go through at least one round of code review, in which team members read and run each other's codes. Reviewing code at each stage, rather than waiting until the end of a project, -is preferrable as the amount of code to review is more manageable and -it allows you to correct errors in real-time (e.g. correcting errors in variable construction before analysis begins). +is preferrable as the amount of code to review is more manageable and +it allows you to correct errors in real-time (e.g. correcting errors in variable construction before analysis begins). Code review is a common quality assurance practice among data scientists. It helps to keep the quality of the outputs high, and is also a great way to learn and improve your own code. \subsection{Writing master scripts} Master scripts allow users to execute all the project code from a single file. -As discussed in Chapter 2, the master script should briefly describe what each +As discussed in Chapter 2, the master script should briefly describe what each section of the code does, and map the files they require and create. The master script also connects code and folder structure through macros or objects. In short, a master script is a human-readable map of the tasks, @@ -174,18 +174,18 @@ \section{De-identifying research data} case you later realize that the manual fix was done incorrectly. The first step in the transformation of raw data to an analysis-ready dataset is de-identification. -This simplifies workflows, as once you create a de-identified version of the dataset, -you no longer need to interact directly with the encrypted raw data. +This simplifies workflows, as once you create a de-identified version of the dataset, +you no longer need to interact directly with the encrypted raw data. 
\textbf{De-identification}, at this stage, means stripping the data set of personally identifying information.\sidenote{
 \url{https://dimewiki.worldbank.org/De-identification}}
To do so, you will need to identify all variables that contain
identifying information.\sidenote{\url{
https://www.povertyactionlab.org/sites/default/files/resources/J-PAL-guide-to-deidentifying-data.pdf}}
For primary data collection, where the research team designs the survey instrument,
flagging all potentially identifying variables in the questionnaire design stage
simplifies the initial de-identification process.
If you did not do that, or you received original data by another means,
there are a few tools to help flag variables with personally-identifying data.
J-PAL's \texttt{PII scan}, as indicated by its name,
scans variable names and labels for common string patterns associated with identifying information.\sidenote{
 \url{https://github.com/J-PAL/PII-Scan}}
@@ -246,14 +246,14 @@ \section{Cleaning data for analysis}

\subsection{Identifying the identifier}

The first step in the cleaning process is to understand the level of observation in the data (what makes a row),
and what variable or set of variables uniquely identifies each observation.
Ensuring that observations are uniquely and fully identified\sidenote{\url{https://dimewiki.worldbank.org/ID\_Variable\_Properties}}
is possibly the most important step in data cleaning.
It may be the case that the variable expected to be the unique identifier in fact is either incomplete or contains duplicates.
This could be due to duplicate observations or errors in data entry.
It could also be the case that there is no identifying variable, or the identifier is a long string, such as a name.
In this case cleaning begins by carefully creating a variable that uniquely identifies the data.
As discussed in the previous chapter,
checking for duplicated entries is usually part of data quality monitoring,
and is ideally addressed as soon as data is received
@@ -273,14 +273,14 @@ \subsection{Labeling, annotating, and finalizing clean data}

The last step of data cleaning is to label and annotate the data,
so that all users have the information needed to interact with it.
There are three key steps: renaming, labeling and recoding.
This is a key step to making the data easy to use,
but it can be quite repetitive.
The \texttt{iecodebook} command suite, also part of \texttt{iefieldkit},
is designed to make some of the most tedious components of this process easier.\sidenote{
 \url{https://dimewiki.worldbank.org/iecodebook}}
\index{iecodebook}

First, \textbf{renaming}: for data with an accompanying survey instrument,
it is useful to keep the same variable names in the cleaned dataset as in the survey instrument.
That way it's straightforward to link variables to the relevant survey question.
Second, \textbf{labeling}: applying labels makes it easier to understand your data as you explore it, @@ -301,7 +301,7 @@ \subsection{Preparing a clean dataset} The main output of data cleaning is the cleaned data set. It should contain the same information as the raw data set, with identifying variables and data entry mistakes removed. -Although primary data typically requires more extensive data cleaning than secondary data, +Although primary data typically requires more extensive data cleaning than secondary data, you should carefully explore possible issues in any data you are about to use. When reviewing raw data, you will inevitably encounter data entry mistakes, such as typos and inconsistent values. @@ -311,9 +311,9 @@ \subsection{Preparing a clean dataset} https://dimewiki.worldbank.org/Data\_Cleaning}} The clean dataset should always be accompanied by a dictionary or codebook. -Survey data should be easily traced back to the survey instrument. +Survey data should be easily traced back to the survey instrument. Typically, one cleaned data set will be created for each data source -or survey instrument; and each row in the cleaned data set represents one +or survey instrument; and each row in the cleaned data set represents one respondent or unit of observation.\cite{tidy-data} If the raw data set is very large, or the survey instrument is very complex, @@ -341,11 +341,11 @@ \subsection{Preparing a clean dataset} \subsection{Documenting data cleaning} -Throughout the data cleaning process, +Throughout the data cleaning process, you will often need extensive inputs from the people responsible for data collection. -(This could be a survey team, the government ministry responsible for administrative data systems, +(This could be a survey team, the government ministry responsible for administrative data systems, the technology firm that generated remote sensing data, etc.) -You should acquire and organize all documentation of how the data was generated, such as +You should acquire and organize all documentation of how the data was generated, such as reports from the data provider, field protocols, data collection manuals, survey instruments, supervisor notes, and data quality monitoring reports. These materials are essential for data documentation.\sidenote{ @@ -413,7 +413,7 @@ \section{Constructing final indicators} The first one is to clearly differentiate correction of data entry errors (necessary for all interactions with the data) from creation of analysis indicators (necessary only for the analysis at hand). -It is also important to differentiate the two stages +It is also important to differentiate the two stages to ensure that variable definition is consistent across data sources. Unlike cleaning, construction can create many outputs from many inputs. Let's take the example of a project that has a baseline and an endline survey. @@ -428,29 +428,29 @@ \section{Constructing final indicators} % From analysis Ideally, indicator construction should be done right after data cleaning and before data analysis starts, according to the pre-analysis plan.\index{Pre-analysis plan} -In practice, however, as you analyze the data, +In practice, however, as you analyze the data, different constructed variables will become necessary, as well as subsets and other alterations to the data. -Even if construction and analysis are done concurrently, -you should ways do the two in separate scripts. 
+Even if construction and analysis are done concurrently, +you should ways do the two in separate scripts. If every script that creates a table starts by loading a data set, subsetting it, and manipulating variables, any edits to construction need to be replicated in all scripts. This increases the chances that at least one of them will have a different sample or variable definition. -Doing all variable construction in a single, separate script helps +Doing all variable construction in a single, separate script helps avoid this and ensure consistency across different outputs. \subsection{Constructing analytical variables} New variables created during the construction stage should be added to the data set, instead of overwriting the original information. -New variables should be assigned functional names. +New variables should be assigned functional names. Ordering the data set so that related variables are together, and adding notes to each of them as necessary will make your data set more user-friendly. -Before constructing new variables, +Before constructing new variables, you must check and double-check the value-assignments of questions, as well as the units and scales. This is when you will use the knowledge of the data and the documentation you acquired during cleaning. -For example, it's possible that the survey instrument asked respondents +For example, it's possible that the survey instrument asked respondents to report some answers as percentages and others as proportions, or that in one question \texttt{0} means ``no'' and \texttt{1} means ``yes'', while in another one the same answers were coded as \texttt{1} and \texttt{2}. @@ -531,7 +531,7 @@ \section{Writing data analysis code} \textit{A Practical Introduction to Stata};\sidenote{\url{https://scholar.harvard.edu/files/mcgovern/files/practical\_introduction\_to\_stata.pdf}} \textit{Mostly Harmless Econometrics};\sidenote{\url{https://www.researchgate.net/publication/51992844\_Mostly\_Harmless\_Econometrics\_An\_Empiricist's\_Companion}} and \textit{Causal Inference: The Mixtape}.\sidenote{\url{https://scunning.com/mixtape.html}} -We focus on how to \texit{code} data analysis, rather than how to conduct specific analyses. +We focus on how to \textit{code} data analysis, rather than how to conduct specific analyses. \subsection{Organizing analysis code} @@ -672,7 +672,7 @@ \subsection{Exporting analysis outputs} Exporting table to \texttt{.tex} should be preferred. Excel \texttt{.xlsx} and \texttt{.csv} are also commonly used, but require the extra step of copying the tables into the final output. -The amount of work needed in a copy-paste workflow increases +The amount of work needed in a copy-paste workflow increases rapidly with the number of tables and figures included in a research output, and so do the chances of having the wrong version a result in your paper or report. 
From c94241342c6a48f50fbad894fcd29586bbe2aa7e Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Tue, 25 Feb 2020 14:56:33 -0500 Subject: [PATCH 821/854] [code] masterdo : standardize indentation --- code/stata-master-dofile.do | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/code/stata-master-dofile.do b/code/stata-master-dofile.do index ee7f6530a..47887e6f9 100644 --- a/code/stata-master-dofile.do +++ b/code/stata-master-dofile.do @@ -10,7 +10,7 @@ * PART 3: Run do files * * * ******************************************************************************** - PART 1: Set standard settings and install packages + PART 1: Install user-written packages and harmonize settings ********************************************************************************/ if (0) { @@ -21,7 +21,7 @@ `r(version)' /******************************************************************************* - PART 2: Prepare folder paths and define programs + PART 2: Prepare folder paths and define programs *******************************************************************************/ * Research Assistant folder paths @@ -32,19 +32,19 @@ } - * Baseline folder globals - global bl_encrypt "${encrypted}/Round Baseline Encrypted" - global bl_dt "${dropbox}/Baseline/DataSets" - global bl_doc "${dropbox}/Baseline/Documentation" - global bl_do "${github}/Baseline/Dofiles" - global bl_out "${github}/Baseline/Output" + * Baseline folder globals + global bl_encrypt "${encrypted}/Round Baseline Encrypted" + global bl_dt "${dropbox}/Baseline/DataSets" + global bl_doc "${dropbox}/Baseline/Documentation" + global bl_do "${github}/Baseline/Dofiles" + global bl_out "${github}/Baseline/Output" /******************************************************************************* - PART 3: Run do files + PART 3: Run do files *******************************************************************************/ /*------------------------------------------------------------------------------ - PART 3.1: De-identify baseline data + PART 3.1: De-identify baseline data -------------------------------------------------------------------------------- REQUIRES: ${bl_encrypt}/Raw Identified Data/D4DI_baseline_raw_identified.dta CREATES: ${bl_dt}/Raw Deidentified/D4DI_baseline_raw_deidentified.dta @@ -53,7 +53,7 @@ do "${bl_do}/Cleaning/deidentify.do" /*------------------------------------------------------------------------------ - PART 3.2: Clean baseline data + PART 3.2: Clean baseline data -------------------------------------------------------------------------------- REQUIRES: ${bl_dt}/Raw Deidentified/D4DI_baseline_raw_deidentified.dta CREATES: ${bl_dt}/Final/D4DI_baseline_clean.dta @@ -63,7 +63,7 @@ do "${bl_do}/Cleaning/cleaning.do" /*----------------------------------------------------------------------------- - PART 3.3: Construct income indicators + PART 3.3: Construct income indicators -------------------------------------------------------------------------------- REQUIRES: ${bl_dt}/Final/D4DI_baseline_clean.dta CREATES: ${bl_out}/Raw/D4DI_baseline_income_distribution.png From 7d9d7623aecaca940347d74dbed8f3ba0a34de74 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Tue, 25 Feb 2020 14:57:29 -0500 Subject: [PATCH 822/854] [code] masterdo : add section switches --- code/stata-master-dofile.do | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) diff --git a/code/stata-master-dofile.do b/code/stata-master-dofile.do index 47887e6f9..2236fa89b 100644 --- a/code/stata-master-dofile.do +++ 
b/code/stata-master-dofile.do @@ -49,9 +49,10 @@ REQUIRES: ${bl_encrypt}/Raw Identified Data/D4DI_baseline_raw_identified.dta CREATES: ${bl_dt}/Raw Deidentified/D4DI_baseline_raw_deidentified.dta IDS VAR: hhid -------------------------------------------------------------------------------- */ - do "${bl_do}/Cleaning/deidentify.do" - +----------------------------------------------------------------------------- */ + if (0) { //Change the 0 to 1 to run the baseline de-identification dofile + do "${bl_do}/Cleaning/deidentify.do" + } /*------------------------------------------------------------------------------ PART 3.2: Clean baseline data -------------------------------------------------------------------------------- @@ -60,8 +61,9 @@ ${bl_doc}/Codebook baseline.xlsx IDS VAR: hhid ----------------------------------------------------------------------------- */ - do "${bl_do}/Cleaning/cleaning.do" - + if (0) { //Change the 0 to 1 to run the baseline cleaning dofile + do "${bl_do}/Cleaning/cleaning.do" + } /*----------------------------------------------------------------------------- PART 3.3: Construct income indicators -------------------------------------------------------------------------------- @@ -70,4 +72,6 @@ ${bl_dt}/Intermediate/D4DI_baseline_constructed_income.dta IDS VAR: hhid ----------------------------------------------------------------------------- */ - do "${bl_do}/Construct/construct_income.do" + if (0) { //Change the 0 to 1 to run the baseline variable construction dofile + do "${bl_do}/Construct/construct_income.do" + } \ No newline at end of file From a1b16c7c9d9f360e8c7376cf6e3c51db005bf5bf Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Tue, 25 Feb 2020 14:57:52 -0500 Subject: [PATCH 823/854] [code] masterdo : add comment to ieboilstart --- code/stata-master-dofile.do | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/code/stata-master-dofile.do b/code/stata-master-dofile.do index 2236fa89b..8a7146534 100644 --- a/code/stata-master-dofile.do +++ b/code/stata-master-dofile.do @@ -17,7 +17,8 @@ ssc install ietoolkit, replace } - ieboilstart, v(15.1) + *Harmonize settings accross users as much as possible + ieboilstart, v(13.1) `r(version)' /******************************************************************************* From 99bc0d35bf42073b24ee894a73c8f6ef286320b6 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Tue, 25 Feb 2020 14:58:16 -0500 Subject: [PATCH 824/854] [code] masterdo : improve user-written code install --- code/stata-master-dofile.do | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/code/stata-master-dofile.do b/code/stata-master-dofile.do index 8a7146534..6944c2831 100644 --- a/code/stata-master-dofile.do +++ b/code/stata-master-dofile.do @@ -13,10 +13,12 @@ PART 1: Install user-written packages and harmonize settings ********************************************************************************/ - if (0) { - ssc install ietoolkit, replace - } - + local user_commands ietoolkit iefieldkit //Add required user-written commands + foreach command of local user_commands { + cap which `command' + if _rc == 111 ssc install `command' + } + *Harmonize settings accross users as much as possible ieboilstart, v(13.1) `r(version)' From 2cb444802d77ccfc6345aacb7145e6470072fd77 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 25 Feb 2020 15:32:36 -0500 Subject: [PATCH 825/854] Integrating data --- chapters/data-analysis.tex | 44 +++++++++++++++++++++++++++++++++++++- 1 file changed, 43 insertions(+), 1 
deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index e077e4e3d..10e2e8e47 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -440,6 +440,48 @@ \section{Constructing final indicators} Doing all variable construction in a single, separate script helps avoid this and ensure consistency across different outputs. +\subsection{Integrating different data sources} +Often, you will combine or merge information from different data sources together +in order to create the analytical variables you are interested in. +For example, you may merge administrative data with survey data +to include demographic information in your analysis, +or you may want to integrate geographic information +in order to construct indicators or controls based on the location of observations. +To do this, you will need to consider several possible linkages. +In the simplest case, you might be merging two survey or administrative datasets +at the same unit of response using a consistent numeric ID. +You might also be combining datasets with the same unit of response, +but that use an external or string ID such as a name, +which do not always perfectly match. +In these cases, you will need to extensively analyse the merging patterns +and understand what units are present in one dataset but not the other, +as well as be able to resolve fuzzy or imperfect matching. +There are some commands such as \texttt{reclink} in Stata +that can provide some useful utilities, +but often a large amount of close examination is necessary +in order to figure out what the matching pattern should be +and how to accomplish it in practice through your code. + +In other cases, different data sources will be describing different levels of observation. +For example, you might be combining roads data with household data, +using matching methods such as distances, +or combining something like household data with individual records +from another administrative source. +Sometimes these cases are conceptually straightforward. +For example, merging a dataset of health care providers +with a dataset of patients comes with a clear linking relation between the two; +the challenge usually occurs in correctly defining statistical aggregations +if the merge is intended to result in a dataset at the provider level. +However, other cases may not be designed with the intention to be merged together, +such as a dataset of infrastructure access points such as water pumps or schools +and a datset of household locations and roads. +In those cases, a key part of the research contribution is figuring out what +a useful way to combine the datasets is. +Since these conceptual constructs are so important +and so easy to imagine different ways to do, +it is especially important that these data integrations are not treated mechanically +and are extensively documented separately from other data construction tasks. + \subsection{Constructing analytical variables} New variables created during the construction stage should be added to the data set, instead of overwriting the original information. New variables should be assigned functional names. 
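A minimal sketch of that principle in Stata, with hypothetical variable names,
is to generate and document a new variable rather than replacing the original one:

    * Hypothetical example: keep the original variables and add a clearly named new one
    gen   food_exp_pc = food_exp_total / hh_size
    label variable food_exp_pc "Food expenditure per household member"
    note  food_exp_pc: constructed from food_exp_total and hh_size during construction
    order food_exp_pc, after(food_exp_total)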
@@ -531,7 +573,7 @@ \section{Writing data analysis code} \textit{A Practical Introduction to Stata};\sidenote{\url{https://scholar.harvard.edu/files/mcgovern/files/practical\_introduction\_to\_stata.pdf}} \textit{Mostly Harmless Econometrics};\sidenote{\url{https://www.researchgate.net/publication/51992844\_Mostly\_Harmless\_Econometrics\_An\_Empiricist's\_Companion}} and \textit{Causal Inference: The Mixtape}.\sidenote{\url{https://scunning.com/mixtape.html}} -We focus on how to \textit{code} data analysis, rather than how to conduct specific analyses. +We focus on how to \textit{code} data analysis, rather than how to conduct specific analyses. \subsection{Organizing analysis code} From 95792e9b50e23dfc70ab9ab5c1b9757c79cb043e Mon Sep 17 00:00:00 2001 From: Maria Date: Tue, 25 Feb 2020 17:29:14 -0500 Subject: [PATCH 826/854] Update data-analysis.tex revised draft. renamed section to constructing analysis datasets and restructured. --- chapters/data-analysis.tex | 130 ++++++++++++++++++------------------- 1 file changed, 63 insertions(+), 67 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 10e2e8e47..0c84eddee 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -370,23 +370,15 @@ \subsection{Documenting data cleaning} then use it as a basis for discussions of how to address data issues during variable construction. This material will also be valuable during exploratory data analysis. -\section{Constructing final indicators} +\section{Constructing analysis datasets} % What is construction ------------------------------------- -The third stage is construction of the variables of interest for analysis. -It is at this stage that the raw data is transformed into analysis data. -This is done by creating derived variables (dummies, indices, and interactions, to name a few), +The third stage is construction of the dataset you will use for analysis. +It is at this stage that the raw data is transformed into analysis-ready data, +by integrating different datasets and creating derived variables +(dummies, indices, and interactions, to name a few), as planned during research design\index{Research design}, and using the pre-analysis plan as a guide.\index{Pre-analysis plan} -To understand why construction is necessary, -let's take the example of a consumption module from a household survey. -For each item in a context-specific bundle, -the respondent is asked whether the household consumed each item over a certain period of time. -If they did, the respondent will be asked about the quantity consumed and the cost of the relevant unit. -It would be difficult to run a meaningful regression -on the number of cups of milk and handfuls of beans that a household consumed over a week. -You need to manipulate them into something that has \textit{economic} meaning, -such as caloric input or food expenditure per adult equivalent. During this process, the data points will typically be reshaped and aggregated so that level of the data set goes from the unit of observation (one item in the bundle) in the survey to the unit of analysis (the household).\sidenote{ @@ -408,13 +400,12 @@ \section{Constructing final indicators} Having three separate datasets for each of these three pieces of analysis will result in much cleaner do files than if they all started from the same data set. -% From cleaning +\subsection{Fitting construction into the data workflow} Construction is done separately from data cleaning for two reasons. 
-The first one is to clearly differentiate correction of data entry errors +First, it clearly differentiates correction of data entry errors (necessary for all interactions with the data) -from creation of analysis indicators (necessary only for the analysis at hand). -It is also important to differentiate the two stages -to ensure that variable definition is consistent across data sources. +from creation of analysis indicators (necessary only for the specific analysis). +Second, it ensures that variable definition is consistent across data sources. Unlike cleaning, construction can create many outputs from many inputs. Let's take the example of a project that has a baseline and an endline survey. Unless the two instruments are exactly the same, @@ -425,14 +416,14 @@ \section{Constructing final indicators} To do this, you will require at least two cleaning scripts, and a single one for construction. -% From analysis -Ideally, indicator construction should be done right after data cleaning and before data analysis starts, +Construction of the analysis data should be done right after data cleaning and before data analysis starts, according to the pre-analysis plan.\index{Pre-analysis plan} In practice, however, as you analyze the data, -different constructed variables will become necessary, -as well as subsets and other alterations to the data. +different constructed variables may become necessary, +as well as subsets and other alterations to the data, +and you will need to adjust the analysis data accordingly. Even if construction and analysis are done concurrently, -you should ways do the two in separate scripts. +you should always do the two in separate scripts. If every script that creates a table starts by loading a data set, subsetting it, and manipulating variables, any edits to construction need to be replicated in all scripts. @@ -442,18 +433,27 @@ \section{Constructing final indicators} \subsection{Integrating different data sources} Often, you will combine or merge information from different data sources together -in order to create the analytical variables you are interested in. +in order to create the analysis dataset. For example, you may merge administrative data with survey data to include demographic information in your analysis, or you may want to integrate geographic information in order to construct indicators or controls based on the location of observations. -To do this, you will need to consider several possible linkages. -In the simplest case, you might be merging two survey or administrative datasets -at the same unit of response using a consistent numeric ID. -You might also be combining datasets with the same unit of response, -but that use an external or string ID such as a name, -which do not always perfectly match. -In these cases, you will need to extensively analyse the merging patterns +To do this, you will need to consider the unit of response for each dataset, +and the identifying variable, to understand how they can be merged. + +If the datasets you need to join have the same unit of response, +merging may be straightforward. +The simplest case is merging datasets at the same unit of response +which use a consistent numeric ID. +For example, in the case of a panel survey for firms, +you may merge baseline and endline data using the firm identification number. +In many cases, however, +datasets at the same unit of response may not use a consistent numeric identifier. 
+Identifiers that are string variables, such as names, +often contain spelling mistakes or irregularities in capitalization, spacing or ordering. +In this case, you will need to do a \textbf{fuzzy match}, +to link observations that have similar identifiers. +In these cases, you will need to extensively analyze the merging patterns and understand what units are present in one dataset but not the other, as well as be able to resolve fuzzy or imperfect matching. There are some commands such as \texttt{reclink} in Stata @@ -462,11 +462,11 @@ \subsection{Integrating different data sources} in order to figure out what the matching pattern should be and how to accomplish it in practice through your code. -In other cases, different data sources will be describing different levels of observation. -For example, you might be combining roads data with household data, -using matching methods such as distances, -or combining something like household data with individual records -from another administrative source. +In other cases, you will need to join data sources that have different units of response. +For example, you might be overlaying road location data with household data, +using a spatial match, +or combining school administrative data, such as attendance records and test scores, +with household demographic characteristics from a survey. Sometimes these cases are conceptually straightforward. For example, merging a dataset of health care providers with a dataset of patients comes with a clear linking relation between the two; @@ -474,7 +474,7 @@ \subsection{Integrating different data sources} if the merge is intended to result in a dataset at the provider level. However, other cases may not be designed with the intention to be merged together, such as a dataset of infrastructure access points such as water pumps or schools -and a datset of household locations and roads. +and a dataset of household locations and roads. In those cases, a key part of the research contribution is figuring out what a useful way to combine the datasets is. Since these conceptual constructs are so important @@ -482,52 +482,47 @@ \subsection{Integrating different data sources} it is especially important that these data integrations are not treated mechanically and are extensively documented separately from other data construction tasks. +Integrating different datasets may involve changing the structure of the data, +e.g. changing the unit of observation through collapses or reshapes. +This should always be done with great care. +Two issues to pay extra attention to are missing values and dropped observations. +Merging, reshaping and aggregating data sets can change both the total number of observations +and the number of observations with missing values. +Make sure to read about how each command treats missing observations and, +whenever possible, add automated checks in the script that throw an error message if the result is changing. +If you are subsetting your data, +drop observations explicitly, +indicating why you are doing that and how the data set changed. + \subsection{Constructing analytical variables} -New variables created during the construction stage should be added to the data set, instead of overwriting the original information. -New variables should be assigned functional names. -Ordering the data set so that related variables are together, -and adding notes to each of them as necessary will make your data set more user-friendly. 
+Once you have assembled your different data sources, +it's time to create the specific indicators of interest for analysis. +New variables should be assigned functional names, +and the dataset ordered such that related variables are together. +Adding notes to each variable will make your data set more user-friendly. Before constructing new variables, you must check and double-check the value-assignments of questions, as well as the units and scales. This is when you will use the knowledge of the data and the documentation you acquired during cleaning. -For example, it's possible that the survey instrument asked respondents -to report some answers as percentages and others as proportions, -or that in one question \texttt{0} means ``no'' and \texttt{1} means ``yes'', +First, check that all categorical variables have the same value assignment, i.e., +that labels and levels have the same correspondence across variables that use the same options. +For example, it's possible that in one question \texttt{0} means ``no'' and \texttt{1} means ``yes'', while in another one the same answers were coded as \texttt{1} and \texttt{2}. -We recommend coding yes/no questions as either \texttt{1} and \texttt{0} or \texttt{TRUE} and \texttt{FALSE}, +(We recommend coding binary questions as either \texttt{1} and \texttt{0} or \texttt{TRUE} and \texttt{FALSE}, so they can be used numerically as frequencies in means and as dummies in regressions. -(Note that this implies that categorical variables like \texttt{sex} -should be re-expressed as binary variables like \texttt{female}.) -Check that non-binary categorical variables have the same value assignment, i.e., -that labels and levels have the same correspondence across variables that use the same options. -Finally, make sure that any numeric variables you are comparing are converted to the same scale or unit of measure. +Note that this implies re-expressing categorical variables like \texttt{sex} to binary variables like \texttt{woman}.) +Second, make sure that any numeric variables you are comparing are converted to the same scale or unit of measure. You cannot add one hectare and two acres and get a meaningful number. -During construction, you will also need to address some of the issues -you identified in the data set as you were cleaning it. -The most common of them is the presence of outliers. +You will also need to decide how to handle any outliers or unusual values identified during data cleaning. How to treat outliers is a question for the research team (as there are multiple possible approaches), but make sure to note what decision was made and why. Results can be sensitive to the treatment of outliers, so keeping the original variable in the data set will allow you to test how much it affects the estimates. These points also apply to imputation of missing values and other distributional patterns. -The more complex construction tasks involve changing the structure of the data: -adding new observations or variables by merging data sets, -and changing the unit of observation through collapses or reshapes. -There are always ways for things to go wrong that we never anticipated, -but two issues to pay extra attention to are missing values and dropped observations. -Merging, reshaping and aggregating data sets can change both the total number of observations -and the number of observations with missing values. 
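One hedged way to implement this, assuming a hypothetical income variable,
is to leave the reported values untouched and add a winsorized copy to use in robustness checks:

    * Sketch only: keep the original variable and add a winsorized version
    summarize income_total, detail
    local p99 = r(p99)
    gen     income_total_w99 = income_total
    replace income_total_w99 = `p99' if income_total > `p99' & !missing(income_total)
    label variable income_total_w99 "Total income, winsorized at the 99th percentile"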
-Make sure to read about how each command treats missing observations and, -whenever possible, add automated checks in the script that throw an error message if the result is changing. -If you are subsetting your data, -drop observations explicitly, -indicating why you are doing that and how the data set changed. - -Finally, primary panel data involves additional timing complexities. +Primary panel data presents additional timing complexities. It is common to construct indicators soon after receiving data from a new survey round. However, creating indicators for each round separately increases the risk of using different definitions every time. Having a well-established definition for each constructed variable helps prevent that mistake, @@ -540,6 +535,7 @@ \subsection{Constructing analytical variables} In addition to preventing inconsistencies, this process will also save you time and give you an opportunity to review your original code. + \subsection{Documenting variable construction} Because data construction involves translating concrete data points to more abstract measurements, From af11287dcfe9dec7d0fcb36f096819bb2002856b Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Tue, 25 Feb 2020 17:43:48 -0500 Subject: [PATCH 827/854] Clarify non-experimental --- bibliography.bib | 20 ++++++++++++++++++++ chapters/research-design.tex | 18 +++++++++++++++--- 2 files changed, 35 insertions(+), 3 deletions(-) diff --git a/bibliography.bib b/bibliography.bib index 1640bd2c6..d0c677d00 100644 --- a/bibliography.bib +++ b/bibliography.bib @@ -159,6 +159,26 @@ @article{blischak2016quick publisher={Public Library of Science} } +@article{imbens2001estimating, + title={Estimating the effect of unearned income on labor earnings, savings, and consumption: Evidence from a survey of lottery players}, + author={Imbens, Guido W and Rubin, Donald B and Sacerdote, Bruce I}, + journal={American economic review}, + volume={91}, + number={4}, + pages={778--794}, + year={2001} +} + +@article{callen2015catastrophes, + title={Catastrophes and time preference: Evidence from the Indian Ocean Earthquake}, + author={Callen, Michael}, + journal={Journal of Economic Behavior \& Organization}, + volume={118}, + pages={199--214}, + year={2015}, + publisher={Elsevier} +} + @article{lee2010regression, title={Regression discontinuity designs in economics}, author={Lee, David S and Lemieux, Thomas}, diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 773ae1b5b..c7a90469f 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -247,15 +247,27 @@ \subsection{Cross-sectional designs} already reflect the effect of the treatment. If the study is experimental, the treatment and control groups are randomly constructed from the population that is eligible to receive each treatment. -If it is a non-randomized observational study, we present other evidence that a similar equivalence holds. -In either case, by construction, each unit's receipt of the treatment +By construction, each unit's receipt of the treatment is unrelated to any of its other characteristics and the ordinary least squares (OLS) regression of outcome on treatment, without any control variables, is an unbiased estimate of the average treatment effect. +Cross-sectional designs can also exploit variation in non-experimental data +to argue that observed correlations do in fact represent causal effects. 
+This can be true unconditionally -- which is to say that something random, +such as winning the lottery, is a true random process and can tell you about the effect +of getting a large amount of money.\cite{imbens2001estimating} +It can also be true conditionally -- which is to say that once the +characteristics that would affect both the likelihood of exposure to a treatment +and the outcome of interest are controlled for, +the process is as good as random: +like arguing that once risk preferences are taken into account, +exposure to an earthquake is unpredictable and post-event differences +are causally related to the event itself.\cite{callen2015catastrophes} + For cross-sectional designs, what needs to be carefully maintained in data -is the treatment randomization process itself, +is the treatment randomization process itself (whether experimental or not), as well as detailed information about differences in data quality and loss to follow-up across groups.\cite{athey2017econometrics} Only these details are needed to construct the appropriate estimator: From aa3ec29179632fbbbe1140fe081665e69726b656 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Tue, 25 Feb 2020 18:20:17 -0500 Subject: [PATCH 828/854] [ch6] unit of observation, not unit of response --- chapters/data-analysis.tex | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 0c84eddee..f3fdcdd08 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -438,17 +438,17 @@ \subsection{Integrating different data sources} to include demographic information in your analysis, or you may want to integrate geographic information in order to construct indicators or controls based on the location of observations. -To do this, you will need to consider the unit of response for each dataset, +To do this, you will need to consider the unit of observation for each dataset, and the identifying variable, to understand how they can be merged. -If the datasets you need to join have the same unit of response, +If the datasets you need to join have the same unit of observation, merging may be straightforward. -The simplest case is merging datasets at the same unit of response +The simplest case is merging datasets at the same unit of observation which use a consistent numeric ID. For example, in the case of a panel survey for firms, you may merge baseline and endline data using the firm identification number. In many cases, however, -datasets at the same unit of response may not use a consistent numeric identifier. +datasets at the same unit of observation may not use a consistent numeric identifier. Identifiers that are string variables, such as names, often contain spelling mistakes or irregularities in capitalization, spacing or ordering. In this case, you will need to do a \textbf{fuzzy match}, From 48e4baeacad81adaf87716c662433f0c9e8764ca Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Tue, 25 Feb 2020 18:23:37 -0500 Subject: [PATCH 829/854] [ch6] units of observation, not units of response --- chapters/data-analysis.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index f3fdcdd08..611aecaab 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -462,7 +462,7 @@ \subsection{Integrating different data sources} in order to figure out what the matching pattern should be and how to accomplish it in practice through your code. 
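For the simpler case of a shared numeric ID, a minimal sketch of inspecting the merge
pattern explicitly, using the project's household ID and a hypothetical administrative
file, looks like the following; after a fuzzy match with a tool like \texttt{reclink},
the same inspection of unmatched units is still needed:

    * Sketch only: merge on a common ID and check which units matched before proceeding
    use "${bl_dt}/Final/D4DI_baseline_clean.dta", clear
    merge 1:1 hhid using "${bl_dt}/Final/admin_records.dta", gen(merge_admin)
    tab merge_admin                  // 1 = survey only, 2 = admin only, 3 = matched
    list hhid if merge_admin != 3    // investigate unmatched units and document them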
-In other cases, you will need to join data sources that have different units of response. +In other cases, you will need to join data sources that have different units of observation. For example, you might be overlaying road location data with household data, using a spatial match, or combining school administrative data, such as attendance records and test scores, From ea2cc4d806c52ad64b5457e118ba63934a9a5110 Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 25 Feb 2020 20:04:03 -0500 Subject: [PATCH 830/854] word search: do-file Resolves #191 --- appendix/stata-guide.tex | 2 +- chapters/data-analysis.tex | 2 +- code/stata-master-dofile.do | 4 ++-- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index f6f3bf733..1e309402d 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -457,7 +457,7 @@ \subsection{Miscellaneous notes} \bigskip\noindent Make sure your code doesn't print very much to the results window as this is slow. This can be accomplished by using \texttt{run file.do} rather than \texttt{do file.do}. -Interactive commands like \texttt{sum} or \texttt{tab} should be used sparingly in dofiles, +Interactive commands like \texttt{sum} or \texttt{tab} should be used sparingly in do-files, unless they are for the purpose of getting \texttt{r()}-statistics. In that case, consider using the \texttt{qui} prefix to prevent printing output. It is also faster to get outputs from commands like \texttt{reg} using the \texttt{qui} prefix. diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index e077e4e3d..9ef9adfbd 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -406,7 +406,7 @@ \section{Constructing final indicators} test for plot-level productivity gains, and check if village characteristics are balanced. Having three separate datasets for each of these three pieces of analysis -will result in much cleaner do files than if they all started from the same data set. +will result in much cleaner do-files than if they all started from the same data set. % From cleaning Construction is done separately from data cleaning for two reasons. 
diff --git a/code/stata-master-dofile.do b/code/stata-master-dofile.do index ee7f6530a..9b68db13c 100644 --- a/code/stata-master-dofile.do +++ b/code/stata-master-dofile.do @@ -7,7 +7,7 @@ * * * OUTLINE: PART 1: Set standard settings and install packages * * PART 2: Prepare folder paths and define programs * -* PART 3: Run do files * +* PART 3: Run do-files * * * ******************************************************************************** PART 1: Set standard settings and install packages @@ -40,7 +40,7 @@ global bl_out "${github}/Baseline/Output" /******************************************************************************* - PART 3: Run do files + PART 3: Run do-files *******************************************************************************/ /*------------------------------------------------------------------------------ From cf7fb51efc4c5f2dcbb1ef143761e0f9ee430a39 Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 25 Feb 2020 20:06:09 -0500 Subject: [PATCH 831/854] word search: loss to follow-up Resolves #396 --- chapters/data-collection.tex | 6 +++--- chapters/research-design.tex | 4 ++-- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 6f3b93dbc..ed6fd5d46 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -282,8 +282,8 @@ \subsection{Developing a data collection instrument} Doing this pilot with a pen-and-paper questionnaire encourages more significant revisions, as there is no need to factor in costs of re-programming, and as a result improves the overall quality of the survey instrument. -Questionnaires must also include ways to document the reasons for \textbf{attrition}, -treatment \textbf{contamination}, and \textbf{loss to follow-up}. +Questionnaires must also include ways to document the reasons for \textbf{attrition} and +treatment \textbf{contamination}. \index{attrition}\index{contamination} These are essential data components for completing CONSORT records, a standardized system for reporting enrollment, intervention allocation, follow-up, @@ -582,7 +582,7 @@ \subsection{Finalizing data collection} This reporting should be validated and saved alongside the final raw data, and treated the same way. This information should be stored as a dataset in its own right -- a \textbf{tracking dataset} -- that records all events in which survey substitutions -and loss to follow-up occurred in the field and how they were implemented and resolved. +and attrition occurred in the field and how they were implemented and resolved. 
%------------------------------------------------ \section{Collecting and sharing data securely} diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 773ae1b5b..566fe1017 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -257,7 +257,7 @@ \subsection{Cross-sectional designs} For cross-sectional designs, what needs to be carefully maintained in data is the treatment randomization process itself, as well as detailed information about differences -in data quality and loss to follow-up across groups.\cite{athey2017econometrics} +in data quality and attrition across groups.\cite{athey2017econometrics} Only these details are needed to construct the appropriate estimator: clustering of the standard errors is required at the level at which the treatment is assigned to observations, @@ -348,7 +348,7 @@ \subsection{Difference-in-differences} \url{https://blogs.worldbank.org/impactevaluations/another-reason-prefer-ancova-dealing-changes-measurement-between-baseline-and-follow}} When tracking individuals over time for this purpose, maintaining sampling and tracking records is especially important, -because attrition and loss to follow-up will remove that unit's information +because attrition will remove that unit's information from all points in time, not just the one they are unobserved in. Panel-style experiments therefore require a lot more effort in field work for studies that use primary data.\sidenote{ From f19f7d0e7dacdf3b5e1200e302aed95c8b2024fc Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 25 Feb 2020 20:08:28 -0500 Subject: [PATCH 832/854] word search: pii resolves #370 --- chapters/conclusion.tex | 4 ++-- chapters/data-analysis.tex | 22 +++++++++++----------- chapters/data-collection.tex | 28 +++++++++++++++++----------- chapters/handling-data.tex | 26 ++++++++++++++++---------- chapters/publication.tex | 4 ++-- 5 files changed, 48 insertions(+), 36 deletions(-) diff --git a/chapters/conclusion.tex b/chapters/conclusion.tex index 4ca46217f..9fafe6c8d 100644 --- a/chapters/conclusion.tex +++ b/chapters/conclusion.tex @@ -18,8 +18,8 @@ with an eye toward structuring data work. We discussed how to implement reproducible routines for sampling and randomization, and to analyze statistical power and use randomization inference. -We discussed the collection of primary data -and methods of analysis using statistical software, +We discussed data collection +and analysis methods, as well as tools and practices for making this work publicly accessible. Throughout, we emphasized that data work is a ``social process'', involving multiple team members with different roles and technical abilities. diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 9ef9adfbd..93f6275dc 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -70,7 +70,7 @@ \subsection{Organizing your folder structure} because folders are organized in exactly the same way and use the same filepaths, shortcuts, and macro references.\sidenote{ \url{https://dimewiki.worldbank.org/DataWork\_Folder}} -We created \texttt{iefolder} based on our experience with primary data, +We created \texttt{iefolder} based on our experience with survey data, but it can be used for other types of data. Other teams may prefer a different scheme, but the principle of creating a single unified standard remains. 
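A minimal sketch of setting up that structure for a new project follows; the path is a
placeholder, and \texttt{help iefolder} documents the current syntax and options:

    * Sketch only: create the standardized DataWork folder structure for a new project
    ssc install ietoolkit, replace
    iefolder new project , projectfolder("C:/Users/username/Documents/MyProject")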
@@ -181,7 +181,7 @@ \section{De-identifying research data} To do so, you will need to identify all variables that contain identifying information.\sidenote{\url{ https://www.povertyactionlab.org/sites/default/files/resources/J-PAL-guide-to-deidentifying-data.pdf}} -For primary data collection, where the research team designs the survey instrument, +For data collection, where the research team designs the survey instrument, flagging all potentially identifying variables in the questionnaire design stage simplifies the initial de-identification process. If you did not do that, or you received original data by another means, @@ -198,7 +198,7 @@ \section{De-identifying research data} where you can easily select which variables to keep or drop.\sidenote{ \url{https://dimewiki.worldbank.org/Iecodebook}} -Once you have a list of variables that contain PII, +Once you have a list of variables that contain confidential information, assess them against the analysis plan and first ask yourself for each variable: \textit{will this variable be needed for the analysis?} If not, the variable should be dropped. @@ -206,9 +206,9 @@ \section{De-identifying research data} as you can always go back and remove variables from the list of variables to be dropped, but you can not go back in time and drop a PII variable that was leaked because it was incorrectly kept. -Examples include respondent names, enumerator names, interview dates, and respondent phone numbers. -For each PII variable that is needed in the analysis, ask yourself: -\textit{can I encode or otherwise construct a variable that masks the PII, and +Examples include respondent names and phone numbers, enumerator names, taxpayer numbers, and addresses. +For each confidential variable that is needed in the analysis, ask yourself: +\textit{can I encode or otherwise construct a variable that masks the confidential component, and then drop this variable?} This is typically the case for most identifying information. Examples include geocoordinates @@ -219,7 +219,7 @@ \section{De-identifying research data} all you need to do is write a script to drop the variables that are not required for analysis, encode or otherwise mask those that are required, and save a working version of the data. -If PII variables are strictly required for the analysis itself and can not be +If confidential information is strictly required for the analysis itself and can not be masked or encoded, it will be necessary to keep at least a subset of the data encrypted through the data analysis process. @@ -301,7 +301,7 @@ \subsection{Preparing a clean dataset} The main output of data cleaning is the cleaned data set. It should contain the same information as the raw data set, with identifying variables and data entry mistakes removed. -Although primary data typically requires more extensive data cleaning than secondary data, +Although original data typically requires more extensive data cleaning than secondary data, you should carefully explore possible issues in any data you are about to use. When reviewing raw data, you will inevitably encounter data entry mistakes, such as typos and inconsistent values. @@ -358,7 +358,7 @@ \subsection{Documenting data cleaning} Include in the \texttt{Documentation} folder records of any corrections made to the data, including to duplicated entries, as well as communications where these issues are reported.
-Be very careful not to include sensitive information in documentation that is not securely stored, +Be very careful not to include confidential information in documentation that is not securely stored, or that you intend to release as part of a replication package or data publication. Another important component of data cleaning documentation is the results of data exploration. @@ -485,7 +485,7 @@ \subsection{Constructing analytical variables} drop observations explicitly, indicating why you are doing that and how the data set changed. -Finally, primary panel data involves additional timing complexities. +Finally, creating a panel with survey data involves additional timing complexities. It is common to construct indicators soon after receiving data from a new survey round. However, creating indicators for each round separately increases the risk of using different definitions every time. Having a well-established definition for each constructed variable helps prevent that mistake, @@ -697,7 +697,7 @@ \subsection{Exporting analysis outputs} If you used de-identified data for analysis, publishing the cleaned data set in a trusted repository will allow you to cite your data. Some of the documentation produced during cleaning and construction can be published -even if your data is too sensitive to be published. +even if the data cannot due to confidentiality. Your analysis code will be organized in a reproducible way, so all that will be needed to release a replication package is a last round of code review. This will allow you to focus on what matters: diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index ed6fd5d46..a2aeea503 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -587,11 +587,11 @@ \subsection{Finalizing data collection} %------------------------------------------------ \section{Collecting and sharing data securely} -All sensitive data must be handled in a way where there is no risk that anyone who is +All confidential data must be handled in a way where there is no risk that anyone who is not approved by an Institutional Review Board (IRB)\sidenote{ \url{https://dimewiki.worldbank.org/IRB\_Approval}} for the specific project has the ability to access the data. -Data can be sensitive for multiple reasons, but the two most +Data can be confidential for multiple reasons, but the two most common reasons are that it contains personally identifiable information (PII)\sidenote{ \url{https://dimewiki.worldbank.org/Personally\_Identifiable\_Information\_(PII)}} or that the partner providing the data does not want it to be released. @@ -655,9 +655,10 @@ \subsection{Collecting data securely} it is not automatically secure when it is being stored. \textbf{Encryption at rest}\sidenote{ \url{https://dimewiki.worldbank.org/Encryption\#Encryption\_at\_Rest}} -is the only way to ensure that PII data remains private when it is stored on a +is the only way to ensure that confidential data remains private when it is stored on a server on the internet. -You must keep your data encrypted on the data collection server whenever PII data is collected. +You must keep your data encrypted on the data collection server whenever PII is collected, +or when this is required by the data sharing agreement.
If you do not, the raw data will be accessible by individuals who are not approved by your IRB, such as tech support personnel, @@ -749,7 +750,7 @@ \subsection{Storing data securely} \subsection{Sharing data securely} You and your team will use your first copy of the raw data as the starting point for data cleaning and analysis of the data. -This raw data set must remain encrypted at all times if it includes PII data, +This raw data set must remain encrypted at all times if it includes confidential data, which is almost always the case. As long as the data is properly encrypted, it can be shared using insecure modes of communication @@ -760,19 +761,24 @@ \subsection{Sharing data securely} Fortunately, there is a way to simplify the workflow without compromising data security. To simplify the workflow, -the PII variables should be removed from your data at the earliest -possible opportunity creating a de-identified copy of the data. +the confidential variables should be removed from your data at the earliest possible opportunity. +This is particularly common in survey data, +as identifying variables are often only needed during data collection. +In this case, such variables may be removed as soon as the field work is completed, +creating a de-identified copy of the data. Once the data is de-identified, it no longer needs to be encrypted -- you and your team members can share it directly without having to encrypt it and handle decryption keys. The next chapter will discuss how to de-identify your data. +This may not be so straightforward when access to the data +is restricted by request of the data owner. -If PII variables are directly required for the analysis itself, +If confidential information is directly required for the analysis itself, it will be necessary to keep at least a subset of the data encrypted through the data analysis process. -The data security standards that apply when receiving PII data also apply when transferring PII data. -A common example where this is often forgotten is when sending survey information, -such as sampling lists, to a field partner. +The data security standards that apply when receiving confidential data also apply when transferring confidential data. +A common example where this is often forgotten involves sharing survey information, +such as sampling lists, with a field partner. This data is -- by all definitions -- also PII data and must be encrypted.
A sampling list can often be used to reverse identify a de-identified data set, so if you were to share it using an insecure method, diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 11615f766..1df55dde9 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -269,14 +269,21 @@ \section{Ensuring privacy and security in research data} you will also become familiar with the General Data Protection Regulation (GDPR),\sidenote{ \url{http://blogs.lshtm.ac.uk/library/2018/01/15/gdpr-for-research-data}} a set of regulations governing \textbf{data ownership} and privacy standards.\sidenote{ - \textbf{Data ownership:} the set of rights governing who may accesss, alter, use, or share data, regardless of who possesses it.} + \textbf{Data ownership:} the set of rights governing who may access, alter, use, or share data, regardless of who possesses it.} \index{data ownership} + In all settings, you should have a clear understanding of who owns your data (it may not be you, even if you collect or possess it), the rights of the people whose information is reflected there, and the necessary level of caution and risk involved in storing and transferring this information. -Due to the increasing scrutiny on many organizations +Even if your research does not involve PII, +it is a prerogative of the data owner to determine who may have access to it. +Therefore, if you are using data that was provided to you by a partner, +they have the right to request that you apply the same data security safeguards to it as you would to PII. +For the purposes of this book, +we will call any data that may not be freely accessed for these or other reasons \textbf{confidential data}. +Given the increasing scrutiny on many organizations from recently advanced data rights and regulations, these considerations are critically important. Check with your organization if you have any legal questions; @@ -285,7 +292,7 @@ \section{Ensuring privacy and security in research data} \subsection{Obtaining ethical approval and consent} -For almost all data collection and research activities that involves +For almost all data collection and research activities that involve human subjects or PII data, you will be required to complete some form of \textbf{Institutional Review Board (IRB)} process.\sidenote{ \textbf{Institutional Review Board (IRB):} An institution formally responsible for ensuring that research meets ethical standards.} @@ -310,14 +317,13 @@ \subsection{Obtaining ethical approval and consent} before it happens, as it is happening, or after it has already happened. It also means that they must explicitly and affirmatively consent to the collection, storage, and use of their information for any purpose. -Therefore, the development of appropriate consent processes is of primary importance.\sidenote{url\ - {https://www.icpsr.umich.edu/icpsrweb/content/datamanagement/confidentiality/conf-language.html}} +Therefore, the development of appropriate consent processes is of primary importance. All survey instruments must include a module in which the sampled respondent grants informed consent to participate.
Research participants must be informed of the purpose of the research, what their participation will entail in terms of duration and any procedures, any foreseeable benefits or risks, -and how their identity will be protected.\sidenote{\url - {https://www.icpsr.umich.edu/icpsrweb/content/datamanagement/confidentiality/conf-language.html}} +and how their identity will be protected.\sidenote{ + \url{https://www.icpsr.umich.edu/icpsrweb/content/datamanagement/confidentiality/conf-language.html}} There are special additional protections in place for vulnerable populations, such as minors, prisoners, and people with disabilities, and these should be confirmed with relevant authorities if your research includes them. @@ -371,11 +377,11 @@ \subsection{Transmitting and storing data securely} computer it must always remain encrypted, and confidential data may never be sent unencrypted over email, WhatsApp, or other chat services. -The easiest way to reduce the risk of leaking personal information is to use it as rarely as possible. +The easiest way to reduce the risk of leaking confidential information is to use it as rarely as possible. It is often very simple to conduct planning and analytical work -using a subset of the data that has been \textbf{de-identified}. +using a subset of the data that does not include this type of information. We encourage this approach, because it is easy. -However, when PII is absolutely necessary to a task, +However, when confidential data is absolutely necessary to a task, such as implementing an intervention or submitting survey data, you must actively protect that information in transmission and storage. diff --git a/chapters/publication.tex b/chapters/publication.tex index 0cb2afd01..9006a2846 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -399,8 +399,8 @@ \subsection{De-identifying data for publication} The best thing you can do is make a complete record of the steps that have been taken so that the process can be reviewed, revised, and updated as necessary. -In cases where PII data is required for analysis, -we recommend embargoing the sensitive variables when publishing the data. +In cases where confidential data is required for analysis, +we recommend embargoing sensitive or access-restricted variables when publishing the data set. Access to the embargoed data could be granted for specific purposes, such as a computational reproducibility check required for publication, if done under careful data security protocols and approved by an IRB. From 8c85c31dce650e2aafe84a05f34b623a2e80726d Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 25 Feb 2020 20:09:16 -0500 Subject: [PATCH 833/854] word search: primary --- chapters/data-collection.tex | 4 ++-- chapters/research-design.tex | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index a2aeea503..99da6fc01 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -204,7 +204,7 @@ \subsection{Receiving data from development partners} that may provide even more information about specific individuals. %------------------------------------------------ -\section{Collecting primary data using electronic surveys} +\section{Collecting data using electronic surveys} If you are collecting data directly from the research subjects yourself, you are most likely designing and fielding an electronic survey. 
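The de-identification workflow described in the hunks above, namely listing the identifying variables, dropping the ones the analysis does not need, masking the ones it does need, and saving a working copy, usually amounts to a very short script. A hedged Stata sketch follows; every variable, file, and folder name in it is hypothetical, and the actual list of variables to drop or mask depends on the project's own flagging of identifying fields:

    * Load the identified raw data (file and folder names are placeholders)
    use "${raw}/household-survey-raw.dta", clear

    * Drop direct identifiers that are not needed for analysis
    drop respondent_name phone_number enumerator_name gps_latitude gps_longitude

    * Mask an identifier that is needed: replace the village name with an anonymized code
    egen village_code = group(village_name)
    drop village_name

    * Save a working de-identified copy that the team can share without encryption
    save "${intermediate}/household-survey-deid.dta", replace

The encrypted raw file stays untouched; only the de-identified working copy circulates within the team.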
@@ -541,7 +541,7 @@ \subsection{Conducting back-checks and data validation} Careful validation of data is essential for high-quality data. Since we cannot control natural measurement error that comes from variation in the realization of key outcomes, -primary data collection provides the opportunity to make sure +original data collection provides the opportunity to make sure that there is no error arising from inaccuracies in the data itself. \textbf{Back-checks}\sidenote{\url{https://dimewiki.worldbank.org/Back_Checks}} and other validation audits help ensure that data collection is following established protocols, diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 566fe1017..fb5c31a55 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -351,7 +351,7 @@ \subsection{Difference-in-differences} because attrition will remove that unit's information from all points in time, not just the one they are unobserved in. Panel-style experiments therefore require a lot more effort in field work -for studies that use primary data.\sidenote{ +for studies that use original data.\sidenote{ \url{https://www.princeton.edu/~otorres/Panel101.pdf}} Since baseline and endline may be far apart in time, it is important to create careful records during the first round From f6d20e1782d4b9f10fae9b52d12c143cd5b106fc Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 25 Feb 2020 20:11:32 -0500 Subject: [PATCH 834/854] word search: take-up #295 --- chapters/research-design.tex | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/chapters/research-design.tex b/chapters/research-design.tex index fb5c31a55..1c4e4364a 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -186,15 +186,15 @@ \subsection{Experimental and quasi-experimental research designs} This feature is called randomization noise, and all RCTs share the need to assess how randomization noise may impact the estimates that are obtained. (More detail on this later.) -Second, takeup and implementation fidelity are extremely important, +Second, take-up and implementation fidelity are extremely important, since programs will by definition have no effect if the population intended to be treated does not accept or does not receive the treatment. Loss of statistical power occurs quickly and is highly nonlinear: -70\% takeup or efficacy doubles the required sample, and 50\% quadruples it.\sidenote{ +70\% take-up or efficacy doubles the required sample, and 50\% quadruples it.\sidenote{ \url{https://blogs.worldbank.org/impactevaluations/power-calculations-101-dealing-with-incomplete-take-up}} Such effects are also very hard to correct ex post, -since they require strong assumptions about the randomness or non-randomness of takeup. +since they require strong assumptions about the randomness or non-randomness of take-up. Therefore a large amount of field time and descriptive work must be dedicated to understanding how these effects played out in a given study, and may overshadow the effort put into the econometric design itself. @@ -450,12 +450,12 @@ \subsection{Instrumental variables} begin by assuming that the treatment delivered in the study in question is linked to the outcome in a pattern such that its effect is not directly identifiable. 
Instead, similar to regression discontinuity designs, -IV attempts to focus on a subset of the variation in treatment uptake +IV attempts to focus on a subset of the variation in treatment take-up and assesses that limited window of variation that can be argued to be unrelated to other factors.\cite{angrist2001instrumental} To do so, the IV approach selects an \textbf{instrument} for the treatment status -- an otherwise-unrelated predictor of exposure to treatment -that affects the uptake status of an individual.\sidenote{ +that affects the take-up status of an individual.\sidenote{ \url{https://dimewiki.worldbank.org/instrumental_variables}} Whereas regression discontinuity designs are ``sharp'' -- treatment status is completely determined by which side of a cutoff an individual is on -- @@ -484,8 +484,8 @@ \subsection{Instrumental variables} are usually required with an instrumental variables analysis. However, the method has special experimental cases that are significantly easier to assess: for example, a randomized treatment \textit{assignment} can be used as an instrument -for the eventual uptake of the treatment itself, -especially in cases where uptake is expected to be low, +for the eventual take-up of the treatment itself, +especially in cases where take-up is expected to be low, or in circumstances where the treatment is available to those who are not specifically assigned to it (``encouragement designs''). From 77d26799e524e24c87486be958c3907045489182 Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 25 Feb 2020 20:14:14 -0500 Subject: [PATCH 835/854] word search: file path #295 --- chapters/data-analysis.tex | 2 +- chapters/planning-data-work.tex | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 93f6275dc..24afc88a5 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -68,7 +68,7 @@ \subsection{Organizing your folder structure} A standardized structure greatly reduces the costs that PIs and RAs face when switching between projects, because folders are organized in exactly the same way -and use the same filepaths, shortcuts, and macro references.\sidenote{ +and use the same file paths, shortcuts, and macro references.\sidenote{ \url{https://dimewiki.worldbank.org/DataWork\_Folder}} We created \texttt{iefolder} based on our experience with survey data, but it can be used for other types of data. diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 34d79c3f4..44f3c1f82 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -105,7 +105,7 @@ \subsection{Setting up your computer} \index{file paths} On MacOS this will be something like \path{/users/username/git/project/...}, and on Windows, \path{C:/users/username/git/project/...}. -Use forward slashes (\texttt{/}) in filepaths for folders, +Use forward slashes (\texttt{/}) in file paths for folders, and whenever possible use only A-Z (the 26 English characters), dashes (\texttt{-}), and underscores (\texttt{\_}) in folder names and filenames. 
For emphasis: \textit{always} use forward slashes (\texttt{/}) in file paths in code, From 24d0cf003d3dcc57492e0350234e3bc46cf17f8e Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 25 Feb 2020 20:19:55 -0500 Subject: [PATCH 836/854] word search: data set --- appendix/stata-guide.tex | 6 +++--- chapters/data-analysis.tex | 26 +++++++++++++------------- chapters/data-collection.tex | 32 ++++++++++++++++---------------- chapters/handling-data.tex | 10 +++++----- chapters/planning-data-work.tex | 10 +++++----- chapters/publication.tex | 20 ++++++++++---------- chapters/research-design.tex | 2 +- code/code.do | 2 +- code/replicability.do | 2 +- code/stata-comments.do | 4 ++-- 10 files changed, 57 insertions(+), 57 deletions(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index 1e309402d..776ca58e6 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -44,7 +44,7 @@ \section{Using the code examples in this book} We use GitHub to version control everything in this book, the code included. To see the code on GitHub, go to: \url{https://github.com/worldbank/d4di/tree/master/code}. If you are familiar with GitHub you can fork the repository and clone your fork. -We only use Stata's built-in datasets in our code examples, +We only use Stata's built-in data sets in our code examples, so you do not need to download any data. If you have Stata installed on your computer, then you will already have the data files used in the code. @@ -222,7 +222,7 @@ \subsection{Abbreviating variables} Using wildcards and lists in Stata for variable lists (\texttt{*}, \texttt{?}, and \texttt{-}) is also discouraged, because the functionality of the code may change -if the dataset is changed or even simply reordered. +if the data set is changed or even simply reordered. If you intend explicitly to capture all variables of a certain type, prefer \texttt{unab} or \texttt{lookfor} to build that list in a local macro, which can then be checked to have the right variables in the right order. @@ -429,7 +429,7 @@ \subsection{Saving data} ID variables are also perfect variables to sort on, and to \texttt{order} first in the data set. -The command \texttt{compress} makes the data set smaller in terms of memory usage +The command \texttt{compress} makes the data setsmaller in terms of memory usage without ever losing any information. It optimizes the storage types for all variables and therefore makes it smaller on your computer diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 24afc88a5..b767a2230 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -90,7 +90,7 @@ \subsection{Organizing your folder structure} \subsection{Breaking down tasks} -We divide the process of transforming raw datasets to analysis-ready datasets into four steps: +We divide the process of transforming raw data sets to analysis-ready data sets into four steps: de-identification, data cleaning, variable construction, and data analysis. Though they are frequently implemented concurrently, creating separate scripts and data sets prevents mistakes. @@ -151,7 +151,7 @@ \subsection{Implementing version control} \section{De-identifying research data} -The starting point for all tasks described in this chapter is the raw dataset, +The starting point for all tasks described in this chapter is the raw data set, which should contain the exact data received, with no changes or additions. 
The raw data will invariably come in a variety of file formats and these files should be saved in the raw data folder \textit{exactly as they were @@ -173,8 +173,8 @@ \section{De-identifying research data} You will only keep working from the fixed copy, but you keep both copies in case you later realize that the manual fix was done incorrectly. -The first step in the transformation of raw data to an analysis-ready dataset is de-identification. -This simplifies workflows, as once you create a de-identified version of the dataset, +The first step in the transformation of raw data to an analysis-ready data set is de-identification. +This simplifies workflows, as once you create a de-identified version of the data set, you no longer need to interact directly with the encrypted raw data. at this stage, means stripping the data set of personally identifying information.\sidenote{ \url{https://dimewiki.worldbank.org/De-identification}} @@ -236,7 +236,7 @@ \section{Cleaning data for analysis} The cleaning process involves (1) making the data set easy to use and understand, and (2) documenting individual data points and patterns that may bias the analysis. The underlying data structure does not change. -The cleaned data set should contain only the variables collected in the field. +The cleaned data setshould contain only the variables collected in the field. No modifications to data points are made at this stage, except for corrections of mistaken entries. Cleaning is probably the most time-consuming of the stages discussed in this chapter. @@ -281,7 +281,7 @@ \subsection{Labeling, annotating, and finalizing clean data} \index{iecodebook} First, \textbf{renaming}: for data with an accompanying survey instrument, -it is useful to keep the same variable names in the cleaned dataset as in the survey instrument. +it is useful to keep the same variable names in the cleaned data set as in the survey instrument. That way it's straightforward to link variables to the relevant survey question. Second, \textbf{labeling}: applying labels makes it easier to understand your data as you explore it, and thus reduces the risk of small errors making their way through into the analysis stage. @@ -295,9 +295,9 @@ \subsection{Labeling, annotating, and finalizing clean data} Open-ended responses stored as strings usually have a high risk of being identifiers, so they should be encoded into categories as much as possible and raw data points dropped. You can use the encrypted data as an input to a construction script -that categorizes these responses and merges them to the rest of the dataset. +that categorizes these responses and merges them to the rest of the data set. -\subsection{Preparing a clean dataset} +\subsection{Preparing a clean data set} The main output of data cleaning is the cleaned data set. It should contain the same information as the raw data set, with identifying variables and data entry mistakes removed. @@ -310,7 +310,7 @@ \subsection{Preparing a clean dataset} and how the correct value was obtained.\sidenote{\url{ https://dimewiki.worldbank.org/Data\_Cleaning}} -The clean dataset should always be accompanied by a dictionary or codebook. +The clean data setshould always be accompanied by a dictionary or codebook. Survey data should be easily traced back to the survey instrument. 
Typically, one cleaned data set will be created for each data source or survey instrument; and each row in the cleaned data set represents one @@ -399,13 +399,13 @@ \section{Constructing final indicators} you may have one or multiple constructed data sets, depending on how your analysis is structured. Don't worry if you cannot create a single, ``canonical'' analysis data set. -It is common to have many purpose-built analysis datasets. +It is common to have many purpose-built analysis data sets. Think of an agricultural intervention that was randomized across villages and only affected certain plots within each village. The research team may want to run household-level regressions on income, test for plot-level productivity gains, and check if village characteristics are balanced. -Having three separate datasets for each of these three pieces of analysis +Having three separate data sets for each of these three pieces of analysis will result in much cleaner do-files than if they all started from the same data set. % From cleaning @@ -443,7 +443,7 @@ \section{Constructing final indicators} \subsection{Constructing analytical variables} New variables created during the construction stage should be added to the data set, instead of overwriting the original information. New variables should be assigned functional names. -Ordering the data set so that related variables are together, +Ordering the data setso that related variables are together, and adding notes to each of them as necessary will make your data set more user-friendly. Before constructing new variables, @@ -574,7 +574,7 @@ \subsection{Organizing analysis code} To accomplish this, you will need to make sure that you have an effective data management system, including naming, file organization, and version control. -Just like you did with each of the analysis datasets, +Just like you did with each of the analysis data sets, name each of the individual analysis files descriptively. Code files such as \path{spatial-diff-in-diff.do}, \path{matching-villages.R}, and \path{summary-statistics.py} diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 99da6fc01..65d0883e6 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -12,7 +12,7 @@ assured that your data has been obtained at high standards of both quality and security. The chapter begins with a discussion of some key ethical and legal descriptions -to ensure that you have the right to do research using a specific dataset. +to ensure that you have the right to do research using a specific data set. Particularly when confidential data is collected by you and your team or shared with you by a program implementer, government, or other partner, you need to make sure permissions are correctly granted and documented. @@ -49,7 +49,7 @@ \section{Acquiring data} private sector partnerships granting access to new data sources, including administrative and sensor data; digitization of paper records, including administrative data; primary data capture by unmanned aerial vehicles or other types of remote sensing; -or novel integration of various types of datasets, e.g. combining survey and sensor data. +or novel integration of various types of data sets, e.g. combining survey and sensor data. Except in the case of primary surveys funded by the research team, the data is typically not owned by the research team. 
Data ownership and licensing agreements are required @@ -91,10 +91,10 @@ \subsection{Data ownership} \subsection{Data licensing agreements} Data licensing is the formal act of giving some data rights to others -while retaining ownership of a particular dataset. -If you are not the owner of the dataset you want to analyze, +while retaining ownership of a particular data set. +If you are not the owner of the data set you want to analyze, you should enter into a licensing or terms-of-use agreement to access it for research purposes. -Similarly, when you own a dataset, +Similarly, when you own a data set, you must consider whether the data can be made accessible to other researchers, and what terms-of-use you require. @@ -117,7 +117,7 @@ \subsection{Data licensing agreements} a license for all uses of derivative works, including public distribution (unless ethical considerations contraindicate this). This is important to allow the research team to store, catalog, and publish, in whole or in part, -either the original licensed dataset or datasets derived from the original. +either the original licensed data set or data sets derived from the original. Make sure that the license you obtain from the data owner allows these uses, and that you consult with the owner if you foresee exceptions with specific portions of the data. @@ -166,7 +166,7 @@ \subsection{Receiving data from development partners} Another important consideration at this stage is proper documentation and cataloging of data and associated metadata. -It is not always clear what pieces of information jointly constitute a ``dataset'', +It is not always clear what pieces of information jointly constitute a ``data set'', and many of the sources you receive data from will not be organized for research. To help you keep organized and to put some structure on the materials you will be receiving, you should always retain the original data as received @@ -183,13 +183,13 @@ \subsection{Receiving data from development partners} what they measure, and how they are to be used. In the case of survey data, this includes the survey instrument and associated manuals; the sampling protocols and field adherence to those protocols, and any sampling weights; -what variable(s) uniquely identify the dataset(s), and how different datasets can be linked; +what variable(s) uniquely identify the data set(s), and how different data sets can be linked; and a description of field procedures and quality controls. We use as a standard the Data Documentation Initiative (DDI), which is supported by the World Bank's Microdata Catalog.\sidenote{\url{https://microdata.worldbank.org}} As soon as the requisite pieces of information are stored together, -think about which ones are the components of what you would call a dataset. +think about which ones are the components of what you would call a data set. This is more of an art than a science: you want to keep things together that belong together, but you also want to keep things apart that belong apart. @@ -200,7 +200,7 @@ \subsection{Receiving data from development partners} as you move towards the publication part of the research process. This may require you to re-check with the provider about what portions are acceptable to license, -particularly if you are combining various datasets +particularly if you are combining various data sets that may provide even more information about specific individuals. 
%------------------------------------------------ @@ -416,7 +416,7 @@ \subsection{Programming electronic questionnaires} All survey softwares include debugging and test options to correct syntax errors and make sure that the survey instruments will successfully compile. -This is not sufficient, however, to ensure that the resulting dataset +This is not sufficient, however, to ensure that the resulting data set will load without errors in your data analysis software of choice. We developed the \texttt{ietestform} command,\sidenote{ \url{https://dimewiki.worldbank.org/ietestform}} @@ -437,7 +437,7 @@ \subsection{Programming electronic questionnaires} A second survey pilot should be done after the questionnaire is programmed. The objective of this \textbf{data-focused pilot}\sidenote{ \url{https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Data)}} -is to validate the programming and export a sample dataset. +is to validate the programming and export a sample data set. Significant desk-testing of the instrument is required to debug the programming as fully as possible before going to the field. It is important to plan for multiple days of piloting, @@ -452,7 +452,7 @@ \section{Data quality assurance} it is important to make sure that data faithfully reflects ground realities. Data quality assurance requires a combination of real-time data checks and back-checks or validation audits, which often means tracking down -the people whose information is in the dataset. +the people whose information is in the data set. \subsection{Implementing high frequency quality checks} @@ -572,7 +572,7 @@ \subsection{Conducting back-checks and data validation} \subsection{Finalizing data collection} When all data collection is complete, the survey team should prepare a final field report, -which should report reasons for any deviations between the original sample and the dataset collected. +which should report reasons for any deviations between the original sample and the data set collected. Identification and reporting of \textbf{missing data} and \textbf{attrition} is critical to the interpretation of survey data. It is important to structure this reporting in a way that not only @@ -580,8 +580,8 @@ \subsection{Finalizing data collection} but also collects all the detailed, open-ended responses to questions the field team can provide for any observations that they were unable to complete. This reporting should be validated and saved alongside the final raw data, and treated the same way. -This information should be stored as a dataset in its own right --- a \textbf{tracking dataset} -- that records all events in which survey substitutions +This information should be stored as a data set in its own right +-- a \textbf{tracking data set} -- that records all events in which survey substitutions and attrition occurred in the field and how they were implemented and resolved. %------------------------------------------------ diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 1df55dde9..f6c1e053b 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -100,7 +100,7 @@ \subsection{Research reproducibility} producing these kinds of resources can lead to that as well. Therefore, your code should be written neatly with clear instructions and published openly. It should be easy to read and understand in terms of structure, style, and syntax. 
-Finally, the corresponding dataset should be openly accessible +Finally, the corresponding data setshould be openly accessible unless for legal or ethical reasons it cannot be.\sidenote{ \url{https://dimewiki.worldbank.org/Publishing_Data}} @@ -424,7 +424,7 @@ \subsection{De-identifying data} You can take simple steps to avoid risks by minimizing the handling of PII. First, only collect information that is strictly needed for the research. Second, avoid the proliferation of copies of identified data. -There should never be more than one copy of the raw identified dataset in the project folder, +There should never be more than one copy of the raw identified data set in the project folder, and it must always be encrypted. Even within the research team, access to PII data should be limited to team members who require it for specific analysis @@ -435,7 +435,7 @@ \subsection{De-identifying data} Therefore, once data is securely collected and stored, the first thing you will generally do is \textbf{de-identify} it, -that is, remove direct identifiers of the individuals in the dataset.\sidenote{ +that is, remove direct identifiers of the individuals in the data set.\sidenote{ \url{https://dimewiki.worldbank.org/De-identification}} \index{de-identification} Note, however, that it is in practice impossible to \textbf{anonymize} data. @@ -446,13 +446,13 @@ \subsection{De-identifying data} For this reason, we recommend de-identification in two stages. The \textbf{initial de-identification} process strips the data of direct identifiers as early in the process as possible, -to create a working de-identified dataset that +to create a working de-identified data set that can be shared \textit{within the research team} without the need for encryption. This simplifies workflows. The \textbf{final de-identification} process involves making a decision about the trade-off between risk of disclosure and utility of the data -before publicly releasing a dataset.\sidenote{ +before publicly releasing a data set.\sidenote{ \url{https://sdcpractice.readthedocs.io/en/latest/SDC\_intro.html\#need-for-sdc}} We will provide more detail about the process and tools available for initial and final de-identification in Chapters 6 and 7, respectively. diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index 44f3c1f82..e24e79651 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -219,8 +219,8 @@ \subsection{Choosing software} Take into account the different levels of techiness of team members, how important it is to access files offline constantly, as well as the type of data you will need to access and the security needed. -Big datasets require additional infrastructure and may overburden -the traditional tools used for small datasets, +Big data sets require additional infrastructure and may overburden +the traditional tools used for small data sets, particularly if you are trying to sync or collaborate on them. Also consider the cost of licenses, the time to learn new tools, and the stability of the tools. @@ -246,7 +246,7 @@ \subsection{Choosing software} Next, think about how and where you write and execute code. This book is intended to be agnostic to the size or origin of your data, -but we are going to broadly assume that you are using desktop-sized datasets +but we are going to broadly assume that you are using desktop-sized data sets in one of the two most popular desktop-based packages: R or Stata. 
(If you are using another language, like Python, or working with big data projects on a server installation, @@ -366,7 +366,7 @@ \subsection{Organizing files and folder structures} be stored in a synced folder that is shared with other people. Those two types of collaboration tools function very differently and will almost always create undesired functionality if combined.) -Nearly all code files and raw outputs (not datasets) are best managed this way. +Nearly all code files and raw outputs (not data sets) are best managed this way. This is because code files are always \textbf{plaintext} files, and non-code-compatiable files are usually \textbf{binary} files.\index{plaintext}\index{binary files} It's also becoming more and more common for written outputs such as reports, @@ -416,7 +416,7 @@ \subsection{Organizing files and folder structures} % ---------------------------------------------------------------------------------------------- \subsection{Documenting and organizing code} Once you start a project's data work, -the number of scripts, datasets, and outputs that you have to manage will grow very quickly. +the number of scripts, data sets, and outputs that you have to manage will grow very quickly. This can get out of hand just as quickly, so it's important to organize your data work and follow best practices from the beginning. Adjustments will always be needed along the way, diff --git a/chapters/publication.tex b/chapters/publication.tex index 9006a2846..c4b523cde 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -267,13 +267,13 @@ \section{Preparing a complete replication package} and the replication file should not include any documentation or data you would not share publicly. This usually means removing project-related documentation such as contracts and details of data collection and other field work, -and double-checking all datasets for potentially identifying information. +and double-checking all data sets for potentially identifying information. \subsection{Publishing data for replication} Publicly documenting all original data generated as part of a research project is an important contribution in its own right. -Publishing original datasets is a significant contribution that can be made +Publishing original data sets is a significant contribution that can be made in addition to any publication of analysis results.\sidenote{ \url{https://www.povertyactionlab.org/sites/default/files/resources/J-PAL-guide-to-publishing-research-data.pdf}} If you are not able to publish the data itself, @@ -282,10 +282,10 @@ \subsection{Publishing data for replication} These may take the form of metadata catalogs or embargoed releases. Such setups allow you to hold an archival version of your data which your publication can reference, -and provide information about the contents of the datasets +and provide information about the contents of the data sets and how future users might request permission to access them (even if you are not the person to grant that permission). -They can also provide for timed future releases of datasets +They can also provide for timed future releases of data sets once the need for exclusive access has ended. 
Publishing data allows other researchers to validate the mechanical construction of your results, @@ -317,7 +317,7 @@ \subsection{Publishing data for replication} When your raw data is owned by someone else, or for any other reason you are not able to publish it, -in many cases you will still have the right to release derivate datasets, +in many cases you will still have the right to release derivate data sets, even if it is just the indicators you constructed and their documentation.\sidenote{ \url{https://guide-for-data-archivists.readthedocs.io}} If you have questions about your rights over original or derived materials, @@ -331,7 +331,7 @@ \subsection{Publishing data for replication} \url{https://microdata.worldbank.org/index.php/terms-of-use}} Open Access data is freely available to anyone, and simply requires attribution. Direct Access data is to registered users who agree to use the data for statistical and scientific research purposes only, -to cite the data appropriately, and to not attempt to identify respondents or data providers or link to other datasets that could allow for re-identification. +to cite the data appropriately, and to not attempt to identify respondents or data providers or link to other data sets that could allow for re-identification. Licensed access data is restricted to bona fide users, who submit a documented application for how they will use the data and sign an agreement governing data use. The user must be acting on behalf of an organization, which will be held responsible in the case of any misconduct. Keep in mind that you may or may not own your data, @@ -340,9 +340,9 @@ \subsection{Publishing data for replication} is at the time that data collection or sharing agreements are signed. Published data should be released in a widely recognized format. -While software-specific datasets are acceptable accompaniments to the code +While software-specific data sets are acceptable accompaniments to the code (since those precise materials are probably necessary), -you should also consider releasing generic datasets +you should also consider releasing generic data sets such as CSV files with accompanying codebooks, since these can be used by any researcher. Additionally, you should also release @@ -351,7 +351,7 @@ \subsection{Publishing data for replication} collected directly in the field and which are derived. If possible, you should publish both a clean version of the data which corresponds exactly to the original database or questionnaire -as well as the constructed or derived dataset used for analysis. +as well as the constructed or derived data set used for analysis. You should also release the code that constructs any derived measures, particularly where definitions may vary, @@ -388,7 +388,7 @@ \subsection{De-identifying data for publication} There will almost always be a trade-off between accuracy and privacy. For publicly disclosed data, you should favor privacy. -Stripping identifying variables from a dataset may not be sufficient to protect respondent privacy, +Stripping identifying variables from a data set may not be sufficient to protect respondent privacy, due to the risk of re-identification. One potential solution is to add noise to data, as the US Census Bureau has proposed.\cite{abowd2018us} This makes the trade-off between data accuracy and privacy explicit. 
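One way to follow the recommendation above to release data in widely recognized formats is to export a generic copy and a plain-text codebook alongside the software-specific file. The Stata sketch below assumes placeholder folder globals and file names and is meant only to illustrate the idea:

    * Load the final analysis data (names are placeholders)
    use "${final}/analysis-data.dta", clear

    * Software-specific copy used by the replication code
    save "${release}/analysis-data.dta", replace

    * Generic CSV copy that any statistical software can read
    export delimited using "${release}/analysis-data.csv", replace

    * Plain-text codebook of variable names, formats, and labels
    log using "${release}/analysis-data-codebook.txt", text replace
    describe
    codebook, compact
    log close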
diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 1c4e4364a..4f2df3fd2 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -597,7 +597,7 @@ \subsection{Synthetic controls} The counterfactual blend is chosen by optimizing the prediction of past outcomes based on the potential input characteristics, and typically selects a small set of comparators to weight into the final analysis. -These datasets therefore may not have a large number of variables or observations, +These data sets therefore may not have a large number of variables or observations, but the extent of the time series both before and after the implementation of the treatment are key sources of power for the estimate, as are the number of counterfactual units available. diff --git a/code/code.do b/code/code.do index 4c217674f..f18b9adc3 100644 --- a/code/code.do +++ b/code/code.do @@ -1,4 +1,4 @@ -* Load the auto dataset +* Load the auto data set sysuse auto.dta , clear * Run a simple regression diff --git a/code/replicability.do b/code/replicability.do index b398efa0f..c1d166b5b 100644 --- a/code/replicability.do +++ b/code/replicability.do @@ -2,7 +2,7 @@ ieboilstart , v(13.1) `r(version)' -* Load the auto dataset (auto.dta is a test data set included in all Stata installations) +* Load the auto data set (auto.dta is a test data set included in all Stata installations) sysuse auto.dta , clear * SORTING - sort on the uniquely identifying variable "make" diff --git a/code/stata-comments.do b/code/stata-comments.do index 911d48e2a..e494462ff 100644 --- a/code/stata-comments.do +++ b/code/stata-comments.do @@ -15,5 +15,5 @@ TYPE 2: TYPE 3: -* Open the dataset - sysuse auto.dta // Built in dataset (This comment is used to document a single line) +* Open the data set + sysuse auto.dta // Built in data set (This comment is used to document a single line) From 3264a2da1ff46bd832b5fdf2bfd36ebc93616305 Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 25 Feb 2020 20:21:39 -0500 Subject: [PATCH 837/854] word search: field word #295 --- chapters/data-collection.tex | 6 +++--- chapters/sampling-randomization-power.tex | 2 +- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 65d0883e6..33b6a61a5 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -346,7 +346,7 @@ \subsection{Designing surveys for electronic deployment} but allows for much more comprehensible top-line statistics and data quality checks. Rigorous field testing is required to ensure that answer categories are comprehensive; however, it is best practice to include an \textit{other, specify} option. -Keep track of those responses in the first few weeks of fieldwork. +Keep track of those responses in the first few weeks of field work. Adding an answer category for a response frequently showing up as \textit{other} can save time, as it avoids extensive post-coding. @@ -423,7 +423,7 @@ \subsection{Programming electronic questionnaires} part of the Stata package \texttt{iefieldkit}, to implement a form-checking routine for \textbf{SurveyCTO}, a proprietary implementation of the \textbf{Open Data Kit (ODK)} software. -Intended for use during questionnaire programming and before fieldwork, +Intended for use during questionnaire programming and before field work, \texttt{ietestform} tests for best practices in coding, naming and labeling, and choice lists. 
Although \texttt{ietestform} is software-specific, many of the tests it runs are general and important to consider regardless of software choice. @@ -533,7 +533,7 @@ \subsection{Implementing high frequency quality checks} by adding scripts to link the HFCs with a messaging program such as WhatsApp. Any of these solutions are possible: what works best for your team will depend on such variables as -cellular networks in fieldwork areas, whether field supervisors have access to laptops, +cellular networks in field work areas, whether field supervisors have access to laptops, internet speed, and coding skills of the team preparing the HFC workflows. \subsection{Conducting back-checks and data validation} diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 3d2c4e4b8..2f7dbdd51 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -24,7 +24,7 @@ Power calculation and randomization inference are the main methods by which these probabilities of error are assessed. These analytical dimensions are particularly important in the initial phases of development research -- -typically conducted well before any actual fieldwork occurs -- +typically conducted well before any actual field work occurs -- and often have implications for feasibility, planning, and budgeting. In this chapter, we first cover the necessary practices to ensure that random processes are reproducible. From c78c3d038df021a4d420e1e8e9f1d0b3e3462770 Mon Sep 17 00:00:00 2001 From: Luiza Andrade Date: Tue, 25 Feb 2020 20:25:47 -0500 Subject: [PATCH 838/854] Update chapters/data-analysis.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- chapters/data-analysis.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 611aecaab..ba49aaeda 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -374,7 +374,7 @@ \section{Constructing analysis datasets} % What is construction ------------------------------------- The third stage is construction of the dataset you will use for analysis. -It is at this stage that the raw data is transformed into analysis-ready data, +It is at this stage that the cleaned data is transformed into analysis-ready data, by integrating different datasets and creating derived variables (dummies, indices, and interactions, to name a few), as planned during research design\index{Research design}, From b48571df16a4d777b430cb9cca53fb3235a5130b Mon Sep 17 00:00:00 2001 From: Luiza Andrade Date: Tue, 25 Feb 2020 20:28:39 -0500 Subject: [PATCH 839/854] Update chapters/data-analysis.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- chapters/data-analysis.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index ba49aaeda..d1270140f 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -444,7 +444,7 @@ \subsection{Integrating different data sources} If the datasets you need to join have the same unit of observation, merging may be straightforward. The simplest case is merging datasets at the same unit of observation -which use a consistent numeric ID. +which use a consistent, uniquely and fully identifying ID variable. 
For example, in the case of a panel survey for firms, you may merge baseline and endline data using the firm identification number. In many cases, however, From ab13ae60668ae2fa4342de45f6fd82464f8b6c77 Mon Sep 17 00:00:00 2001 From: Luiza Andrade Date: Tue, 25 Feb 2020 20:34:17 -0500 Subject: [PATCH 840/854] Update appendix/stata-guide.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- appendix/stata-guide.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index 776ca58e6..b0109a86f 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -429,7 +429,7 @@ \subsection{Saving data} ID variables are also perfect variables to sort on, and to \texttt{order} first in the data set. -The command \texttt{compress} makes the data setsmaller in terms of memory usage +The command \texttt{compress} makes the data set smaller in terms of memory usage without ever losing any information. It optimizes the storage types for all variables and therefore makes it smaller on your computer From c7ffcb2bb1f756292c0289863ea64ed04534d817 Mon Sep 17 00:00:00 2001 From: Luiza Andrade Date: Tue, 25 Feb 2020 20:34:50 -0500 Subject: [PATCH 841/854] Update chapters/data-analysis.tex --- chapters/data-analysis.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index b767a2230..31fd25269 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -236,7 +236,7 @@ \section{Cleaning data for analysis} The cleaning process involves (1) making the data set easy to use and understand, and (2) documenting individual data points and patterns that may bias the analysis. The underlying data structure does not change. -The cleaned data setshould contain only the variables collected in the field. +The cleaned data set should contain only the variables collected in the field. No modifications to data points are made at this stage, except for corrections of mistaken entries. Cleaning is probably the most time-consuming of the stages discussed in this chapter. From 98c69b46d2fb9be123a61a2546f5ac97467dcd12 Mon Sep 17 00:00:00 2001 From: Luiza Andrade Date: Tue, 25 Feb 2020 20:35:15 -0500 Subject: [PATCH 842/854] Update chapters/data-analysis.tex --- chapters/data-analysis.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 31fd25269..a9e711fbf 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -310,7 +310,7 @@ \subsection{Preparing a clean data set} and how the correct value was obtained.\sidenote{\url{ https://dimewiki.worldbank.org/Data\_Cleaning}} -The clean data setshould always be accompanied by a dictionary or codebook. +The cleaned data set should always be accompanied by a dictionary or codebook. Survey data should be easily traced back to the survey instrument. 
Typically, one cleaned data set will be created for each data source or survey instrument; and each row in the cleaned data set represents one From 3d45d8de8c57cca02e715f361c0dd5252660643c Mon Sep 17 00:00:00 2001 From: Luiza Andrade Date: Tue, 25 Feb 2020 20:35:48 -0500 Subject: [PATCH 843/854] Update chapters/data-analysis.tex --- chapters/data-analysis.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index a9e711fbf..ea11f4b4e 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -443,7 +443,7 @@ \section{Constructing final indicators} \subsection{Constructing analytical variables} New variables created during the construction stage should be added to the data set, instead of overwriting the original information. New variables should be assigned functional names. -Ordering the data setso that related variables are together, +Ordering the data set so that related variables are together, and adding notes to each of them as necessary will make your data set more user-friendly. Before constructing new variables, From 267bf97e59d496384baac988cd9316fe003c1567 Mon Sep 17 00:00:00 2001 From: Luiza Andrade Date: Tue, 25 Feb 2020 20:36:32 -0500 Subject: [PATCH 844/854] Update chapters/handling-data.tex --- chapters/handling-data.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index f6c1e053b..6cf11690f 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -100,7 +100,7 @@ \subsection{Research reproducibility} producing these kinds of resources can lead to that as well. Therefore, your code should be written neatly with clear instructions and published openly. It should be easy to read and understand in terms of structure, style, and syntax. -Finally, the corresponding data setshould be openly accessible +Finally, the corresponding data set should be openly accessible unless for legal or ethical reasons it cannot be.\sidenote{ \url{https://dimewiki.worldbank.org/Publishing_Data}} From 37e3f12f4530ec16ad8657205d0ff4a131b913c5 Mon Sep 17 00:00:00 2001 From: Luiza Date: Tue, 25 Feb 2020 20:44:03 -0500 Subject: [PATCH 845/854] word search: dataset! --- appendix/stata-guide.tex | 16 +-- chapters/data-analysis.tex | 126 +++++++++++----------- chapters/data-collection.tex | 40 +++---- chapters/handling-data.tex | 12 +-- chapters/introduction.tex | 2 +- chapters/planning-data-work.tex | 20 ++-- chapters/publication.tex | 26 ++--- chapters/research-design.tex | 2 +- chapters/sampling-randomization-power.tex | 14 +-- code/code.do | 2 +- code/replicability.do | 2 +- code/simple-sample.do | 2 +- code/stata-before-saving.do | 2 +- code/stata-comments.do | 4 +- 14 files changed, 135 insertions(+), 135 deletions(-) diff --git a/appendix/stata-guide.tex b/appendix/stata-guide.tex index b0109a86f..526fbadb7 100644 --- a/appendix/stata-guide.tex +++ b/appendix/stata-guide.tex @@ -44,7 +44,7 @@ \section{Using the code examples in this book} We use GitHub to version control everything in this book, the code included. To see the code on GitHub, go to: \url{https://github.com/worldbank/d4di/tree/master/code}. If you are familiar with GitHub you can fork the repository and clone your fork. -We only use Stata's built-in data sets in our code examples, +We only use Stata's built-in datasets in our code examples, so you do not need to download any data. 
If you have Stata installed on your computer, then you will already have the data files used in the code. @@ -222,7 +222,7 @@ \subsection{Abbreviating variables} Using wildcards and lists in Stata for variable lists (\texttt{*}, \texttt{?}, and \texttt{-}) is also discouraged, because the functionality of the code may change -if the data set is changed or even simply reordered. +if the dataset is changed or even simply reordered. If you intend explicitly to capture all variables of a certain type, prefer \texttt{unab} or \texttt{lookfor} to build that list in a local macro, which can then be checked to have the right variables in the right order. @@ -417,19 +417,19 @@ \subsection{Using boilerplate code} \subsection{Saving data} -There are good practices that should be followed before saving any data set. -These are to \texttt{sort} and \texttt{order} the data set, +There are good practices that should be followed before saving any dataset. +These are to \texttt{sort} and \texttt{order} the dataset, dropping intermediate variables that are not needed, -and compressing the data set to save disk space and network bandwidth. +and compressing the dataset to save disk space and network bandwidth. If there is a unique ID variable or a set of ID variables, the code should test that they are uniqueally and -fully identifying the data set.\sidenote{ +fully identifying the dataset.\sidenote{ \url{https://dimewiki.worldbank.org/ID\_Variable\_Properties}} ID variables are also perfect variables to sort on, -and to \texttt{order} first in the data set. +and to \texttt{order} first in the dataset. -The command \texttt{compress} makes the data set smaller in terms of memory usage +The command \texttt{compress} makes the dataset smaller in terms of memory usage without ever losing any information. It optimizes the storage types for all variables and therefore makes it smaller on your computer diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index ea11f4b4e..95ac67fe9 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -90,19 +90,19 @@ \subsection{Organizing your folder structure} \subsection{Breaking down tasks} -We divide the process of transforming raw data sets to analysis-ready data sets into four steps: +We divide the process of transforming raw datasets to analysis-ready datasets into four steps: de-identification, data cleaning, variable construction, and data analysis. Though they are frequently implemented concurrently, -creating separate scripts and data sets prevents mistakes. +creating separate scripts and datasets prevents mistakes. It will be easier to understand this division as we discuss what each stage comprises. What you should know for now is that each of these stages has well-defined inputs and outputs. This makes it easier to track tasks across scripts, and avoids duplication of code that could lead to inconsistent results. -For each stage, there should be a code folder and a corresponding data set. -The names of codes, data sets and outputs for each stage should be consistent, +For each stage, there should be a code folder and a corresponding dataset. +The names of codes, datasets and outputs for each stage should be consistent, making clear how they relate to one another. So, for example, a script called \texttt{section-1-cleaning} would create -a data set called \texttt{section-1-clean}. +a dataset called \texttt{section-1-clean}. The division of a project in stages also facilitates a review workflow inside your team. 
The code, data and outputs of each of these stages should go through at least one round of code review, @@ -141,17 +141,17 @@ \subsection{Implementing version control} you can output plain text files such as \texttt{.tex} tables and metadata saved in \texttt{.txt} or \texttt{.csv} to that directory. Binary files that compile the tables, -as well as the complete data sets, on the other hand, +as well as the complete datasets, on the other hand, should be stored in your team's shared folder. Whenever data cleaning or data construction codes are edited, use the master script to run all the code for your project. -Git will highlight the changes that were in data sets and results that they entail. +Git will highlight the changes that were in datasets and results that they entail. %------------------------------------------------ \section{De-identifying research data} -The starting point for all tasks described in this chapter is the raw data set, +The starting point for all tasks described in this chapter is the raw dataset, which should contain the exact data received, with no changes or additions. The raw data will invariably come in a variety of file formats and these files should be saved in the raw data folder \textit{exactly as they were @@ -161,7 +161,7 @@ \section{De-identifying research data} As described in the previous chapter, confidential data must always be encrypted\sidenote{\url{https://dimewiki.worldbank.org/Encryption}} and be properly backed up since every other data file you will use is created from the -raw data. The only data sets that can not be re-created are the raw data +raw data. The only datasets that can not be re-created are the raw data themselves. The raw data files should never be edited directly. This is true even in the @@ -173,10 +173,10 @@ \section{De-identifying research data} You will only keep working from the fixed copy, but you keep both copies in case you later realize that the manual fix was done incorrectly. -The first step in the transformation of raw data to an analysis-ready data set is de-identification. -This simplifies workflows, as once you create a de-identified version of the data set, +The first step in the transformation of raw data to an analysis-ready dataset is de-identification. +This simplifies workflows, as once you create a de-identified version of the dataset, you no longer need to interact directly with the encrypted raw data. -at this stage, means stripping the data set of personally identifying information.\sidenote{ +at this stage, means stripping the dataset of personally identifying information.\sidenote{ \url{https://dimewiki.worldbank.org/De-identification}} To do so, you will need to identify all variables that contain identifying information.\sidenote{\url{ @@ -194,7 +194,7 @@ \section{De-identifying research data} as well as allowing for more sophisticated disclosure risk calculations.\sidenote{ \url{https://sdctools.github.io/sdcMicro/articles/sdcMicro.html}} The \texttt{iefieldkit} command \texttt{iecodebook} -lists all variables in a data set and exports an Excel sheet +lists all variables in a dataset and exports an Excel sheet where you can easily select which variables to keep or drop.\sidenote{ \url{https://dimewiki.worldbank.org/Iecodebook}} @@ -225,7 +225,7 @@ \section{De-identifying research data} the data analysis process. The resulting de-identified data will be the underlying source for all cleaned and constructed data. 
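A bare-bones sketch of that initial de-identification step is below; the file paths and variable names are hypothetical, and in practice the \texttt{iecodebook} workflow described above is a more systematic way to select which variables to drop.

* Sketch: strip direct identifiers and save a working de-identified copy
use "${data}/raw/survey-raw.dta", clear                      // encrypted raw data
drop respondent_name phone_number gps_latitude gps_longitude // direct identifiers
save "${data}/intermediate/survey-deidentified.dta", replace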
-This is the data set that you will interact with directly during the remaining tasks described in this chapter. +This is the dataset that you will interact with directly during the remaining tasks described in this chapter. Because identifying information is typically only used during data collection, when teams need to find and confirm the identity of interviewees, de-identification should not affect the usability of the data. @@ -233,16 +233,16 @@ \section{De-identifying research data} \section{Cleaning data for analysis} Data cleaning is the second stage in the transformation of raw data into data that you can analyze. -The cleaning process involves (1) making the data set easy to use and understand, +The cleaning process involves (1) making the dataset easy to use and understand, and (2) documenting individual data points and patterns that may bias the analysis. The underlying data structure does not change. -The cleaned data set should contain only the variables collected in the field. +The cleaned dataset should contain only the variables collected in the field. No modifications to data points are made at this stage, except for corrections of mistaken entries. Cleaning is probably the most time-consuming of the stages discussed in this chapter. You need to acquire an extensive understanding of the contents and structure of the raw data. -Explore the data set using tabulations, summaries, and descriptive plots. -Knowing your data set well will make it possible to do analysis. +Explore the dataset using tabulations, summaries, and descriptive plots. +Knowing your dataset well will make it possible to do analysis. \subsection{Identifying the identifier} @@ -260,8 +260,8 @@ \subsection{Identifying the identifier} Note that while modern survey tools create unique identifiers for each submitted data record, that is not the same as having a unique ID variable for each individual in the sample. -You want to make sure the data set has a unique ID variable -that can be cross-referenced with other records, such as the master data set\sidenote{\url{https://dimewiki.worldbank.org/Master\_Data\_Set}} +You want to make sure the dataset has a unique ID variable +that can be cross-referenced with other records, such as the master dataset\sidenote{\url{https://dimewiki.worldbank.org/Master\_Data\_Set}} and other rounds of data collection. \texttt{ieduplicates} and \texttt{iecompdup}, two Stata commands included in the \texttt{iefieldkit} @@ -281,7 +281,7 @@ \subsection{Labeling, annotating, and finalizing clean data} \index{iecodebook} First, \textbf{renaming}: for data with an accompanying survey instrument, -it is useful to keep the same variable names in the cleaned data set as in the survey instrument. +it is useful to keep the same variable names in the cleaned dataset as in the survey instrument. That way it's straightforward to link variables to the relevant survey question. Second, \textbf{labeling}: applying labels makes it easier to understand your data as you explore it, and thus reduces the risk of small errors making their way through into the analysis stage. @@ -295,46 +295,46 @@ \subsection{Labeling, annotating, and finalizing clean data} Open-ended responses stored as strings usually have a high risk of being identifiers, so they should be encoded into categories as much as possible and raw data points dropped. You can use the encrypted data as an input to a construction script -that categorizes these responses and merges them to the rest of the data set. 
+that categorizes these responses and merges them to the rest of the dataset. -\subsection{Preparing a clean data set} -The main output of data cleaning is the cleaned data set. -It should contain the same information as the raw data set, +\subsection{Preparing a clean dataset} +The main output of data cleaning is the cleaned dataset. +It should contain the same information as the raw dataset, with identifying variables and data entry mistakes removed. Although original data typically requires more extensive data cleaning than secondary data, you should carefully explore possible issues in any data you are about to use. When reviewing raw data, you will inevitably encounter data entry mistakes, such as typos and inconsistent values. -These mistakes should be fixed in the cleaned data set, +These mistakes should be fixed in the cleaned dataset, and you should keep a careful record of how they were identified, and how the correct value was obtained.\sidenote{\url{ https://dimewiki.worldbank.org/Data\_Cleaning}} -The cleaned data set should always be accompanied by a dictionary or codebook. +The cleaned dataset should always be accompanied by a dictionary or codebook. Survey data should be easily traced back to the survey instrument. -Typically, one cleaned data set will be created for each data source -or survey instrument; and each row in the cleaned data set represents one +Typically, one cleaned dataset will be created for each data source +or survey instrument; and each row in the cleaned dataset represents one respondent or unit of observation.\cite{tidy-data} -If the raw data set is very large, or the survey instrument is very complex, +If the raw dataset is very large, or the survey instrument is very complex, you may want to break the data cleaning into sub-steps, -and create intermediate cleaned data sets +and create intermediate cleaned datasets (for example, one per survey module). When dealing with complex surveys with multiple nested groups, -is is also useful to have each cleaned data set at the smallest unit of observation inside a roster. +is is also useful to have each cleaned dataset at the smallest unit of observation inside a roster. This will make the cleaning faster and the data easier to handle during construction. -But having a single cleaned data set will help you with sharing and publishing the data. +But having a single cleaned dataset will help you with sharing and publishing the data. Finally, any additional information collected only for quality monitoring purposes, such as notes and duration fields, can also be dropped. -To make sure the cleaned data set file doesn't get too big to be handled, +To make sure the cleaned dataset file doesn't get too big to be handled, use commands such as \texttt{compress} in Stata to make sure the data is always stored in the most efficient format. -Once you have a cleaned, de-identified data set and the documentation to support it, -you have created the first data output of your project: a publishable data set. +Once you have a cleaned, de-identified dataset and the documentation to support it, +you have created the first data output of your project: a publishable dataset. The next chapter will get into the details of data publication. -For now, all you need to know is that your team should consider submitting this data set for publication, +For now, all you need to know is that your team should consider submitting this dataset for publication, even if it will remain embargoed for some time. 
This will help you organize your files and create a backup of the data, and some donors require that the data be filed as an intermediate step of the project. @@ -362,7 +362,7 @@ \subsection{Documenting data cleaning} or that you intend to release as part of a replication package or data publication. Another important component of data cleaning documentation are the results of data exploration. -As clean your data set, take the time to explore the variables in it. +As clean your dataset, take the time to explore the variables in it. Use tabulations, summary statistics, histograms and density plots to understand the structure of data, and look for potentially problematic patterns such as outliers, missing values and distributions that may be caused by data entry errors. @@ -388,25 +388,25 @@ \section{Constructing final indicators} You need to manipulate them into something that has \textit{economic} meaning, such as caloric input or food expenditure per adult equivalent. During this process, the data points will typically be reshaped and aggregated -so that level of the data set goes from the unit of observation +so that level of the dataset goes from the unit of observation (one item in the bundle) in the survey to the unit of analysis (the household).\sidenote{ \url{https://dimewiki.worldbank.org/Unit\_of\_Observation}} -A constructed data set is built to answer an analysis question. +A constructed dataset is built to answer an analysis question. Since different pieces of analysis may require different samples, or even different units of observation, -you may have one or multiple constructed data sets, +you may have one or multiple constructed datasets, depending on how your analysis is structured. -Don't worry if you cannot create a single, ``canonical'' analysis data set. -It is common to have many purpose-built analysis data sets. +Don't worry if you cannot create a single, ``canonical'' analysis dataset. +It is common to have many purpose-built analysis datasets. Think of an agricultural intervention that was randomized across villages and only affected certain plots within each village. The research team may want to run household-level regressions on income, test for plot-level productivity gains, and check if village characteristics are balanced. -Having three separate data sets for each of these three pieces of analysis -will result in much cleaner do-files than if they all started from the same data set. +Having three separate datasets for each of these three pieces of analysis +will result in much cleaner do-files than if they all started from the same dataset. % From cleaning Construction is done separately from data cleaning for two reasons. @@ -433,7 +433,7 @@ \section{Constructing final indicators} as well as subsets and other alterations to the data. Even if construction and analysis are done concurrently, you should ways do the two in separate scripts. -If every script that creates a table starts by loading a data set, +If every script that creates a table starts by loading a dataset, subsetting it, and manipulating variables, any edits to construction need to be replicated in all scripts. This increases the chances that at least one of them will have a different sample or variable definition. @@ -441,10 +441,10 @@ \section{Constructing final indicators} avoid this and ensure consistency across different outputs. 
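The sketch below illustrates that separation with hypothetical file, variable, and treatment names: the derived variable is created and saved once in a construction script, and the analysis script only loads the constructed dataset.

* Sketch: construction script (run once, saves the constructed dataset)
use "${data}/section-1-clean.dta", clear
generate total_exp = food_exp + nonfood_exp
label variable total_exp "Total household expenditure"
save "${data}/section-1-constructed.dta", replace

* Sketch: analysis script (only loads the constructed dataset)
use "${data}/section-1-constructed.dta", clear
regress total_exp treatment, vce(cluster village_id)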
\subsection{Constructing analytical variables} -New variables created during the construction stage should be added to the data set, instead of overwriting the original information. +New variables created during the construction stage should be added to the dataset, instead of overwriting the original information. New variables should be assigned functional names. -Ordering the data set so that related variables are together, -and adding notes to each of them as necessary will make your data set more user-friendly. +Ordering the dataset so that related variables are together, +and adding notes to each of them as necessary will make your dataset more user-friendly. Before constructing new variables, you must check and double-check the value-assignments of questions, @@ -464,26 +464,26 @@ \subsection{Constructing analytical variables} You cannot add one hectare and two acres and get a meaningful number. During construction, you will also need to address some of the issues -you identified in the data set as you were cleaning it. +you identified in the dataset as you were cleaning it. The most common of them is the presence of outliers. How to treat outliers is a question for the research team (as there are multiple possible approaches), but make sure to note what decision was made and why. Results can be sensitive to the treatment of outliers, -so keeping the original variable in the data set will allow you to test how much it affects the estimates. +so keeping the original variable in the dataset will allow you to test how much it affects the estimates. These points also apply to imputation of missing values and other distributional patterns. The more complex construction tasks involve changing the structure of the data: -adding new observations or variables by merging data sets, +adding new observations or variables by merging datasets, and changing the unit of observation through collapses or reshapes. There are always ways for things to go wrong that we never anticipated, but two issues to pay extra attention to are missing values and dropped observations. -Merging, reshaping and aggregating data sets can change both the total number of observations +Merging, reshaping and aggregating datasets can change both the total number of observations and the number of observations with missing values. Make sure to read about how each command treats missing observations and, whenever possible, add automated checks in the script that throw an error message if the result is changing. If you are subsetting your data, drop observations explicitly, -indicating why you are doing that and how the data set changed. +indicating why you are doing that and how the dataset changed. Finally, creating a panel with survey data involves additional timing complexities. It is common to construct indicators soon after receiving data from a new survey round. @@ -491,10 +491,10 @@ \subsection{Constructing analytical variables} Having a well-established definition for each constructed variable helps prevent that mistake, but the best way to guarantee it won't happen is to create the indicators for all rounds in the same script. Say you constructed variables after baseline, and are now receiving midline data. -Then the first thing you should do is create a cleaned panel data set, +Then the first thing you should do is create a cleaned panel dataset, ignoring the previous constructed version of the baseline data. The \texttt{iecodebook append} subcommand will help you reconcile and append the cleaned survey rounds. 
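A plain-Stata sketch of that appending step is below, with hypothetical file and ID names; \texttt{iecodebook append} additionally reconciles variable names and labels across rounds, which this simplified version does not.

* Sketch: stack two cleaned rounds into a panel and verify the panel ID
use "${data}/baseline-clean.dta", clear
generate round = 0
append using "${data}/midline-clean.dta"
replace round = 1 if missing(round)          // observations added by the append
isid household_id round                      // errors out if the panel ID is not unique
save "${data}/panel-clean.dta", replace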
-After that, adapt a single variable construction script so it can be used on the panel data set as a whole. +After that, adapt a single variable construction script so it can be used on the panel dataset as a whole. In addition to preventing inconsistencies, this process will also save you time and give you an opportunity to review your original code. @@ -510,10 +510,10 @@ \subsection{Documenting variable construction} This can be part of a wider discussion with your team about creating protocols for variable definition, which will guarantee that indicators are defined consistently across projects. When all your final variables have been created, -you can use the \texttt{iecodebook export} subcommand to list all variables in the data set, +you can use the \texttt{iecodebook export} subcommand to list all variables in the dataset, and complement it with the variable definitions you wrote during construction to create a concise metadata document. Documentation is an output of construction as relevant as the code and the data. -Someone unfamiliar with the project should be able to understand the contents of the analysis data sets, +Someone unfamiliar with the project should be able to understand the contents of the analysis datasets, the steps taken to create them, and the decision-making process through your documentation. The construction documentation will complement the reports and notes created during data cleaning. @@ -542,7 +542,7 @@ \subsection{Organizing analysis code} The way you deal with code and outputs for exploratory and final analysis is different. During exploratory data analysis, you will be tempted to write lots of analysis into one big, impressive, start-to-finish script. -It subtly encourages poor practices such as not clearing the workspace and not reloading the constructed data set before each analysis task. +It subtly encourages poor practices such as not clearing the workspace and not reloading the constructed dataset before each analysis task. To avoid mistakes, it's important to take the time to organize the code that you want to use again in a clean manner. @@ -562,7 +562,7 @@ \subsection{Organizing analysis code} All research questions and statistical decisions should be very explicit in the code, and should be very easy to detect from the way the code is written. This includes clustering, sampling, and control variables, to name a few. -If you have multiple analysis data sets, +If you have multiple analysis datasets, each of them should have a descriptive name about its sample and unit of observation. As your team comes to a decision about model specification, you can create globals or objects in the master script to use across scripts. @@ -574,7 +574,7 @@ \subsection{Organizing analysis code} To accomplish this, you will need to make sure that you have an effective data management system, including naming, file organization, and version control. -Just like you did with each of the analysis data sets, +Just like you did with each of the analysis datasets, name each of the individual analysis files descriptively. Code files such as \path{spatial-diff-in-diff.do}, \path{matching-villages.R}, and \path{summary-statistics.py} @@ -619,7 +619,7 @@ \subsection{Visualizing data} This is why we created the \textbf{Stata Visual Library}\sidenote{ \url{https://worldbank.github.io/Stata-IE-Visual-Library}}, which has examples of graphs created in Stata and curated by us.\sidenote{A similar resource for R is \textit{The R Graph Gallery}. 
\\\url{https://www.r-graph-gallery.com}} -The Stata Visual Library includes example data sets to use with each do-file, +The Stata Visual Library includes example datasets to use with each do-file, so you get a good sense of what your data should look like before you can start writing code to create a visualization. @@ -695,7 +695,7 @@ \subsection{Exporting analysis outputs} most of the data work involved in the last step of the research process -- publication -- will already be done. If you used de-identified data for analysis, -publishing the cleaned data set in a trusted repository will allow you to cite your data. +publishing the cleaned dataset in a trusted repository will allow you to cite your data. Some of the documentation produced during cleaning and construction can be published even if the data cannot due to confidentiality. Your analysis code will be organized in a reproducible way, diff --git a/chapters/data-collection.tex b/chapters/data-collection.tex index 33b6a61a5..5fac43875 100644 --- a/chapters/data-collection.tex +++ b/chapters/data-collection.tex @@ -12,7 +12,7 @@ assured that your data has been obtained at high standards of both quality and security. The chapter begins with a discussion of some key ethical and legal descriptions -to ensure that you have the right to do research using a specific data set. +to ensure that you have the right to do research using a specific dataset. Particularly when confidential data is collected by you and your team or shared with you by a program implementer, government, or other partner, you need to make sure permissions are correctly granted and documented. @@ -49,7 +49,7 @@ \section{Acquiring data} private sector partnerships granting access to new data sources, including administrative and sensor data; digitization of paper records, including administrative data; primary data capture by unmanned aerial vehicles or other types of remote sensing; -or novel integration of various types of data sets, e.g. combining survey and sensor data. +or novel integration of various types of datasets, e.g. combining survey and sensor data. Except in the case of primary surveys funded by the research team, the data is typically not owned by the research team. Data ownership and licensing agreements are required @@ -91,10 +91,10 @@ \subsection{Data ownership} \subsection{Data licensing agreements} Data licensing is the formal act of giving some data rights to others -while retaining ownership of a particular data set. -If you are not the owner of the data set you want to analyze, +while retaining ownership of a particular dataset. +If you are not the owner of the dataset you want to analyze, you should enter into a licensing or terms-of-use agreement to access it for research purposes. -Similarly, when you own a data set, +Similarly, when you own a dataset, you must consider whether the data can be made accessible to other researchers, and what terms-of-use you require. @@ -117,7 +117,7 @@ \subsection{Data licensing agreements} a license for all uses of derivative works, including public distribution (unless ethical considerations contraindicate this). This is important to allow the research team to store, catalog, and publish, in whole or in part, -either the original licensed data set or data sets derived from the original. +either the original licensed dataset or datasets derived from the original. 
Make sure that the license you obtain from the data owner allows these uses, and that you consult with the owner if you foresee exceptions with specific portions of the data. @@ -166,7 +166,7 @@ \subsection{Receiving data from development partners} Another important consideration at this stage is proper documentation and cataloging of data and associated metadata. -It is not always clear what pieces of information jointly constitute a ``data set'', +It is not always clear what pieces of information jointly constitute a ``dataset'', and many of the sources you receive data from will not be organized for research. To help you keep organized and to put some structure on the materials you will be receiving, you should always retain the original data as received @@ -183,13 +183,13 @@ \subsection{Receiving data from development partners} what they measure, and how they are to be used. In the case of survey data, this includes the survey instrument and associated manuals; the sampling protocols and field adherence to those protocols, and any sampling weights; -what variable(s) uniquely identify the data set(s), and how different data sets can be linked; +what variable(s) uniquely identify the dataset(s), and how different datasets can be linked; and a description of field procedures and quality controls. We use as a standard the Data Documentation Initiative (DDI), which is supported by the World Bank's Microdata Catalog.\sidenote{\url{https://microdata.worldbank.org}} As soon as the requisite pieces of information are stored together, -think about which ones are the components of what you would call a data set. +think about which ones are the components of what you would call a dataset. This is more of an art than a science: you want to keep things together that belong together, but you also want to keep things apart that belong apart. @@ -200,7 +200,7 @@ \subsection{Receiving data from development partners} as you move towards the publication part of the research process. This may require you to re-check with the provider about what portions are acceptable to license, -particularly if you are combining various data sets +particularly if you are combining various datasets that may provide even more information about specific individuals. %------------------------------------------------ @@ -416,7 +416,7 @@ \subsection{Programming electronic questionnaires} All survey softwares include debugging and test options to correct syntax errors and make sure that the survey instruments will successfully compile. -This is not sufficient, however, to ensure that the resulting data set +This is not sufficient, however, to ensure that the resulting dataset will load without errors in your data analysis software of choice. We developed the \texttt{ietestform} command,\sidenote{ \url{https://dimewiki.worldbank.org/ietestform}} @@ -437,7 +437,7 @@ \subsection{Programming electronic questionnaires} A second survey pilot should be done after the questionnaire is programmed. The objective of this \textbf{data-focused pilot}\sidenote{ \url{https://dimewiki.worldbank.org/index.php?title=Checklist:_Refine_the_Questionnaire_(Data)}} -is to validate the programming and export a sample data set. +is to validate the programming and export a sample dataset. Significant desk-testing of the instrument is required to debug the programming as fully as possible before going to the field. 
It is important to plan for multiple days of piloting, @@ -452,7 +452,7 @@ \section{Data quality assurance} it is important to make sure that data faithfully reflects ground realities. Data quality assurance requires a combination of real-time data checks and back-checks or validation audits, which often means tracking down -the people whose information is in the data set. +the people whose information is in the dataset. \subsection{Implementing high frequency quality checks} @@ -549,7 +549,7 @@ \subsection{Conducting back-checks and data validation} For back-checks and validation audits, a random subset of the main data is selected, and a subset of information from the full survey is verified through a brief targeted survey with the original respondent -or a cross-referenced data set from another source (if the original data is not a field survey). +or a cross-referenced dataset from another source (if the original data is not a field survey). Design of the back-checks or validations follows the same survey design principles discussed above: you should use the analysis plan or list of key outcomes to establish which subset of variables to prioritize, @@ -572,7 +572,7 @@ \subsection{Conducting back-checks and data validation} \subsection{Finalizing data collection} When all data collection is complete, the survey team should prepare a final field report, -which should report reasons for any deviations between the original sample and the data set collected. +which should report reasons for any deviations between the original sample and the dataset collected. Identification and reporting of \textbf{missing data} and \textbf{attrition} is critical to the interpretation of survey data. It is important to structure this reporting in a way that not only @@ -580,8 +580,8 @@ \subsection{Finalizing data collection} but also collects all the detailed, open-ended responses to questions the field team can provide for any observations that they were unable to complete. This reporting should be validated and saved alongside the final raw data, and treated the same way. -This information should be stored as a data set in its own right --- a \textbf{tracking data set} -- that records all events in which survey substitutions +This information should be stored as a dataset in its own right +-- a \textbf{tracking dataset} -- that records all events in which survey substitutions and attrition occurred in the field and how they were implemented and resolved. %------------------------------------------------ @@ -750,7 +750,7 @@ \subsection{Storing data securely} \subsection{Sharing data securely} You and your team will use your first copy of the raw data as the starting point for data cleaning and analysis of the data. -This raw data set must remain encrypted at all times if it includes confidential data, +This raw dataset must remain encrypted at all times if it includes confidential data, which is almost always the case. As long as the data is properly encrypted, it can be shared using insecure modes of communication @@ -780,7 +780,7 @@ \subsection{Sharing data securely} A common example where this is often forgotten involves sharing survey information, such as sampling lists, with a field partner. This data is -- by all definitions -- also PII data and must be encrypted. 
-A sampling list can often be used to reverse identify a de-identified data set, +A sampling list can often be used to reverse identify a de-identified dataset, so if you were to share it using an insecure method, then that would be your weakest link that could render useless all the other steps you have taken to ensure the privacy of the respondents. @@ -794,7 +794,7 @@ \subsection{Sharing data securely} Remember that you must always share passwords and keys in a secure way like password managers. At this point, the raw data securely stored and backed up. -It can now be transformed into your final analysis data set, +It can now be transformed into your final analysis dataset, through the steps described in the next chapter. Once the data collection is over, you typically will no longer need to interact with the identified data. diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex index 6cf11690f..5bc4ccd06 100644 --- a/chapters/handling-data.tex +++ b/chapters/handling-data.tex @@ -100,7 +100,7 @@ \subsection{Research reproducibility} producing these kinds of resources can lead to that as well. Therefore, your code should be written neatly with clear instructions and published openly. It should be easy to read and understand in terms of structure, style, and syntax. -Finally, the corresponding data set should be openly accessible +Finally, the corresponding dataset should be openly accessible unless for legal or ethical reasons it cannot be.\sidenote{ \url{https://dimewiki.worldbank.org/Publishing_Data}} @@ -353,7 +353,7 @@ \subsection{Transmitting and storing data securely} inside that secure environment. However, password-protection alone is not sufficient, because if the underlying data is obtained through a leak the information itself remains usable. -Data sets that include confidential information +datasets that include confidential information \textit{must} therefore be \textbf{encrypted}\sidenote{ \textbf{Encryption:} Methods which ensure that files are unreadable even if laptops are stolen, databases are hacked, or any other type of unauthorized access is obtained. \url{https://dimewiki.worldbank.org/Encryption}} @@ -424,7 +424,7 @@ \subsection{De-identifying data} You can take simple steps to avoid risks by minimizing the handling of PII. First, only collect information that is strictly needed for the research. Second, avoid the proliferation of copies of identified data. -There should never be more than one copy of the raw identified data set in the project folder, +There should never be more than one copy of the raw identified dataset in the project folder, and it must always be encrypted. Even within the research team, access to PII data should be limited to team members who require it for specific analysis @@ -435,7 +435,7 @@ \subsection{De-identifying data} Therefore, once data is securely collected and stored, the first thing you will generally do is \textbf{de-identify} it, -that is, remove direct identifiers of the individuals in the data set.\sidenote{ +that is, remove direct identifiers of the individuals in the dataset.\sidenote{ \url{https://dimewiki.worldbank.org/De-identification}} \index{de-identification} Note, however, that it is in practice impossible to \textbf{anonymize} data. @@ -446,13 +446,13 @@ \subsection{De-identifying data} For this reason, we recommend de-identification in two stages. 
The \textbf{initial de-identification} process strips the data of direct identifiers as early in the process as possible, -to create a working de-identified data set that +to create a working de-identified dataset that can be shared \textit{within the research team} without the need for encryption. This simplifies workflows. The \textbf{final de-identification} process involves making a decision about the trade-off between risk of disclosure and utility of the data -before publicly releasing a data set.\sidenote{ +before publicly releasing a dataset.\sidenote{ \url{https://sdcpractice.readthedocs.io/en/latest/SDC\_intro.html\#need-for-sdc}} We will provide more detail about the process and tools available for initial and final de-identification in Chapters 6 and 7, respectively. diff --git a/chapters/introduction.tex b/chapters/introduction.tex index a8841dacb..4b34244d3 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -179,7 +179,7 @@ \section{Writing reproducible code in a collaborative environment} Do I believe this number? What can go wrong in my code? How will missing values be treated in this command? -What would happen if more observations would be added to the data set? +What would happen if more observations would be added to the dataset? Can my code be made more efficient or easier to understand? \subsection{Code examples} diff --git a/chapters/planning-data-work.tex b/chapters/planning-data-work.tex index e24e79651..60f0a18b4 100644 --- a/chapters/planning-data-work.tex +++ b/chapters/planning-data-work.tex @@ -6,7 +6,7 @@ and the collaboration platforms and processes for your team. In order to be prepared to work on the data you receive with a group, you need to structure your workflow in advance. -This means knowing which data sets and outputs you need at the end of the process, +This means knowing which datasets and outputs you need at the end of the process, how they will stay organized, what types of data you'll acquire, and whether the data will require special handling due to size or privacy considerations. Identifying these details will help you map out the data needs for your project, @@ -219,8 +219,8 @@ \subsection{Choosing software} Take into account the different levels of techiness of team members, how important it is to access files offline constantly, as well as the type of data you will need to access and the security needed. -Big data sets require additional infrastructure and may overburden -the traditional tools used for small data sets, +Big datasets require additional infrastructure and may overburden +the traditional tools used for small datasets, particularly if you are trying to sync or collaborate on them. Also consider the cost of licenses, the time to learn new tools, and the stability of the tools. @@ -246,7 +246,7 @@ \subsection{Choosing software} Next, think about how and where you write and execute code. This book is intended to be agnostic to the size or origin of your data, -but we are going to broadly assume that you are using desktop-sized data sets +but we are going to broadly assume that you are using desktop-sized datasets in one of the two most popular desktop-based packages: R or Stata. (If you are using another language, like Python, or working with big data projects on a server installation, @@ -347,7 +347,7 @@ \subsection{Organizing files and folder structures} and for the files that manage final analytical work. 
The command has some flexibility for the addition of folders for other types of data sources, although this is less well developed -as the needs for larger data sets tend to be very specific. +as the needs for larger datasets tend to be very specific. The \texttt{ietoolkit} package also includes the \texttt{iegitaddmd} command, which can place \texttt{README.md} placeholder files in your folders so that your folder structure can be shared using Git. Since these placeholder files are @@ -366,7 +366,7 @@ \subsection{Organizing files and folder structures} be stored in a synced folder that is shared with other people. Those two types of collaboration tools function very differently and will almost always create undesired functionality if combined.) -Nearly all code files and raw outputs (not data sets) are best managed this way. +Nearly all code files and raw outputs (not datasets) are best managed this way. This is because code files are always \textbf{plaintext} files, and non-code-compatiable files are usually \textbf{binary} files.\index{plaintext}\index{binary files} It's also becoming more and more common for written outputs such as reports, @@ -416,7 +416,7 @@ \subsection{Organizing files and folder structures} % ---------------------------------------------------------------------------------------------- \subsection{Documenting and organizing code} Once you start a project's data work, -the number of scripts, data sets, and outputs that you have to manage will grow very quickly. +the number of scripts, datasets, and outputs that you have to manage will grow very quickly. This can get out of hand just as quickly, so it's important to organize your data work and follow best practices from the beginning. Adjustments will always be needed along the way, @@ -445,7 +445,7 @@ \subsection{Documenting and organizing code} Otherwise, you should include it in the header. You should always track the inputs and outputs of the script, as well as the uniquely identifying variable; refer to lines 49-51 in the example do-file. -When you are trying to track down which code creates which data set, this will be very helpful. +When you are trying to track down which code creates which dataset, this will be very helpful. While there are other ways to document decisions related to creating code, the information that is relevant to understand the code should always be written in the code file. @@ -517,7 +517,7 @@ \subsection{Working with a master script} This may break another script that refers to this variable. But unless you run both of them when the change is made, it may take time for that to happen, and when it does, it may take time for you to understand what's causing an error. -The same applies to changes in data sets and results. +The same applies to changes in datasets and results. To link code, data and outputs, the master script reflects the structure of the \texttt{DataWork} folder in code @@ -599,7 +599,7 @@ \subsection{Managing outputs} Document output creation in the master script that runs these files, so that before the line that runs a particular analysis script there are a few lines of comments listing -data sets and functions that are necessary for it to run, +datasets and functions that are necessary for it to run, as well as all outputs created by that script. 
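A hypothetical master-script entry following that convention might look like the sketch below; the globals, paths, and file names are placeholders.

* REQUIRES: ${data}/analysis/hh-constructed.dta
* CREATES:  ${outputs}/tables/summary-statistics.tex
*           ${outputs}/figures/income-distribution.png
do "${code}/analysis/summary-statistics.do"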
% What software to use diff --git a/chapters/publication.tex b/chapters/publication.tex index c4b523cde..a281a3df8 100644 --- a/chapters/publication.tex +++ b/chapters/publication.tex @@ -267,13 +267,13 @@ \section{Preparing a complete replication package} and the replication file should not include any documentation or data you would not share publicly. This usually means removing project-related documentation such as contracts and details of data collection and other field work, -and double-checking all data sets for potentially identifying information. +and double-checking all datasets for potentially identifying information. \subsection{Publishing data for replication} Publicly documenting all original data generated as part of a research project is an important contribution in its own right. -Publishing original data sets is a significant contribution that can be made +Publishing original datasets is a significant contribution that can be made in addition to any publication of analysis results.\sidenote{ \url{https://www.povertyactionlab.org/sites/default/files/resources/J-PAL-guide-to-publishing-research-data.pdf}} If you are not able to publish the data itself, @@ -282,10 +282,10 @@ \subsection{Publishing data for replication} These may take the form of metadata catalogs or embargoed releases. Such setups allow you to hold an archival version of your data which your publication can reference, -and provide information about the contents of the data sets +and provide information about the contents of the datasets and how future users might request permission to access them (even if you are not the person to grant that permission). -They can also provide for timed future releases of data sets +They can also provide for timed future releases of datasets once the need for exclusive access has ended. Publishing data allows other researchers to validate the mechanical construction of your results, @@ -311,13 +311,13 @@ \subsection{Publishing data for replication} is especially relevant for impact evaluations. Both the World Bank Microdata Catalog and the Harvard Dataverse create data citations for deposited entries. -DIME has its own collection of data sets in the Microdata Catalog, +DIME has its own collection of datasets in the Microdata Catalog, where data from our projects is published.\sidenote{\url{ https://microdata.worldbank.org/catalog/dime}} When your raw data is owned by someone else, or for any other reason you are not able to publish it, -in many cases you will still have the right to release derivate data sets, +in many cases you will still have the right to release derivate datasets, even if it is just the indicators you constructed and their documentation.\sidenote{ \url{https://guide-for-data-archivists.readthedocs.io}} If you have questions about your rights over original or derived materials, @@ -331,7 +331,7 @@ \subsection{Publishing data for replication} \url{https://microdata.worldbank.org/index.php/terms-of-use}} Open Access data is freely available to anyone, and simply requires attribution. Direct Access data is to registered users who agree to use the data for statistical and scientific research purposes only, -to cite the data appropriately, and to not attempt to identify respondents or data providers or link to other data sets that could allow for re-identification. +to cite the data appropriately, and to not attempt to identify respondents or data providers or link to other datasets that could allow for re-identification. 
Licensed access data is restricted to bona fide users, who submit a documented application for how they will use the data and sign an agreement governing data use. The user must be acting on behalf of an organization, which will be held responsible in the case of any misconduct. Keep in mind that you may or may not own your data, @@ -340,9 +340,9 @@ \subsection{Publishing data for replication} is at the time that data collection or sharing agreements are signed. Published data should be released in a widely recognized format. -While software-specific data sets are acceptable accompaniments to the code +While software-specific datasets are acceptable accompaniments to the code (since those precise materials are probably necessary), -you should also consider releasing generic data sets +you should also consider releasing generic datasets such as CSV files with accompanying codebooks, since these can be used by any researcher. Additionally, you should also release @@ -351,7 +351,7 @@ \subsection{Publishing data for replication} collected directly in the field and which are derived. If possible, you should publish both a clean version of the data which corresponds exactly to the original database or questionnaire -as well as the constructed or derived data set used for analysis. +as well as the constructed or derived dataset used for analysis. You should also release the code that constructs any derived measures, particularly where definitions may vary, @@ -360,7 +360,7 @@ \subsection{Publishing data for replication} \subsection{De-identifying data for publication} Before publishing data, you should carefully perform a \textbf{final de-identification}. -Its objective is to reduce the risk of disclosing confidential information in the published data set. +Its objective is to reduce the risk of disclosing confidential information in the published dataset. that cannot be manipulated or linked to identify any individual research participant. If you are following the steps outlined in this book, you have already removed any direct identifiers after collecting the data. @@ -388,7 +388,7 @@ \subsection{De-identifying data for publication} There will almost always be a trade-off between accuracy and privacy. For publicly disclosed data, you should favor privacy. -Stripping identifying variables from a data set may not be sufficient to protect respondent privacy, +Stripping identifying variables from a dataset may not be sufficient to protect respondent privacy, due to the risk of re-identification. One potential solution is to add noise to data, as the US Census Bureau has proposed.\cite{abowd2018us} This makes the trade-off between data accuracy and privacy explicit. @@ -400,7 +400,7 @@ \subsection{De-identifying data for publication} so that the process can be reviewed, revised, and updated as necessary. In cases where confidential data is required for analysis, -we recommend embargoing sensitive or access-restricted variables when publishing the data set. +we recommend embargoing sensitive or access-restricted variables when publishing the dataset. Access to the embargoed data could be granted for specific purposes, such as a computational reproducibility check required for publication, if done under careful data security protocols and approved by an IRB. 
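As a very simplified illustration of the kind of screening that supports these decisions, the sketch below flags combinations of characteristics shared by only a few respondents; the variable names are hypothetical, and dedicated tools such as sdcMicro compute proper disclosure-risk measures.

* Sketch: flag rare combinations of indirect identifiers before release
use "${data}/final/analysis-deidentified.dta", clear
bysort district occupation : generate cell_size = _N
tabulate cell_size
* Combinations shared by only a handful of respondents are candidates
* for coarsening, top-coding, or embargo before publication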
diff --git a/chapters/research-design.tex b/chapters/research-design.tex index 4f2df3fd2..1c4e4364a 100644 --- a/chapters/research-design.tex +++ b/chapters/research-design.tex @@ -597,7 +597,7 @@ \subsection{Synthetic controls} The counterfactual blend is chosen by optimizing the prediction of past outcomes based on the potential input characteristics, and typically selects a small set of comparators to weight into the final analysis. -These data sets therefore may not have a large number of variables or observations, +These datasets therefore may not have a large number of variables or observations, but the extent of the time series both before and after the implementation of the treatment are key sources of power for the estimate, as are the number of counterfactual units available. diff --git a/chapters/sampling-randomization-power.tex b/chapters/sampling-randomization-power.tex index 2f7dbdd51..62ba2811e 100644 --- a/chapters/sampling-randomization-power.tex +++ b/chapters/sampling-randomization-power.tex @@ -114,8 +114,8 @@ \subsection{Implementing random processes reproducibly in Stata} Since the exact order must be unchanged, the underlying data itself must be unchanged as well between runs. This means that if you expect the number of observations to change (for example increase during ongoing data collection) your randomization will not be stable unless you split your data up into -smaller fixed data sets where the number of observations does not change. You can combine all -those smaller data sets after your randomization. +smaller fixed datasets where the number of observations does not change. You can combine all +those smaller datasets after your randomization. In Stata, the only way to guarantee a unique sorting order is to use \texttt{isid [id\_variable], sort}. (The \texttt{sort, stable} command is insufficient.) You can additionally use the \texttt{datasignature} command to make sure the @@ -184,16 +184,16 @@ \subsection{Sampling} \url{https://dimewiki.worldbank.org/Sampling_\%26_Power_Calculations}} \index{sampling} That master list may be called a \textbf{sampling universe}, a \textbf{listing frame}, or something similar. -We recommend that this list be organized in a \textbf{master data set}\sidenote{ +We recommend that this list be organized in a \textbf{master dataset}\sidenote{ \url{https://dimewiki.worldbank.org/Master_Data_Set}}, creating an authoritative source for the existence and fixed characteristics of each of the units that may be surveyed.\sidenote{ \url{https://dimewiki.worldbank.org/Unit_of_Observation}} -The master data set indicates how many individuals are eligible for data collection, +The master dataset indicates how many individuals are eligible for data collection, and therefore contains statistical information about the likelihood that each will be chosen. The simplest form of random sampling is \textbf{uniform-probability random sampling}. -This means that every observation in the master data set +This means that every observation in the master dataset has an equal probability of being included in the sample. The most explicit method of implementing this process is to assign random numbers to all your potential observations, @@ -301,8 +301,8 @@ \subsection{Clustering} Clustering is procedurally straightforward in Stata, although it typically needs to be performed manually. To cluster a sampling or randomization, -create or use a data set where each cluster unit is an observation, -randomize on that data set, and then merge back the results. 
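A sketch of that procedure is below, combining the reproducibility steps described in this chapter (versioning, a seed, and a unique sort) with a cluster-level assignment that is merged back to the individual-level master data; the seed, file names, and variable names are all hypothetical.

* Sketch: cluster-level random assignment merged back to individual data
ieboilstart , v(13.1)                        // set version for reproducibility
`r(version)'
set seed 215597                              // example only: draw and document your own seed

use "${data}/master-villages.dta", clear     // one observation per cluster
isid village_id, sort                        // unique ID and reproducible sort order
generate rand = runiform()
sort rand
generate treatment = (_n <= _N/2)            // assign half of the villages to treatment
keep village_id treatment
tempfile assignment
save `assignment'

use "${data}/master-households.dta", clear   // individual-level master dataset
merge m:1 village_id using `assignment', nogenerate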
+create or use a dataset where each cluster unit is an observation,
+randomize on that dataset, and then merge back the results.
When sampling or randomization is conducted using clusters, the clustering variable should be clearly identified since it will need to be used in subsequent statistical analysis.
diff --git a/code/code.do b/code/code.do
index f18b9adc3..4c217674f 100644
--- a/code/code.do
+++ b/code/code.do
@@ -1,4 +1,4 @@
-* Load the auto data set
+* Load the auto dataset
sysuse auto.dta , clear
* Run a simple regression
diff --git a/code/replicability.do b/code/replicability.do
index c1d166b5b..fe05a0778 100644
--- a/code/replicability.do
+++ b/code/replicability.do
@@ -2,7 +2,7 @@
ieboilstart , v(13.1)
`r(version)'
-* Load the auto data set (auto.dta is a test data set included in all Stata installations)
+* Load the auto dataset (auto.dta is a test dataset included in all Stata installations)
sysuse auto.dta , clear
* SORTING - sort on the uniquely identifying variable "make"
diff --git a/code/simple-sample.do b/code/simple-sample.do
index e4237b088..38c148a7e 100644
--- a/code/simple-sample.do
+++ b/code/simple-sample.do
@@ -11,7 +11,7 @@ sort sample_rand // Sort based on the random number
* Use the sort order to sample 20% (0.20) of the observations. _N in
-* Stata is the number of observations in the active data set , and _n
+* Stata is the number of observations in the active dataset , and _n
* is the row number for each observation. The bpwide.dta has 120
* observations, 120*0.20 = 24, so (_n <= _N * 0.20) is 1 for observations
* with a row number equal to or less than 24, and 0 for all other
diff --git a/code/stata-before-saving.do b/code/stata-before-saving.do
index 49a268bac..85f14a42a 100644
--- a/code/stata-before-saving.do
+++ b/code/stata-before-saving.do
@@ -1,4 +1,4 @@
-* If the data set has ID variables, test if they uniquely identifying the observations.
+* If the dataset has ID variables, test if they uniquely identify the observations.
local idvars household_ID household_member year
isid `idvars'
diff --git a/code/stata-comments.do b/code/stata-comments.do
index e494462ff..911d48e2a 100644
--- a/code/stata-comments.do
+++ b/code/stata-comments.do
@@ -15,5 +15,5 @@ TYPE 2:
TYPE 3:
-* Open the data set
-    sysuse auto.dta // Built in data set (This comment is used to document a single line)
+* Open the dataset
+    sysuse auto.dta // Built-in dataset (This comment is used to document a single line)
From 95f58883fbd3a74e34444203cb7ee0a49aa9ff6a Mon Sep 17 00:00:00 2001
From: Luiza Andrade
Date: Tue, 25 Feb 2020 20:48:14 -0500
Subject: [PATCH 846/854] Update chapters/handling-data.tex
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Co-Authored-By: Kristoffer Bjärkefur
---
 chapters/handling-data.tex | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/handling-data.tex b/chapters/handling-data.tex
index 5bc4ccd06..50c494f82 100644
--- a/chapters/handling-data.tex
+++ b/chapters/handling-data.tex
@@ -353,7 +353,7 @@ \subsection{Transmitting and storing data securely}
inside that secure environment. However, password-protection alone is not sufficient, because if the underlying data is obtained through a leak the information itself remains usable.
-datasets that include confidential information +Datasets that include confidential information \textit{must} therefore be \textbf{encrypted}\sidenote{ \textbf{Encryption:} Methods which ensure that files are unreadable even if laptops are stolen, databases are hacked, or any other type of unauthorized access is obtained. \url{https://dimewiki.worldbank.org/Encryption}} From 5daefd3fd13ad9a0f3365d35af018ccaa5eb36c1 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Kristoffer=20Bj=C3=A4rkefur?= Date: Tue, 25 Feb 2020 21:37:48 -0500 Subject: [PATCH 847/854] [ch6] side note looked bad with a line brake --- chapters/data-analysis.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index d1270140f..1ffd66223 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -726,7 +726,7 @@ \subsection{Exporting analysis outputs} This means it should be easy to read and understand them with only the information they contain. Make sure labels and notes cover all relevant information, such as sample, unit of observation, unit of measurement and variable definition.\sidenote{ - \url{https://dimewiki.worldbank.org/Checklist:\_Reviewing\_Graphs} \\ + \url{https://dimewiki.worldbank.org/Checklist:\_Reviewing\_Graphs} and \url{https://dimewiki.worldbank.org/Checklist:\_Submit\_Table}} If you follow the steps outlined in this chapter, From 9d59978d7848d4ac8f89c07e927ca3af4d83d2c1 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 26 Feb 2020 11:13:19 -0500 Subject: [PATCH 848/854] Update chapters/data-analysis.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- chapters/data-analysis.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 1ffd66223..5b648a7a3 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -489,7 +489,7 @@ \subsection{Integrating different data sources} Merging, reshaping and aggregating data sets can change both the total number of observations and the number of observations with missing values. Make sure to read about how each command treats missing observations and, -whenever possible, add automated checks in the script that throw an error message if the result is changing. +whenever possible, add automated checks in the script that throw an error message if the result is different than what you expect. If you are subsetting your data, drop observations explicitly, indicating why you are doing that and how the data set changed. From 6a7bff83f5c2fae830eeab6b144479c00a09ed13 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 26 Feb 2020 11:13:27 -0500 Subject: [PATCH 849/854] Update chapters/data-analysis.tex MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Kristoffer Bjärkefur --- chapters/data-analysis.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 5b648a7a3..d98110934 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -473,7 +473,7 @@ \subsection{Integrating different data sources} the challenge usually occurs in correctly defining statistical aggregations if the merge is intended to result in a dataset at the provider level. 
However, other cases may not be designed with the intention to be merged together, -such as a dataset of infrastructure access points such as water pumps or schools +such as a dataset of infrastructure access points, for example, water pumps or schools and a dataset of household locations and roads. In those cases, a key part of the research contribution is figuring out what a useful way to combine the datasets is. From b18ab1bf5bf5dc4549d1508e1058632121587281 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Wed, 26 Feb 2020 11:29:57 -0500 Subject: [PATCH 850/854] Update chapters/data-analysis.tex Co-Authored-By: Luiza Andrade --- chapters/data-analysis.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index f5e00a22a..f3c335e9d 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -90,7 +90,7 @@ \subsection{Organizing your folder structure} \subsection{Breaking down tasks} -We divide the process of transforming raw datasets to analysis-ready datasets into four steps: +We divide the process of transforming raw datasets to analysis-ready datasets to research results into four steps: de-identification, data cleaning, variable construction, and data analysis. Though they are frequently implemented concurrently, creating separate scripts and datasets prevents mistakes. From ee7302e962a8ea28d67d0be0bc7ea88112b3fc1c Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Wed, 26 Feb 2020 13:58:32 -0500 Subject: [PATCH 851/854] [ch6] taxpayer not tax payer --- chapters/data-analysis.tex | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index b0cd618f1..3604b3c28 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -206,7 +206,8 @@ \section{De-identifying research data} as you can always go back and remove variables from the list of variables to be dropped, but you can not go back in time and drop a PII variable that was leaked because it was incorrectly kept. -Examples include respondent names and phone numbers, enumerator names, tax payer numbers, and addresses. +Examples include respondent names and phone numbers, enumerator names, taxpayer +numbers, and addresses. For each confidential variable that is needed in the analysis, ask yourself: \textit{can I encode or otherwise construct a variable that masks the confidential component, and then drop this variable?} From 41361235c41b3c568979f06ab92b0259ae8880ba Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Wed, 26 Feb 2020 13:59:37 -0500 Subject: [PATCH 852/854] [ch6] missing "you" --- chapters/data-analysis.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 3604b3c28..11f0bda71 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -363,7 +363,7 @@ \subsection{Documenting data cleaning} or that you intend to release as part of a replication package or data publication. Another important component of data cleaning documentation are the results of data exploration. -As clean your dataset, take the time to explore the variables in it. +As you clean your dataset, take the time to explore the variables in it. Use tabulations, summary statistics, histograms and density plots to understand the structure of data, and look for potentially problematic patterns such as outliers, missing values and distributions that may be caused by data entry errors. 
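To make the data exploration described in the hunk above concrete, a short Stata sketch of the kind of checks that belong in a cleaning script follows; the dataset and variable names are hypothetical and not drawn from the book's code.

* Explore a cleaned dataset before construction (file and variable names are hypothetical)
use "${data}/cleaned/household_clean.dta", clear

* Tabulate categorical variables, including missing values, to check their structure
tabulate district, missing

* Summarize continuous variables with percentile detail to flag potential outliers
summarize income, detail

* Plot distributions to spot spikes or gaps that may be caused by data entry errors
histogram income, percent
kdensity income

* Count missing values so unexpected gaps are documented alongside the cleaning code
count if missing(income)

Saving these tabulations and plots alongside the cleaning code is one convenient way to document the exploration.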
From 08416b1df393d84c30326efd3fbc4b5abb890a98 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Wed, 26 Feb 2020 14:02:47 -0500 Subject: [PATCH 853/854] [ch6] remove ref to example that were removed --- chapters/data-analysis.tex | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 11f0bda71..3eb609d77 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -381,9 +381,9 @@ \section{Constructing analysis datasets} as planned during research design\index{Research design}, and using the pre-analysis plan as a guide.\index{Pre-analysis plan} During this process, the data points will typically be reshaped and aggregated -so that level of the dataset goes from the unit of observation -(one item in the bundle) in the survey to the unit of analysis (the household).\sidenote{ - \url{https://dimewiki.worldbank.org/Unit\_of\_Observation}} +so that level of the dataset goes from the unit of observation in the survey +to the unit of analysis.\sidenote{\url{ +https://dimewiki.worldbank.org/Unit\_of\_Observation}} A constructed dataset is built to answer an analysis question. From c783462fdd2420da73fc2460fefe7f7099b9deaa Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Wed, 26 Feb 2020 14:04:34 -0500 Subject: [PATCH 854/854] [ch6] raw to research --- chapters/data-analysis.tex | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/chapters/data-analysis.tex b/chapters/data-analysis.tex index 3eb609d77..62e63da89 100644 --- a/chapters/data-analysis.tex +++ b/chapters/data-analysis.tex @@ -90,7 +90,8 @@ \subsection{Organizing your folder structure} \subsection{Breaking down tasks} -We divide the process of transforming raw datasets to analysis-ready datasets to research results into four steps: +We divide the process of transforming raw datasets to research outputs into +four steps: de-identification, data cleaning, variable construction, and data analysis. Though they are frequently implemented concurrently, creating separate scripts and datasets prevents mistakes.
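The construction step described above, in which the data are aggregated from the unit of observation to the unit of analysis, could be sketched in Stata as follows. All dataset, variable, and path names are illustrative placeholders; household_ID simply mirrors the ID variable used in the code examples earlier in this patch series.

* Aggregate member-level records (unit of observation) to one row per household (unit of analysis)
use "${data}/cleaned/member_level.dta", clear

* Construct household-level variables from the member-level data
collapse (sum) hh_income = member_income       ///
         (max) any_enrolled = school_enrolled  ///
         (count) n_members = member_id, by(household_ID)

* Label the constructed variables so the dataset documents itself
label variable hh_income    "Total household income"
label variable any_enrolled "Any member enrolled in school"
label variable n_members    "Number of household members"

* Save the constructed dataset that the analysis scripts will use
save "${data}/constructed/household_level.dta", replace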