
<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <meta name="generator" content="pandoc">
    <title>Software Carpentry: R for reproducible scientific analysis</title>
    <link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap.css" />
    <link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap-responsive.css" />
    <link rel="stylesheet" type="text/css" href="css/swc.css" />
    <link rel="stylesheet" type="text/css" href="css/swc-workshop-and-lesson.css" />
    <link rel="stylesheet" type="text/css" href="css/lesson.css" />
    <link rel="alternate" type="application/rss+xml" title="Software Carpentry Blog" href="http://software-carpentry.org/feed.xml"/>
    <meta charset="UTF-8" />
    <!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
    <!--[if lt IE 9]>
      <script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
    <![endif]-->
  </head>
  <body class="lesson">
    <div class="container container-full-width card">
      <div class="banner">
        <a href="http://software-carpentry.org" title="Software Carpentry">
          <img alt="Software Carpentry banner" src="img/software-carpentry-banner.png" />
        </a>
      </div>
      <div class="row-fluid">
        <div class="span10 offset1">
          <h1 class="title">R for reproducible scientific analysis</h1>
          <h2 class="subtitle">Project management with RStudio</h2>
<div id="learning-objectives" class="objectives">
<h2>Learning Objectives</h2>
<ul>
<li>To be able to create self-contained projects in RStudio</li>
<li>To be able to use git from within RStudio</li>
</ul>
</div>
<h3 id="introduction">Introduction</h3>
<p>The scientific process is naturally incremental, and many projects start life as random notes, some code, then a manuscript, and eventually everything is a bit mixed together.</p>
<blockquote class="twitter-tweet">
<p>
Managing your projects in a reproducible fashion doesn't just make your science reproducible, it makes your life easier.
</p>
— Vince Buffalo (<span class="citation">@vsbuffalo</span>) <a href="https://twitter.com/vsbuffalo/status/323638476153167872">April 15, 2013</a>
</blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Most people tend to organize their projects like this:</p>
<div class="figure">
<img src="img/bad_layout.png" />

</div>
<p>There are many reasons why we should <em>ALWAYS</em> avoid this:</p>
<ol style="list-style-type: decimal">
<li>It is really hard to tell which version of your data is the original and which is the modified;</li>
<li>It gets really messy because it mixes files with various extensions together;</li>
<li>It probably takes you a lot of time to find things, and to relate each figure to the exact code that was used to generate it.</li>
</ol>
<p>A good project layout will ultimately make your life easier:</p>
<ul>
<li>It will help ensure the integrity of your data;</li>
<li>It makes it simpler to share your code with someone else (a lab-mate, collaborator, or supervisor);</li>
<li>It allows you to easily upload your code with your manuscript submission;</li>
<li>It makes it easier to pick the project back up after a break.</li>
</ul>
<h3 id="the-rstudio-solution">The RStudio solution</h3>
<p>Fortunately, RStudio can help you manage your code effectively.</p>
<p>One of the most powerful and useful aspects of RStudio is its project management functionality. We'll be using this today to create a self-contained, reproducible project.</p>
<div id="challenge-installing-packages" class="challenge">
<h4>Challenge: Installing Packages</h4>
<p>The first thing we're going to do is install a third-party package, <code>packrat</code>. This allows RStudio to create self-contained projects: any further packages you download will be contained within their respective projects. This is really useful, because different versions of a package can give different results as the package is revised. Keeping packages inside the project lets you easily keep track of the versions used for your analyses.</p>
<p>Packages can be installed using the <code>install.packages</code> function:</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">install.packages</span>(<span class="st">&quot;packrat&quot;</span>) </code></pre>
</div>
<div id="challenge-creating-a-self-contained-project" class="challenge">
<h4>Challenge: Creating a self-contained project</h4>
<p>Now we're going to create a new project in RStudio:</p>
<ol style="list-style-type: decimal">
<li>Click the &quot;File&quot; menu button, then &quot;New Project&quot;.</li>
<li>Click &quot;New Directory&quot;.</li>
<li>Click &quot;Empty Project&quot;.</li>
<li>Type in the name of the directory to store your project, e.g. &quot;my_project&quot;.</li>
<li>Make sure that the checkboxes for &quot;Create a git repository&quot; and &quot;Use packrat with this project&quot; are selected.</li>
<li>Click the &quot;Create Project&quot; button.</li>
</ol>
</div>
<p>Now when we start R in this project directory, or open this project with RStudio, all of our work on this project will be entirely self-contained in this directory. By installing <code>packrat</code> and telling RStudio to use <code>packrat</code> with this project, any third-party packages will be installed in a separate library in the <code>packrat/</code> subdirectory of our project. This means we don't have to worry about package versions changing, especially when returning to a project after a long period of time (for example when writing up your thesis!).</p>
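<p>To check which package versions a session is actually using, base R provides <code>packageVersion</code> and <code>sessionInfo</code>. The following is a minimal sketch and is not part of packrat itself; it uses the built-in <code>stats</code> package only so that it runs anywhere, and the variable name <code>stats_version</code> is just for illustration:</p>
<pre class="sourceCode r"><code class="sourceCode r"># Version of a single package (here the built-in stats package)
stats_version &lt;- packageVersion(&quot;stats&quot;)
print(stats_version)

# R version plus every attached package and its version
sessionInfo()</code></pre>
<p>Saving this output alongside your results gives you a plain-text record of the versions an analysis was run with.</p>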
<div id="tip-packrat" class="callout">
<h4>Tip: packrat</h4>
<p>Any packages you already have installed outside of your project will need to be reinstalled in each packrat project.</p>
<p>Packrat will also analyse your script files and warn you if you're using any packages that are not installed and managed inside your project. This is useful if you reuse code between projects.</p>
<p>Packrat also allows you to easily bundle up a project to share with someone else.</p>
<p>RStudio has a more detailed <a href="http://rstudio.github.io/packrat/">packrat tutorial</a>.</p>
</div>
<p>Now let's load the packrat package:</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(<span class="st">&quot;packrat&quot;</span>)</code></pre>
<p>Here we've called the function <code>library</code> and used it to load the package into our interactive R session. This means all of its functions are now available to us.</p>
<p>The main function you'll encounter in <code>packrat</code> is the <code>status</code> function:</p>
<pre class="sourceCode r"><code class="sourceCode r">packrat::<span class="kw">status</span>()</code></pre>
<pre class="output"><code>Up to date.</code></pre>
<p>Here I've put the name of the package in front of its function, separated by <code>::</code>. This explicitly tells R to call the function from that package. This can make your code clearer (<code>status</code> is a fairly generic function name and might be used by other packages), and it is useful when two packages have functions with the same name (in which case the order in which the packages are loaded becomes important), or when you've written your own function or variable with the same name (you should try to avoid this).</p>
<p>You'll want to run <code>packrat::status()</code> periodically (after installing packages and writing new code) to make sure your project is still self-contained.</p>
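<p>To illustrate the <code>::</code> syntax with something that runs without any extra packages, here is a small sketch that deliberately defines a function with the same name as <code>sd</code> from the built-in <code>stats</code> package. The names <code>ours</code> and <code>theirs</code> are only for illustration:</p>
<pre class="sourceCode r"><code class="sourceCode r"># Our own function that happens to share a name with stats::sd
sd &lt;- function(x) &quot;my own sd&quot;

ours   &lt;- sd(c(1, 2, 3))         # calls our definition
theirs &lt;- stats::sd(c(1, 2, 3))  # explicitly calls the version in the stats package

ours
theirs

# Remove our definition so that sd() refers to the stats version again
rm(sd)</code></pre>
<p>Without the <code>stats::</code> prefix, R uses whichever <code>sd</code> it finds first, which here is the one we defined ourselves.</p>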
<h3 id="best-practices-for-project-organisation">Best practices for project organisation</h3>
<p>Although there is no &quot;best&quot; way to lay out a project, there are some general principles to adhere to that will make project management easier:</p>
<h4 id="treat-data-as-read-only">Treat data as read only</h4>
<p>This is probably the most important goal of setting up a project. Data is typically time consuming and/or expensive to collect. Working with it interactively (e.g., in Excel) where it can be modified means you are never sure of where the data came from, or how it has been modified since collection. It is therefore a good idea to treat your data as &quot;read-only&quot;.</p>
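<p>One way to back up this habit, sketched here as an optional extra, is to remove write permission from raw data files using base R's <code>Sys.chmod</code>. The file name and contents below are made up, and the file is written to a temporary directory so the example is self-contained. How file permissions behave differs between operating systems, so treat this as an illustration and not a guarantee:</p>
<pre class="sourceCode r"><code class="sourceCode r"># Stand-in for a raw data file; in a real project this would live in data/
raw_file &lt;- file.path(tempdir(), &quot;raw-measurements.csv&quot;)
write.csv(data.frame(id = 1:3, value = c(2.5, 3.1, 4.8)), raw_file, row.names = FALSE)

# Remove write permission so the file is harder to overwrite by accident
Sys.chmod(raw_file, mode = &quot;0444&quot;)

# Reading still works as normal
raw_data &lt;- read.csv(raw_file)
raw_data</code></pre>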
<h4 id="data-cleaning">Data Cleaning</h4>
<p>In many cases your data will be &quot;dirty&quot;: it will need significant preprocessing to get into a format R (or any other programming language) will find useful. This task is sometimes called &quot;data munging&quot;. I find it useful to store the preprocessing scripts in a separate folder, and to create a second &quot;read-only&quot; data folder to hold the &quot;cleaned&quot; data sets.</p>
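<p>A munging script following this pattern might look like the sketch below. The folder names, file name, and data are all made up for illustration, and everything is created under a temporary directory so the example can be run as is. The key point is that the raw file is only read, and the cleaned result is written somewhere else:</p>
<pre class="sourceCode r"><code class="sourceCode r"># Hypothetical folders, created under tempdir() so the sketch is self-contained
raw_dir     &lt;- file.path(tempdir(), &quot;data&quot;)
cleaned_dir &lt;- file.path(tempdir(), &quot;cleaned-data&quot;)
dir.create(raw_dir, showWarnings = FALSE)
dir.create(cleaned_dir, showWarnings = FALSE)

# A small made-up &quot;dirty&quot; data set: one missing value, inconsistent capitalisation
write.csv(
  data.frame(site = c(&quot;a&quot;, &quot;A&quot;, &quot;b&quot;), count = c(10, NA, 7)),
  file.path(raw_dir, &quot;counts.csv&quot;),
  row.names = FALSE
)

# The munging step: read the raw file, clean it, write the result elsewhere
raw &lt;- read.csv(file.path(raw_dir, &quot;counts.csv&quot;))
cleaned &lt;- raw[!is.na(raw$count), ]     # drop rows with a missing count
cleaned$site &lt;- toupper(cleaned$site)   # make site labels consistent
write.csv(cleaned, file.path(cleaned_dir, &quot;counts.csv&quot;), row.names = FALSE)</code></pre>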
<h4 id="treat-generated-output-as-disposable">Treat generated output as disposable</h4>
<p>Anything generated by your scripts should be treated as disposable: it should all be able to be regenerated from your scripts.</p>
<p>There are lots of different ways to manage this output. I find it useful to have an output folder with different sub-directories for each separate analysis. This makes it easier later, as many of my analyses are exploratory and don't end up being used in the final project, and some of the analyses get shared between projects.</p>
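<p>As a rough sketch of one sub-directory per analysis, the following creates an output folder for a hypothetical analysis called <code>exploratory-histograms</code> and regenerates a figure into it. The names are made up, and the folder sits under a temporary directory so the example is self-contained:</p>
<pre class="sourceCode r"><code class="sourceCode r"># Hypothetical layout: output/&lt;analysis-name&gt;/, rooted in tempdir() for this sketch
analysis_name &lt;- &quot;exploratory-histograms&quot;
out_dir &lt;- file.path(tempdir(), &quot;output&quot;, analysis_name)
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

# Regenerate a figure from code; deleting the file loses nothing
set.seed(1)
pdf(file.path(out_dir, &quot;histogram.pdf&quot;))
hist(rnorm(100), main = &quot;Example output&quot;)
invisible(dev.off())

list.files(out_dir)</code></pre>
<p>Because the figure is produced entirely by the script, the whole <code>output</code> folder can be deleted and rebuilt at any time.</p>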
<h4 id="separate-function-definition-and-application">Separate function definition and application</h4>
<p>The most effective way I find to work in R is to play around in the interactive session, then copy commands across to a script file when I'm sure they work and do what I want. You can also save all the commands you've entered using the <code>history</code> command, but I don't find it useful because when I'm typing it's 90% trial and error.</p>
<p>When your project is new and shiny, the script file usually contains many lines of directly executed code. As it matures, reusable chunks get pulled into their own functions. It's a good idea to keep these in separate folders: one to store useful functions that you'll reuse across analyses and projects, and one to store the analysis scripts.</p>
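<p>The split between definition and application can be sketched as follows. The function <code>standard_error</code> and the file name <code>standard-error.R</code> are hypothetical, and the <code>functions</code> folder is created under a temporary directory so that the example runs on its own:</p>
<pre class="sourceCode r"><code class="sourceCode r"># Hypothetical functions/ folder, created under tempdir() for this sketch
fun_dir &lt;- file.path(tempdir(), &quot;functions&quot;)
dir.create(fun_dir, showWarnings = FALSE)

# functions/standard-error.R would contain only the definition
writeLines(
  &quot;standard_error &lt;- function(x) sd(x) / sqrt(length(x))&quot;,
  file.path(fun_dir, &quot;standard-error.R&quot;)
)

# An analysis script in scripts/ would then load the definition and apply it
source(file.path(fun_dir, &quot;standard-error.R&quot;))
se &lt;- standard_error(c(2, 4, 6, 8))
se</code></pre>
<p>Keeping the definition in its own file means several analysis scripts can <code>source</code> the same function without copying it.</p>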
<div id="tip-avoiding-duplication" class="callout">
<h4>Tip: avoiding duplication</h4>
<p>You may find yourself using data or analysis scripts across several projects. Typically you want to avoid duplication to save space and avoid having to make updates to code in multiple places.</p>
<p>In this case I find it useful to make &quot;symbolic links&quot;, which are essentially shortcuts to files somewhere else on a filesystem. On Linux and OS X you can use the <code>ln -s</code> command, and on Windows you can either create a shortcut or use the <code>mklink</code> command from the Windows terminal.</p>
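<p>Symbolic links can also be created from within R using the base function <code>file.symlink</code>. The sketch below uses made-up file names in a temporary directory. On Windows, creating symbolic links may need extra privileges, so this may not work there without changes:</p>
<pre class="sourceCode r"><code class="sourceCode r"># A stand-in for a file shared between projects
shared_file &lt;- file.path(tempdir(), &quot;shared-functions.R&quot;)
writeLines(&quot;# functions reused across projects&quot;, shared_file)

# The link that would live inside one particular project
link_path &lt;- file.path(tempdir(), &quot;linked-functions.R&quot;)
unlink(link_path)  # remove any link left over from an earlier run
linked &lt;- file.symlink(from = shared_file, to = link_path)

# TRUE if the link was created; reading the link reads the shared file
linked
readLines(link_path)</code></pre>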
</div>
<h4 id="version-control">Version Control</h4>
<p>We also set up our project to integrate with git, putting it under version control. RStudio has a nicer interface to git than the shell, but it is limited in what it can do, so you will find yourself occasionally needing to use the shell. Let's go through and make an initial commit of our template files.</p>
<p>The workspace/history pane has a tab for &quot;Git&quot;. We can stage each file by checking the box: you will see a green &quot;A&quot; next to staged files and folders, and yellow question marks next to files or folders git doesn't know about yet. RStudio also nicely shows you the difference between files from different commits.</p>
<div id="tip-versioning-disposable-output" class="callout">
<h4>Tip: versioning disposable output</h4>
<p>Generally you do not want to version disposable output (or read-only data). You should modify the <code>.gitignore</code> file to tell git to ignore these files and directories.</p>
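<p>The <code>.gitignore</code> file is plain text with one pattern per line, so it can be edited in any text editor or from R. As a minimal sketch, assuming a hypothetical project folder under a temporary directory, the following appends a single pattern for an <code>output/</code> directory:</p>
<pre class="sourceCode r"><code class="sourceCode r"># Hypothetical project root, placed under tempdir() for this sketch
project_dir &lt;- file.path(tempdir(), &quot;my_project&quot;)
dir.create(project_dir, showWarnings = FALSE)

# Append one ignore pattern; the trailing slash means it only matches directories
ignore_file &lt;- file.path(project_dir, &quot;.gitignore&quot;)
cat(&quot;output/\n&quot;, file = ignore_file, append = TRUE)

readLines(ignore_file)</code></pre>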
</div>
<div id="challenge-1" class="challenge">
<h4>Challenge 1</h4>
<p>Use packrat to install the packages we'll be using later:</p>
<ul>
<li>ggplot2</li>
<li>plyr</li>
</ul>
<p>Note: if you run <code>packrat::status()</code> it will warn you that these packages are unnecessary because they're not used in any project code.</p>
</div>
<div id="challenge-2" class="challenge">
<h4>Challenge 2</h4>
<p>Create the following directories (either through bash or RStudio) for the project:</p>
<ul>
<li><code>data/</code> for storing read-only data</li>
<li><code>munge/</code> for storing preprocessing scripts</li>
<li><code>cleaned-data/</code> for storing read-only preprocessed data</li>
<li><code>functions/</code> for storing any functions we write</li>
<li><code>scripts/</code> for storing analysis scripts</li>
<li><code>output/</code> for storing any output we generate</li>
</ul>
</div>
<div id="challenge-3" class="challenge">
<h4>Challenge 3</h4>
<p>Modify the <code>.gitignore</code> file to contain <code>data/</code>, <code>cleaned-data/</code> and <code>output/</code> so that read-only data and disposable output aren't versioned.</p>
<p>Add the newly created folders to version control using the git interface.</p>
</div>
        </div>
      </div>
      <div class="footer">
        <a class="label swc-blue-bg" href="http://software-carpentry.org">Software Carpentry</a>
        <a class="label swc-blue-bg" href="https://github.com/swcarpentry/lesson-template">Source</a>
        <a class="label swc-blue-bg" href="mailto:admin@software-carpentry.org">Contact</a>
        <a class="label swc-blue-bg" href="LICENSE.html">License</a>
      </div>
    </div>
    <!-- Javascript placed at the end of the document so the pages load faster -->
    <script src="http://software-carpentry.org/v5/js/jquery-1.9.1.min.js"></script>
    <script src="http://software-carpentry.org/v5/js/bootstrap/bootstrap.min.js"></script>
  </body>
</html>