# Make-based RNA-seq Analysis Workflow

A cufflinks-based RNAseq differential expression workflow automated using Makefiles.

## Prerequisites

This guide assumes the reader has some basic unix knowledge. At a minimum, the reader should have some understanding of the unix command line, the unix file system, and secure shell.

## Using this Guide

This guide, example files, and an example implementation are included in a git repository. You can get a local copy by running the following command:

~~~~~~~~~~~~~~~~~~~~~
git clone git@github.com:MarcusWalz/MakeRNAseq.git
~~~~~~~~~~~~~~~~~~~~~

Some of the features of this repository include:

* `README.mkd`, this very file.
* `expirement/Makefile`, the RNAseq Makefile itself.
* `examples/`, a directory containing several example makefiles to help you get familiar with Make's features.

## Intro

Traditional bioinformatic workflows are implemented either with scripting (e.g. bash, python, etc.) or with a GUI program such as Galaxy that composes different analysis tools together.

While Galaxy works well for simple analyses, it takes considerable expertise to integrate custom and unsupported analysis programs into Galaxy workflows--meaning that it is often simpler to script or to just execute a workflow by hand. However, scripting or executing a workflow manually is difficult because it requires the developer to imperatively encode the workflow from start to finish; iteratively developing such a pipeline creates an organizational and computational bottleneck. For larger datasets, scripting is further complicated by the need to use high performance computing infrastructure.

We were able to simplify the construction of custom bioinformatic workflows by tracking the input and output files of each step of the analysis to implicitly determine the order of execution. We do this with a common program called GNU Make (henceforth simply Make), which is found on nearly every unix computer.

Make is:

* Reproducible: Make ensures that the workflow's output actually reflects the most up-to-date code. In addition, a Makefile supports start-to-finish execution of the complete workflow.
* Simple: Make determines the workflow's execution order on its own, meaning that the developer does not need to worry about complicated for loops.
* Efficient: Make only performs the computations required to produce the desired outputs. In addition, Make can execute the pipeline in parallel automatically.
* Robust: Make can usually detect when a program crashes and stop downstream execution.

However, Make is designed for software compilation, not bioinformatic pipelines. At times, using Make will be a bit awkward.

We have successfully used this pipeline to analyze 500 gigabytes of RNAseq data.

## Make Basics

At its core, Make is very simple: by tracking dependencies, much of workflow programming becomes implicit.

### Terminology

Make is pretty simple; it boils down to the following:

Dependency (a.k.a. Prerequisite)
A dependency is an input file needed to construct a target. Dependencies can, themselves, be targets.
Target
A target is the file that will be generated by a rule. Make can only generate one target per rule.
Command
The shell command used to construct the target. Commands must be offset with a tab.
Rule
A rule is a single step in a workflow. It consists of a list of dependencies, a target, and a command.
Makefile
A file that stores multiple rules. Makefiles must be saved as `makefile` or `Makefile`.
Make Directory
The directory where the Makefile lives. The `make` command is ordinarily called from this directory.
### A simple make file

You can try running everything below yourself, assuming you have already cloned the repository. Simply cd to the directory `examples/simple`.

~~~~~~~~~~~~~~~~~~~~~
# this rule creates a file called Hello.
Hello : ;
	echo Hello > Hello

# this rule creates a file called World.
World : ;
	echo World > World

# This rule combines the files Hello and World.
# n.b. Hello and World are dependencies for the rule below.
HelloWorld : Hello World ;
	cat Hello World > HelloWorld
~~~~~~~~~~~~~~~~~~~~~

The target Hello can now be constructed (or "made") by executing the following command from the Makefile's directory:

~~~~~~~~~~~~~~~~~~~~~
$ make Hello
~~~~~~~~~~~~~~~~~~~~~

Make will output the commands used to construct Hello, i.e.:

~~~~~~~~~~~~~~~~~~~~~
echo Hello > Hello
~~~~~~~~~~~~~~~~~~~~~

The target HelloWorld can be constructed by the command:

~~~~~~~~~~~~~~~~~~~~~
$ make HelloWorld
~~~~~~~~~~~~~~~~~~~~~

Then Make executes two commands:

~~~~~~~~~~~~~~~~~~~~~
echo World > World
cat Hello World > HelloWorld
~~~~~~~~~~~~~~~~~~~~~

Notice how Make does not reconstruct the target `Hello`. 
Make is lazy and avoids superfluous computations.

To demonstrate, let's make `HelloWorld` again.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
$ make HelloWorld 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Make outputs:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
make: 'HelloWorld' is up to date.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This means that the target is younger than its dependencies in terms
of the dependencies' file modification timestamps. If a dependency
is itself a target, Make ensures that the dependency is also up-to-date.

As an exercise, edit the file `Hello` and make `HelloWorld` again.

Make only computes what it needs to compute, so the previous step
will only reconstruct the target `HelloWorld`. On a larger
scale, consider how a Make-based workflow would react
to switching to a more up-to-date genome. Make will see that the
genome files are younger than the targets that depend on them and
reconstruct only those targets. Make won't bother to rerun any read-level
quality control operations, since those targets are completely independent
of the genome data. Make knows exactly which steps of a workflow
to run and which not to rerun.

Make is able to declaratively encode a bioinformatic workflow quite
easily, simply by tracking the input and output files that each bioinformatic
program uses or produces.

Makefile syntax is very straightforward and very similar to shell scripts. Note that:

* Commands need to be offset by one or more old-fashioned tabs, i.e. `\t`. Spaces don't work.
* Variables need to be surrounded by parentheses, i.e. `$(foo)`, except for single-character
    macro variables like `$<` and `$@`.
* Arrays are simply variables with unescaped whitespace--just like shell scripts.
* Some variables work more like functions and can be used to manipulate other variables. E.g.
    `$(addsuffix .txt, hello)` is equivalent to `hello.txt`. The section on variables has more
    info on this.
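
For instance, here is a minimal sketch tying these notes together (the variable name and file names are hypothetical):

~~~~~~~~~~~~~~~~~~~~~
# a variable holding a word
greeting := Hello

# $(addsuffix .txt, $(greeting)) expands to the target name Hello.txt;
# the command line below is offset with a tab, not spaces
$(addsuffix .txt, $(greeting)) : ;
	echo $(greeting) > $@
~~~~~~~~~~~~~~~~~~~~~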

### The n most important rules of bioinformatic Makefiles

Make is powerful and flexible. From experience, it seems that Make
works best when you follow these commandments:

1. Thou shalt not set targets to directories.
2. Thou shalt not use recursion.
3. Thou shalt not use obscure Makefile extensions.
4. Thou shalt not modify a dependency within a rule's command.
5. Thou shalt not modify a dependency's dependency within a rule's command.
6. Thou shalt not modify a dependency's dependency's dependency within a rule's command.
7. etc.

Rules 4 through n can be summarized as: no interdependent relationships between
targets and dependencies. That is, in math lingo, Make works only for computations
that can be reduced to an acyclic graph.
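
For instance, here is a minimal sketch of a rule that violates commandment 4 by rewriting its own dependency (`trim_in_place` is a hypothetical command):

~~~~~~~~~~~~~~~~~~~~~
# BAD: trim_in_place rewrites reads.fq, so reads.fq is always younger
# than counts.txt and this rule re-runs on every invocation of make
counts.txt : reads.fq ;
	trim_in_place reads.fq
	wc -l reads.fq > counts.txt
~~~~~~~~~~~~~~~~~~~~~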

### Make, High Performance Computing, and Remote Execution

By using a scheduler such as `slurm` or a remote access tool 
such as `ssh` we are able to outsource work to other computers.

Some very expensive proprietary schedulers are even able to use makefiles as
job submission scripts.



As long as a command "blocks", i.e. does not terminate
until it is finished executing, it can be used within the workflow.
This means we can't simply submit "batch scripts" to the scheduler;
instead we submit interactive jobs. Interactive jobs work identically
to connecting to a remote server via ssh. However, there are some
syntactic caveats to be aware of.

Namely, pipes and redirections need to be enclosed in single quotes in order
for the entire sequence of computations to take place remotely.

Consider the following example: 

~~~~~~~~~~~~~~~~~~~~~
$ ssh -t myserver date > remote_time   # example 1
$ ssh -t myserver 'date > my_time'     # example 2
~~~~~~~~~~~~~~~~~~~~~


The first command:

1. Connects to `myserver` 
2. On myserver executes the command `date` 
3. Sends the output of `date` back to your computer.
4. Your computer sends the output to a file named `remote_time`.

I.e. the output is sent directly back to us.

While the second command:

1. Connects to `myserver`
2. Executes the command `date`
3. Saves the time as `my_time` on `myserver`'s filesystem.

I.e. the output of the command `date` never leaves the server.

But on an HPC cluster with a shared filesystem, both `my_time`
and `remote_time` will appear in your current directory. The first
command is significantly less efficient, since the data `date` produced gets
routed back to the local computer and then sent over the shared filesystem. In
the second command, the output of `date` bypasses the local computer and is written
directly to the shared filesystem. In high throughput situations, the first example is
disastrous and has resulted in the workflow crashing. It's best to avoid piping output
back altogether.

Using raw ssh gives us the ability to have certain steps of a workflow
execute on particular computers (e.g. your laptop or
the lab's web server). This is useful for less computationally intense
work that requires software that is difficult to get working on an
HPC cluster, e.g. graphing and gene set analysis software. Note that
you either need to mount the remote filesystem on your local computer, or
input files need to be moved to the remote server
explicitly and output files need to be moved back. This is easier
said than done. It's probably simplest to keep two Makefiles:
one Makefile for high throughput analysis and a second Makefile for
the analysis that can take place on a personal computer.
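
For instance, a minimal sketch of a rule that runs a plotting step on the lab server; it assumes a shared (or mounted) filesystem, and the host alias `labserver`, the script `scripts/plot.R`, and the file names are hypothetical:

~~~~~~~~~~~~~~~~~~~~~
# the single quotes keep the whole command, including the output redirection
# implicit in the script, on labserver
results/plots.pdf : results/gene_exp.diff scripts/plot.R ;
	ssh -t labserver 'Rscript scripts/plot.R results/gene_exp.diff results/plots.pdf'
~~~~~~~~~~~~~~~~~~~~~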

Otherwise, for UWM's Avi cluster, we prepend `salloc <scheduler PARAMS>` to the command
to execute it remotely. E.g. the following will request and wait for a node with 8 cores
and 22,000 megabytes of memory, and then execute `my_command`:

~~~~~~~~~~~~~~~~~~~
salloc -c 8 -N 1 --mem 22000 srun my_command
~~~~~~~~~~~~~~~~~~~

Note this is a single compound command: `salloc` executes `srun`, and `srun` executes `my_command`
on the resources `salloc` requisitioned. Once `my_command` has terminated, the resources are freed.

Make will run sequentially, not in parallel, unless the `-j` parameter is supplied with the maximum
number of rules to execute concurrently. E.g., the following constructs a target while executing at most
48 concurrent rules:

~~~~~~~~~~~~~~~~~~~~~
make -j 48 my_target
~~~~~~~~~~~~~~~~~~~~~

### Organizing the experiment

Another potential caveat of the Makefile approach is that
it requires you to organize your experiment's data files in a
way that is--well--very organized. See the example Makefile,
which goes in depth for a generic RNAseq design that analyzes
multiple experiments in parallel.

In short, files need to be in predictable places. As you move down
the directory tree you should move from more general files to
more specific files.

As a rule of thumb, if your Makefile has a lot of experiment-specific
rules or a lot of funky string manipulation going on, chances
are that moving a few files around will yield a simpler Makefile.
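
For example, a hypothetical layout (the directory names are illustrative, not prescribed by this repository) that moves from general to specific as you descend:

~~~~~~~~~~~~~~~~~~~~~
experiment/
  Makefile
  genome/                  # general: shared by every sample
    genome.fa
    genes.gtf
  scripts/                 # general: helper scripts
    tophat.sh
  samples/                 # specific: one directory per sample
    sampleA/
      reads/r1.fq.gz
      reads/r2.fq.gz
      tophat/accepted_hits.bam
~~~~~~~~~~~~~~~~~~~~~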
### Writing Rules

Makefiles should be easy to read and well documented. The great
thing about using Make is that an entire workflow can be
encoded in a single file.

The order of rules doesn't matter, but they should appear in the
same order they would execute.

E.g.:

1. Rule for downloading a sample from the sequencer.
2. Rule for generating a FastQC report.
3. Rule for removing adapters.
4. Rule for generating a FastQC report for the cleaned reads.
5. Rule to align reads to the genome.

But rules 2 and 4 are identical. It's best to move more general
rules to either the start or end of the Makefile. This means
certain bioinformatic operations become implicit based on the 
dependency requested. Here are some examples of general rules on
alignment files: 


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Convert serial (SAM) alignments to binary (BAM) alignments
%.bam : %.sam ;
	samtools view -b $< > $@

# Sort a BAM file by locus
%.sorted.bam : %.bam ;
	samtools sort -o $@ $<

# Index a sorted BAM file
%.sorted.bam.bai : %.sorted.bam ;
	samtools index $<
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

These are called pattern rules.
The percent sign is a wildcard that works similarly to the `*` wildcard in shell scripts.
`$<` is a macro for the first dependency. `$@` is a macro for the current target.

You can experiment with macros in `examples/macros`.

Suppose we have a file called `test.sam`; then running:

~~~~~~~~~~~~~~~~~~~~~~~~
$ make test.sorted.bam.bai
~~~~~~~~~~~~~~~~~~~~~~~~

Will execute the following commands:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
samtools view -b test.sam > test.bam
samtools sort -o test.sorted.bam test.bam
samtools index test.sorted.bam
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This order of execution gets determined by searching through dependencies'
dependencies until Make finds a rule where all
dependencies are satisfied. Make matches wildcard rules
as a last resort; that is, Make defaults to rules for targets
without wildcards whenever possible.


Here is an example of a more complicated rule:

~~~~~~~~~~~~~~~~
# Align reads to the genome with tophat via the script scripts/tophat.sh
samples/%/tophat/accepted_hits.bam : \
	$(genome_idx) \
	$(genome_gtf) \
	scripts/tophat.sh \
	samples/%/cleaned_reads/r1.fq.gz \
	samples/%/cleaned_reads/r2.fq.gz ;
	salloc -c 8 -N 1 --mem 21000 -J tophat srun \
	       scripts/tophat.sh $(@D)

~~~~~~~~~~~~~~~~~~~

Since there are a lot of dependencies, we keep the code readable by
spanning the list over multiple lines and escaping the newline
character with a `\ `. The scheduler-related portion of the command
has its own line. Finally, the actual command is offset by an additional
tab.


### Make Variables and Built-in Functions

Variables in Make are very important, especially for bioinformatic workflows.
Variables work very similarly to shell scripting. Like shell scripts, "arrays" are
simply variables with unescaped whitespace.

Variables can be assigned in two ways:

1. Using `=`, in which case the variable gets evaluated each time it appears in a rule.
2. Using `:=`, which evaluates on assignment.

Consider the following examples:

~~~~~~~~~~~~~~~~~~~~~
bar = Hello
foo = $(bar)
bar = $(ugh)
ugh = Huh?
~~~~~~~~~~~~~~~~~~~~~

Here `$(foo)` is equivalent to `Huh?`. Changing the assignment operators to `:=` makes 
`foo` equivalent to `Hello`. 

Generally, `:=` is going to be the correct operator to use, especially when stateful 
functions are being used.
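
For example, a minimal sketch of why this matters with the stateful `$(shell ...)` function (the variable name is hypothetical):

~~~~~~~~~~~~~~~~~~~~~
# with :=, the shell command runs once, at assignment
start_time := $(shell date)

# with = instead, the shell command would re-run every time $(start_time)
# is expanded, so the "start time" would drift over the course of the workflow
~~~~~~~~~~~~~~~~~~~~~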

Another confusing thing about Make is that some built-in functions are called
with the same syntax as variables. E.g. `$(join Hello, World)` is a function call and produces
a value equivalent to `HelloWorld`. If this is troubling, function calls can be
visually set apart from variables using curly brackets, e.g. `${join Hello, World}`, since
Make treats `$(...)` and `${...}` interchangeably.

Functions generally map over each element of an array. E.g. `${addsuffix .txt, file1 file2}`
produces the array: `file1.txt file2.txt`.
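
For instance, a minimal sketch (the sample names and paths are hypothetical) that maps a list of samples to their alignment files:

~~~~~~~~~~~~~~~~~~~~~
samples := sampleA sampleB sampleC

# expands to samples/sampleA/tophat/accepted_hits.bam, ... for each sample
bams := $(addprefix samples/, $(addsuffix /tophat/accepted_hits.bam, $(samples)))

# requesting this target builds every alignment
all_alignments : $(bams) ;
~~~~~~~~~~~~~~~~~~~~~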

### Dealing with multiple output files

Make was built on the assumption that a rule constructs only a single output file or target.
In software compilation this is an accurate assumption to make. But in bioinformatics, not so
much.

The most obvious approach is to use an output directory as a target. However, it's hard to
control the modification date of a directory. Anytime a file gets added, removed, or renamed in a
directory, the directory's modification date changes. Many programs (such as text editors)
will create and delete temporary files within a directory, thus updating the target's modification
time and making it appear as if downstream computations are out of date.

For isolated components of the workflow that produce many files, using directories as targets
may be a reasonable approach.

A better option is to choose one output file as a representative for the rule. Then create phony
rules (i.e. rules without commands) that depend only on the representative file. The representative
file should usually be the last file to be generated by a rule. For example, in the tophat rule,
`accepted_hits.bam` made a good representative not because it was the last file to be generated, but
because tophat would often silently fail while producing this file.
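
A minimal sketch of this pattern (the paths follow the tophat example above; the exact set of output files is hypothetical):

~~~~~~~~~~~~~~~~~~~~~
# accepted_hits.bam is the representative target produced by the tophat rule.
# unmapped.bam comes out of the same run, so it gets a command-less rule
# that merely points at the representative file:
samples/%/tophat/unmapped.bam : samples/%/tophat/accepted_hits.bam ;
~~~~~~~~~~~~~~~~~~~~~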

### Modifying rules

Make doesn't track changes made to the Makefile itself, so changing a parameter in a command will not
make a target out-of-date. We need to associate the rule with a dummy dependency file that can be
updated manually whenever its corresponding rule is modified. In traditional Make, upstream changes are
handled by a `clean` rule which deletes all intermediate files the Makefile generated.
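
For example, a minimal sketch of the dummy-dependency pattern together with a conventional `clean` rule (the `params/tophat.flags` file name is hypothetical):

~~~~~~~~~~~~~~~~~~~~~
# touch params/tophat.flags whenever the tophat options change;
# every alignment downstream of it then becomes out-of-date
samples/%/tophat/accepted_hits.bam : params/tophat.flags \
	samples/%/cleaned_reads/r1.fq.gz ;
	scripts/tophat.sh $(@D)

# the traditional escape hatch: delete intermediate files and start over
clean : ;
	rm -rf samples/*/tophat
~~~~~~~~~~~~~~~~~~~~~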


### Polymorphic Behavior Using Symlinks
