RNA-Seq and the Tuxedo Suite


Let's say we're interested in the level of transcription of one or more genes. There are several technologies in common use today, so let's take a moment to talk about them, and which ones you might choose, and when:

Exquisitely sensitive
One gene at a time. Must know gene sequence. Reasonable to scale up to ~dozens of genes x ~ dozens of conditions
in situ hybridization
Preserves spatial localization
One gene at a time. Must know gene sequence. Small dynamic range, prone to hybridization artifacts
Expression Microarrays
Thousands of probes. Cheap: ~ $100/chip
Non-linear output. Must know sequence. Limited resolution. Relatively large input requirements
Sensitive, linear output at high resolution. No a priori genome necessary. Lots of data. Very tiny input requirements.
Relatively expensive (~$800-1500/lane + ~$60/sample). Lots of data!
Can look at microRNA content, levels
Different bench approach from looking at genes. I don't know what software to point you to.

I put down "Lots of data" as both a pro and a con of RNA-seq, since it's often hard to make sense of the piles and piles of data you get, but you know what makes that easier? Programming! Good thing we've spent a week and a half learning that!

There are some commonly known biases to mRNA-seq data. For instance, GC rich fragments are going to be, on average, harder to amplify; there is some software that can make corrections to the final expression values you get based on the overall "mappability" of the sequence. In addition, the 5' ends of transcripts tend to be better represented; some of this is due to efficiency of the RT step, and is partially helped by the fragmentation, but it definitely seems like there's more to it than that. At any rate, some of these biases are corrected for in many of the tools available for RNA-seq analysis, but some are not and it's a good idea to keep that in mind.

It's worth pointing out that there are two major (non-mutually exclusive) approaches for analyzing RNA-seq data. The first is for generating a transcript inventory: determining which parts of a genome can be expressed as RNA in exons. If you're working with a common laboratory species, then chances are there's already a pretty well annotated inventory, or will be soon (which you might be able to get access to). The second, and probably more common application is determining differential expression: which parts of a genome are expressed at different levels in different samples. That last sentence was intentionally vague about which parts: you can look at genes, transcript isoforms, exons, or even (conceivably) individual base pairs. Quantifying eukaryotic transcript isoforms is the hardest, since you have to first determine which isoform a particular read comes from when it maps to an exon present in multiple isoforms.

We've already talked about Bowtie, which uses the Burroughs-Wheeler transform to do mapping really quickly. Now we're going to use a couple extra tools for doing RNA-seq expression analysis named Tophat and Cufflinks. Because of the somewhat whimsical names (which is not uncommon in software), the whole collection of tools is sometimes called the Tuxedo suite.

Before we get into the swing of things, let's take a few moments here to install the programs we'll need for the day. Cufflinks and Tophat are in your ProgramFiles packages and you install them similarly to Bowtie, Blast, and Samtools earlier this week.

$ cd ~/PythonCourse/ProgramFiles/
$ tar -xvzf cufflinks-2.2.1.OSX_x86_64.tar.gz
$ tar -xvzf tophat-2.1.0.OSX_x86_64.tar.gz

We'll also need to add in a line in our .bash_profile file:

$ echo 'PATH=$PATH:~/PythonCourse/ProgramFiles/bowtie2-2.2.5/' >> ~/.bash_profile
$ echo 'PATH=$PATH:~/PythonCourse/ProgramFiles/cufflinks-2.2.1.OSX_x86_64' >> ~/.bash_profile
$ echo 'PATH=$PATH:~/PythonCourse/ProgramFiles/tophat-2.1.0.OSX_x86_64 ' >> ~/.bash_profile

We'll also need BedTools, which was not included in your installer packages:

Download from here

$ tar -xvzf BEDTools.v2.17.0.tar.gz
$ cd bedtools-2.17.0/
$ make

And one more tool: bedGraphToBigWig, which you can get here (following the appropriate links for your OS).

$ chmod u+x bedGraphToBigWig

And make sure bedtools-2.17.0/ and the directory holding bedGraphToBigWig are in your PATH. Don't forget to source your bash_profile:

$ echo 'PATH=$PATH:~/PythonCourse/ProgramFiles/bedtools-2.17.0/bin/' >> ~/.bash_profile
$ echo 'PATH=$PATH:~/PythonCourse/ProgramFiles/' >> ~/.bash_profile

$ echo 'export PATH' >> ~/.bash_profile
$ source ~/.bash_profile

Introduction to experiment

Today we're going to analyze an RNA-seq experiment done in E. coli. I found these data by browsing the SRA for something that would be good for this course. The authors performed RNA-seq on E. coli after heat shock and after phosphorus starvation, in order to investigate how the transcriptome changes (reference: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3734233/).

Download the E. coli genome and annotation file:

Ensembl annotation

Finally, choose one of the following samples. What we'll do is have everyone work through one of the data sets, then combine the results.

HS_control: 37C control https://www.dropbox.com/s/gvzfcjjjinyjxt2/SRR915686.fastq.gz?dl=0
HS_15min: 15 min after heat shock (48C) https://www.dropbox.com/s/lo1kqyithwbedy4/SRR915687.fastq.gz?dl=0
HS_30min: 30 min after heat shock (48C) https://www.dropbox.com/s/rw7rx3hbrp76axd/SRR915690.fastq.gz?dl=0
HS_60min: 60 min after heat shock (48C) https://www.dropbox.com/s/bnmeka8e9epfd2g/SRR915692.fastq.gz?dl=0
MP_0h: 0 time point of phosphorus starvation https://www.dropbox.com/s/bjzmw214m4bzhd0/SRR915694.fastq.gz?dl=0
MP_2h: 2 hours of phosphorus starvation https://www.dropbox.com/s/rw7rx3hbrp76axd/SRR915690.fastq.gz?dl=0
MP_4h: 4 hours of phosphorus starvation https://www.dropbox.com/s/5vca9qfu5j7eby5/SRR915698.fastq.gz?dl=0


Let's say you're mapping things to the genome, and you find that suddenly, in the middle of a read, you jump several kb in the genome. If your read was of genomic DNA, this would probably tell you that the genome is pretty messed up. On the other hand, if you're working with RNA-seq data, probably all it means is that you've got a eukaryote and you hit a splice junction. Tophat is a read mapper that is aware of splice junctions, and is aware of lots of other subtleties that we might see in RNA-seq data. It actually uses Bowtie to do the mapping, but it wraps the output nicely so that we can just plug it into the next step in the pipeline, Cufflinks.

Even though our data is in E. coli, I'm going to run you through how to use Tophat since many people will need this for their eukaryotic organisms. To find out how Tophat takes its options, we can run it without any arguments:

$ tophat

which will give us a long list of data on how to use it. If we scroll up to the top, however, we see this message:

TopHat maps short sequences from spliced transcripts to whole genomes.
    tophat [options] <bowtie_index> <reads1[,reads2,...]> [reads1[,reads2,...]] \
                                    [quals1,[quals2,...]] [quals1[,quals2,...]]
So, like Bowtie, we can call it with a relatively few number of required things (in the greater than and less than symbols), and one of about a bazillion options, which there's some detail on below the main usage.

A typical Tophat run looks something like this:

$ tophat -o mapped_reads/ E_coli_index reads_1_1,reads_2_1 reads_1_2,reads_2_2

Let's break this down. We want Tophat to:

  • Put output in the folder called mapped_reads, which it will make if it doesn't already exist. This is optional, and will default to tophat_out.
  • We're using the index file E_coli_index which, if you remember, we made with bowtie2_build.
  • We've got four data files: reads1_1, reads_2_1, reads_1_2, and reads_2_2. In this example, we're using paired-end reads, although for our project, we only sequenced one end of the read, so there's no equivalent to reads_1_2 and reads_2_2.

For those of you clever enough to be working with well-studied model organisms, there's another option you probably want to keep in mind:

$ tophat --no-novel-juncs -G E_coli_transcripts.gtf -o mapped_reads/ E_coli_index reads_1_1,reads_2_1 reads_1_2,reads_2_2

This tells Tophat that you only want to use already annotated gene models. In principle, this should make it run ever-so-slightly faster, and you'll have better confidence in your data.

Tophat produces a few different files of output. For our purposes, the most important of these is accepted_hits.bam, which is a list of read alignments. It's in the binary version of the SAM format, which is not really readable by hand. You can use samtools to convert these to human readable format if you want, although that does take up quite a bit more space.

Often for RNA-seq data, we have something like 7-10 different files for each index that we want to run Tophat on. Now, it's doable to take each one and manually put a comma between them, but exercise 1 today will be to generate the command line string that actually runs the Tophat, even though we will not need it for today.

BedTools and BigWigs

Something that's useful for visualizing your data is a listing of the coverage for each base pair in the genome. This is pretty straightforward to generate, as long as we have the right tools installed. For us, the right tool is in the BedTools package, which will take the BAM file that Tophat gives us, and makes a BedGraph.

A BedGraph file is a 4-column text file that has the chromosome, the start and stop positions, and then the coverage at that position. To make that file, we use:

$ genomeCoverageBed -bga -ibam accepted_hits.bam > cov.bed

from within the folder that Tophat wrote the accepted hits into.

One more change we're going to make to this is to turn it into a BigWig file. BigWig is a non-human readable format, but it has the advantage that it's really fast for computers to look into, so if you have the right software, you can ask for the coverage at an arbitrary position, and don't have to stupidly read every line in the file in order to find the one line you're looking for.

To do that, we need a file that has the size of the chromosomes, which I've called chrom.sizes:

$ cat chrom.sizes

Chromosome 4639675
F 99159

Note that those are tabs between the name and the size.

$ bedGraphToBigWig cov.bed chroms.sizes cov.bw

If you have a genome viewer (like IGB, but not covered in this course), we could load that up and look at the actual transcription levels as a function of location. Another option that I use if possible (not all genomes are available) is the UCSC Genome Browser


(Use the unzipped index files you downloaded at the beginning of the section)

Exercise 1

Often, for RNA-seq data right off the sequencer, you'll have something like 7-10 different files for each sample. Now, it's doable to take each one and manually put a comma between them, but this exercise will be to generate the command line string that actually runs Tophat. It's not necessary for the E. coli data but will give you practice. Put your fastq file in its own directory and then use the function listdir in the os module to create a command string that you can then feed into the subprocessing module to run Tophat.

Run the command string you generated to map your reads! This may take a while, so make sure it works before you go to lunch, then leave your computer running. As a general rule, if Tophat makes it to the "mapping left_kept_reads" stage, it should be able to make it all the way through (possibly even "preparing reads" is fine). More often, it will simply fail right away.

import os
import sys
import subprocess as sp
allfiles = os.listdir(sys.argv[1])
bowtie_index = sys.argv[2]
gtf = sys.argv[3]
tophat = "tophat"
def orderFiles(list1, list2):
 ordered = []
 new2 = []
 for f1 in list1:
  flnm = f1.split("/")[-1].replace(".fastq", "")
  flnmL = flnm.split("_")
  sampl = "_".join(flnmL[0:-2])
  fnum = flnmL[-1]
  for fl in ordered:
   ok = 0
   for f2 in list2:
    flnm2 = f2.split("/")[-1].replace(".fastq", "")
    flnmL2 = flnm2.split("_")
    sampl2 = "_".join(flnmL[0:-2])
    fnum2 = flnmL2[-1]
    if fl == sampl2 + fnum2:
     ok = 1
   if not ok:
    print >>sys.stedd, "unexpected unpaired fastq file!!!"
    return list1, list2
 return list1, new2
for index in sys.argv[4:]:
 seqfiles1 = []
 seqfiles2 = []
 for filename in allfiles:
  if index in filename and filename.endswith('.fastq'):
   if "_R1_" in filename:
    seqfiles1.append(os.path.join(sys.argv[1], filename))
   elif "_R2_" in filename:
    seqfiles2.append(os.path.join(sys.argv[1], filename))
 if seqfiles2:
  seqfiles1, seqfiles2 = orderFiles(seqfiles1, seqfiles2)
 tophat_command = [tophat,
                   '--output-dir', index,
                   '-G', gtf,
 print " ".join(tophat_command)
 proc = sp.Popen(tophat_command)

Exercise 2 (optional)

Here's an optional exercise to get more practice. Using the operons file found here (operons.gff), let's figure out where all of the 3' UTRs are. This sounds simple, but if one operon comes right after another one, there's not much in the way of UTR available.

For this exercise, make another GFF file that has entries for operons, then entries for up to 300 bases of the 3' UTRs for each operon. The 3' UTRs should not overlap any operon, and any UTR that's less than 50 bp should not have an entry. The UTR should, of course, have the same strand as the operon it's downstream of.

gff_fname = 'operons.gff'
out_fname = 'features.gff'
def annotate_genome(gff_fname):
 gff_file = open(gff_fname)
 empty = ''
 annotation = []
 for line in gff_file:
  data = line.strip().split('\t')
  start = int(data[3])
  stop = int(data[4])
  desc = data[-1].split('"')
  operon_name = desc[desc.index('operonName ') + 1]
  while len(annotation) < start:
  for pos in range(start, stop):
   if len(annotation) + 1 == pos:
    print "BE A LERT! Your country needs more lerts", annotation[pos]
 return annotation
genome = annotate_genome(gff_fname)
GFF_base = '\t'.join(['Chromosome', 'MicrobesOnline', 'downstream_utr',
                      '%d', # low end position
                      '%d', # high end position
                      '.', '%s', # strand
                      '.', '%s utr_len %d;\n'])
out_file = open(out_fname, 'w')
for line in open(gff_fname):
 data = line.strip().split('\t')
 start = int(data[3])
 stop = int(data[4])
 strand = data[6]
 desc = data[-1]
 for i in range(300):
  if strand == '+':
   lo_pos = stop
   hi_pos = stop + i
   pos = hi_pos
  elif strand == '-':
   lo_pos = start - i - 1
   hi_pos = start
   pos = lo_pos
   assert False
  if lo_pos < 0 or hi_pos >= len(genome):
  if genome[pos]:
  genome[pos] = 'UTR'
 if i > 50:
  GFF_line = GFF_base % (lo_pos, hi_pos, strand, desc, i+1)