In Depth RNA-seq analysis

This morning, we left off having all the tools we'd need to do the RNA-seq analysis, and asked you to map the reads for your respective samples. Now we're going to combine our data, and look at what happens to gene expression during heat shock and phosphorus starvation.

First, let's check the results of Tophat. Whenever I have lab meeting with new sequencing data, I always have a standard table format, where I have
  • # of reads from each sample
  • # of mapped reads from each sample and % of reads that mapped
  • Some estimate of the library complexity: # of unique reads

If your Tophat run hasn't finished, you can get the accepted_hits.bam and align_summary.txt files below:

Take a look inside the align_summary.txt file of your mapped reads. Do you see anything to be concerned about?


It's all great to have the reads aligned to the genome, but there's still a few things we need to be aware of. First off, there's thousands of genes in the genome, and we would like to have a single number for each one. That's still a lot of data points, but it becomes tractable to deal with that number. The community has settled on RPKM (Reads per Kilobase per Million Reads) and FPKM (the paired-end equivalent, where one pair of reads is one fragment). This corrects for a few of the most obvious biases one might encounter in RNAseq. By dividing by the length of the gene, you correct for a longer transcript producing more sequencing reads than a shorter one; and by dividing by the number of reads, you correct for different sequencing runs having different numbers of reads, but potentially even being derived from the same library. While the term FPKM is still generally used, there are more robust ways for normalizing read depth between samples than using 'per million reads'. We'll discuss these a bit later.

Let's furthermore say that you are interested in whether differential splicing is happening in your sample. This is an even harder problem because calculating the FPKM for a particular isoform is difficult because you must somehow assign a read that aligns to a common exon to one of multiple isoforms.

Add in the fact that there are a number of biases present in the library preparation process, and you almost start to wonder why you thought RNAseq was a good idea in the first place. One program that deals with some of these issues is Cufflinks, the last major tool in the Tuxedo suite. Cufflinks does three main things: assemble reads into transcripts (if you care about novel isoforms), assign FPKM values to genes and isoforms, and perform differential expression tests. (see Cufflinks website for workflow:


This tool can assemble transcript structures from aligned reads. Importantly, it can find novel isoforms based on reads that cross splice junctions. We don't care about this for our E. coli dataset so we won't use it, but I'll quickly describe how you might run it for, say, human RNA-seq data where you're interested in finding new things. Cufflinks will also report FPKM values but I don't recommend using them because the normalization procedure is not ideal. Use Cuffquant/Cuffnorm to get FPKMs.

$ cufflinks
Usage: cufflinks [options] <hits.sam>

There are many options Cufflinks can take, including ones like -p (threads) and -o (output directory). You should check --library-type for your dataset. One option I recommend is -g/--GTF-guide <reference_annotation.(gtf/gff)> which uses the known transcript structure to assemble transcripts and help find novel isoforms. The output you'll use going forward would be transcripts.gtf which contains the assembled transcripts for the particular sample you ran Cufflinks on.


The program Cuffmerge merges the GTF files that Cufflinks generates. You first run Cufflinks on each sample independently and then use Cuffmerge to merge them all into one 'universal' transcript assembly to use for comparing sample in future analyses. You should also give Cuffmerge the reference annotation gtf and it will annotate the transcripts with the appropriate gene names, and classify isoforms as known, novel, etc. Cuffmerge is a wrapper script for the tool Cuffcompare (which can also be run independently). Check out the manual information for Cuffcompare to get descriptions of the Cuffmerge output files.

$ cuffmerge
cuffmerge [options]* <assembly_GTF_list.txt>

The options you probably want here are -g/--ref-gtf (the reference annotation) and -s/--ref-sequence (genome sequence used to remove artifacts).


The newest version of Cufflinks (which we just downloaded) has modularized the functions of the Cuffdiff to make the workflow more efficient. Cuffdiff was used to (and still can be) calculate FPKM values and perform differential expression testing. The part that takes a long time is getting the read counts for each isoform and gene (the FPK part), so now you can Cuffquant to do this once, and then run Cuffdiff or Cuffnorm for differential testing or just calculating FPKMs. This is the step we will start with today for our E. coli data.

$ cuffquant
Usage: cuffquant [options]* <annotation.(gtf/gff)> <aligned_reads.(sam/bam)>

Cuffquant takes a gtf. This can be the output of Cuffmerge if you want to use a novel transcript assembly, or a reference annotation which is what we'll use. It also takes the aligned read file and counts up the reads aligning to each transcript in the gtf. There are two options that help correct for known biases in RNA-seq data that I recommend you always use. They are -b/--frag-bias-correct (controls for GC and 5' UTR biases and requires genome sequence) and -u/--multi-read-correct (better weights reads mapping to multiple places in the genome). The output is a human non-readable abundances.cbx file which is then fed into Cuffnorm or Cuffdiff.


Cuffnorm does something very simple - it normalizes the read counts across samples to control for read depth and allow comparisons. Here is the step where you combine all your samples into one file.

$ cuffnorm
cuffnorm [options]* <transcripts.gtf> <sample1_replicate1.sam[,...,sample1_replicateM.sam]> <sample2_replicate1.sam[,...,sample2_replicateM.sam]>... [sampleN.sam_replicate1.sam[,...,sample2_replicateM.sam]]

Cuffnorm again takes in a gtf (either Cuffmerge output or reference annotation) and the reads. While the read files indicate they are SAM files, they should be the abundances.cbx files if you have them. Here you can take into account replicates to get FPKMs per condition if desired. There are three normalization methods available. 'classic-fpkm' is exactly as it sounds, but subject to outlier bias and other artifacts and is not recommended. The other two options are 'geometric' and 'quartile' and both should work. We can use the default 'geometric' today. The output of Cuffnorm is an easily parse-able tab file with expression values for all genes in all conditions. I recommend using the -L/--labels option to keep track of the different samples. You can use the condition names instead of the index names.


Finally we come to probably the most important tool: Cuffdiff. Cuffdiff performs sophisticated statistical testing in order to find significantly differentially expressed genes and isoforms. The output will contain, for each pair of conditions you input, and for each gene, the FPKM values in the conditions, the log(fold change) between the conditions, whether or not the gene had sufficient data for a test (I would ignore anything not tagged 'OK'), and the corrected p-value for significant differences.

$ cuffdiff
cuffdiff [options]* <transcripts.gtf> <sample1_replicate1.sam[,...,sample1_replicateM.sam]> <sample2_replicate1.sam[,...,sample2_replicateM.sam]>... [sampleN.sam_replicate1.sam[,...,sample2_replicateM.sam]]

You'll note the usage is pretty much the same as for Cuffnorm. Again, you should use abundances.cbx files instead of SAM files if you have them. If you're using SAM/BAM files, use the -u and -b options. Use the usual -o and -L options and use the same library type as for quantnorm.


Exercise 1: Find FPKM values for all genes

NOTE: All commands posted as an answer are just examples. You'll need to change things so that it'll run properly on your computer.


Run Cuffquant on the accepted_hits.bam output from Tophat using the Ensembl annotation . Remember to use the -b and -u options.

$ cuffquant -o HS_30min_cuffquant/ -b Escherichia_coli_str_k_12_substr_mg1655.GCA_000005845.1.19.dna.toplevel.fa -u Escherichia_coli_str_k_12_substr_mg1655.GCA_000005845.1.19.gtf HS_30min_accepted_hits.bam


Please download the abundances.cxb files below for either the heat shock experiment or the starvation experiment (your choice). Run Cuffnorm on all the samples for your experiment to get FPKM values for all the genes. Visualization and further analyses on this type of data will be covered on Friday.

Heat shock:

Phosphorus starvation:

$ cuffnorm -o HS_cuffnorm/ -L HS_control,HS_15min,HS_30min,HS_60min Escherichia_coli_str_k_12_substr_mg1655.GCA_000005845.1.19.gtf HS_control_cuffquant/abundances.cxb HS_15min_cuffquant/abundances.cxb HS_30min_cuffquant/abundances.cxb HS_60min_cuffquant/abundances.cxb

$ cuffnorm -o MP_cuffnorm/ -L MP_control,MP_2H,MP_6H Escherichia_coli_str_k_12_substr_mg1655.GCA_000005845.1.19.gtf MP_control_cuffquant/abundances.cxb MP_2H_cuffquant/abundances.cxb MP_6H_cuffquant/abundances.cxb

Here are the Cuffnorm files if you have trouble with Cuffquant or it takes too long:

Exercise 2: Find differentially expressed genes

A. Run Cuffdiff for your experiment (HS or MP). The gene_exp.diff file will have a p-value for every pairwise comparison for each gene. From that file, pull the relevant comparisons into their own file (ie. HS_control vs HS_15min, HS_control vs HS30min, etc).

$ cuffdiff -o HS_cuffdiff/ -L HS_control,HS_15min,HS_30min,HS_60min Escherichia_coli_str_k_12_substr_mg1655.GCA_000005845.1.19.gtf HS_control_cuffquant/abundances.cxb HS_15min_cuffquant/abundances.cxb HS_30min_cuffquant/abundances.cxb HS_60min_cuffquant/abundances.cxb

$ cuffdiff -o MP_cuffdiff/ -L MP_control,MP_2H,MP_6H Escherichia_coli_str_k_12_substr_mg1655.GCA_000005845.1.19.gtf MP_control_cuffquant/abundances.cxb MP_2H_cuffquant/abundances.cxb MP_6H_cuffquant/abundances.cxb
import sys
in_name = sys.argv[1]
outbase = sys.argv[1]
out_lines = {}
inp = open(in_name, 'r')
head = inp.readline().strip()
lines = inp.readlines()
for line in lines:
 line = line.strip()
 ln = line.split()
 cond1, cond2 = ln[4], ln[5]
 comp = cond1 + "-" + cond2
 if comp in out_lines:
  out_lines[comp] = [line]
for c in out_lines:
 out_file = outbase + "_" + c + ".txt"
 outp = open(out_file, 'w')
 print >>outp, head
 print >>outp, "\n".join(out_lines[c])

B. For each of the comparisons, how many genes are significantly differentially increased? How many change by 2-fold or 4-fold? How does the number of genes that increase compare to the number that decrease for each comparison?
import sys
in_file =sys.argv[1]
fh = open(in_file, 'r')
head = fh.readline()
lines = fh.readlines()
sigUp = 0
sigDown = 0
twoUp = 0
twoDown = 0
fourUp = 0
fourDown = 0
totalTested = 0
for line in lines:
 ln = line.strip().split()
 if ln[6] == 'OK':
  totalTested += 1
  ln[9] = float(ln[9])
  if ln[-1] == 'yes' and ln[9] > 0:
   sigUp += 1
  elif ln[-1] == 'yes' and ln[9] < 0:
   sigDown += 1
  if ln[9] > 1:
   twoUp += 1
   if ln[9] > 2:
    fourUp +=1
  elif ln[9] < -1:
   twoDown += 1
   if ln[9] < -2:
    fourDown += 1
print "tested:", totalTested
print "sig increase:", sigUp
print "sig decrease:", sigUp
print "2x increase:", twoUp
print "2x decrease:", twoDown
print "4x increase:", fourUp
print "4x decrease:", fourDown

Exercise 3: Venn diagrams

A. How many of the genes differentially expressed in long treatment samples are also differentially expressed in the short treatment samples? Does this give you more or less confidence in the results? Make sure that changing genes are going the same direction (up or down) in both experiments. As a hint, the set() datatype will likely be helpful here, since you can calculate the intersections and differences of two lists at a time.
import sys
def getSet(filename):
 genes1 = set()
 genes2 = set()
 fh = open(filename, 'r')
 head = fh.readline()
 lines = fh.readlines()
 for line in lines:
  ln = line.strip().split()
  if ln[6] == 'OK':
   ln[9] = float(ln[9])
   if ln[9] > 1:
   elif ln[9] < -1:
 return genes1, genes2
set_one_up, set_one_dn = getSet(sys.argv[1])
set_two_up, set_two_dn = getSet(sys.argv[2])
print "genes in file one: %s" % (len(set_one_up) + len(set_one_dn))
print "genes in file two: %s" % (len(set_two_up) + len(set_two_dn))
print "genes in both files %s:" % (len(set_one_up.intersection(set_two_up)) + len(set_one_dn.intersection(set_two_dn)))

B. Compare the overlap of differentially expressed genes in the heat shock treatment vs in phosphorus starvation. What might these genes be? Get the files with the other experiments results below. Pick what you think are the best time points.