Statistics for Sequencing

No model, no inference

Why do we need to talk about statistics in a class on sequencing? We've already seen some places where statistical modeling might be useful: finding sites where a sample differs from the reference sequence or determining whether there is differential expression of genes. Here's where we think about what the various software such as samtools or cuffdiff are trying to do. So we know that statistical analysis can improve some of the inference we make from a sequencing experiment. In another sense, statistics are important because when you are doing a statistical test, or fitting a distribution, you are forced to make clear, explicit assumptions about the data and how they were generated. This is the heart of the phrase "No model, no inference", which basically says that in order to know what your data are saying, you have to have some kind of idea of how they behave. And statistical inference is the key.

Modern sequencing experiments produce an extremely large amount of data. It would be quite challenging to go through all of that data by hand to pick out the interesting stuff. So it's becoming more and more common to use complicated statistical methods to pick out the good stuff. However, it's very important to understand what these methods are doing, otherwise you might be misled.

There will be a bit of math in this lecture. Don't worry about following all the details, and I won't present anything too complicated.

Even though you will be analyzing RNA seq data, it's instructive to start by thinking about DNA sequencing. Imagine that you sequence N reads of length L from a genome of size G, where L is much much smaller than G, as is the case with short read sequencing. Assume that there are no sequencing biases: every single nucleotide is equally likely to be the place where a read starts. Now think about a fixed nucleotide, and imagine you want to know the probability of a read covering that nucleotide. Clearly, if the read has a 5' end that starts anywhere L bases 5' of the site, that site will be covered. This can clearly be seen in the figure below, where reads colored red do not include the site, while reads colored blue do include the site.


Because the whole genome is G long, that means that the probability of a given site being hit by a single read must be

Now, the coverage of that site should be the number of reads that end up including that site. So you want to know what fraction of the N reads you sequenced covered that site. The probability that n out of N reads hit that site is given by the binomial distribution, so we have that

Okay, here is where things get a bit weird. It turns out that if N is big enough and L/G is small enough, we can approximate the binomial distribution by a Poisson distribution. A Poisson distribution is a fundamental probability distribution and is characterized by one number: it's average. In the case of the above binomial distribution, the average is

so we can say that

Notice that C is exactly what we calculate when we say the "coverage" of a genome: the number of nucleotides sequenced, which is the number of reads times the average read length, divided by the number of nucleotides in the whole genome. The above equation gives some intuition into why you need relatively high coverage to make sure you hit every site. Moreover, it's very important to realize: just because the AVERAGE coverage is, say, 10x, that doesn't mean that every single nucleotide shows up in 10 reads. In fact, until you have relatively high coverage, there is a good chance that some nucleotides are just missing from your sequencing data! Below, I've plotted the probability that a site is completely missed against the coverage, and you can see that it remains above 1% until you get to about 4.5x coverage.

However, it's worth pointing out that this is only an approximation: the real process is messy! Below is a figure showing the coverage histogram you would expect compared to one from real data (~27x resequencing of a human). Clearly the real data has a bit more variability than you would expect if reads were coming uniformly at random from the genome.


RNA sequencing statistics

Going off of that background for DNA, we can get a picture of what we expect to happen with RNA seq. To simplify things, let's just think about a transcriptome with two transcripts in it. It wouldn't be hard to work out the theory for a larger number of transcripts, but this will keep things transparent. Now we have two transcripts, one of length L1 and the other of length L2. We need a bit more information, which is the number of copies of the transcripts floating around in the cell when you do the sequencing experiment. Let's say that those are E1 and E2.

Our interest here is a bit different than with DNA sequencing. Instead of wanting to know the probability that a given site is covered by a read, we want to estimate the number of reads that start on a given transcript. Let's think about a simple way to model this. First imagine the probability that a read starts on any given site in transcript 1. Certainly that probability is proportional to E1. But there are L1 sites in transcript 1. So the total probability that a read starts in transcript 1 should be proportional to E1*L1. Hence,

and by the same argument of a binomial distribution converging to a Poisson distribution, we see that the probability that a site in transcript 1 is covered by n reads is approximately Poisson distributed with average value

This is the origin of the Poisson model of variation between RNA seq replicates, that you can find in, say, Marioni et al (2008).

Let's take a look at that expression for C1 and we'll already see some important things that come up in the analysis of RNA-seq data. First of all, what we would like to have is the absolute expression level, E1, but it's all tied up with a bunch of other things. Let's take a look at the bottom of the fraction: In some sense, we can consider E1*L1 + E2*L2 the effective total expression in the cell: it's the total amount of expressed nucleotides. Similarly, E1*L1 is the effective expression level of transcript 1. So what we're really measuring in some sense is effective expression level, which is completely confounded by the length of a gene. Moreover, as should be obvious, the number of reads makes a difference: if you do more sequencing, counts of every gene should be higher!

This fact has motivated a common description of the expression level in RNA seq data: FPKM normalization, which stands for Fragments Per Kilobase (of transcript) per Million mapped reads. So, say that you have counted the number of reads, R, mapping to a particular gene with length L and you mapped N reads total, you calculate the FPKM as

Thus, the FPKM of a gene should be proportional to the expression level, but the proportionality constant is related to the effective total expression in the cell.

Biological variation

Everything we considered before can be considered "technical" variation. We assumed that, at the outset, the number of mRNAs corresponding to any transcript was fixed. This would make sense if we extracted RNA once and then just ran that same sample on multiple machines. However, in many cases, the variation between replicates you get from this kind of experiment will be less than the variation between replicates if you had extracted RNA multiple times and ran each of those samples separately. This should be obvious: say that you extract RNA today and then you extract RNA again tomorrow. Do you expect every single transcript to have the exact same number of mRNA molecules floating around in the cell on both days? Probably not!

The difference here is between technical replication and biological replication. This distinction is VERY important.
  • Technical replication is when the same RNA extract is used to run the experiment multiple times.
  • Biological replication is when different RNA extracts are used to run the experiment multiple times.

The problem with the Poisson model that we developed before is that it is characterized by a single number: the average number of reads mapping to a transcript. This means that the variation between experiments is only related to that number. In fact, the standard deviation, or width, of a Poisson distribution is equal to the square root of its average! To overcome this difficulty, a somewhat unmotivated but ultimately effective solution is to stop using a Poisson model, and instead use the negative binomial distribution. The negative binomial distribution can sort of look like a Poisson distribution, but can be "fatter". This "fatness" is measured by the dispersion parameter. When this parameter is equal to 0, there is no extra variation, and the negative binomial is equivalent to a Poisson. As the dispersion parameter gets large, the distribution becomes increasingly "fat". Below, I plotted an example of this for a negative binomial with mean 10 as the dispersion gets smaller. The solid line is a Poisson distribution with the same mean, while the points correspond to a negative binomial distribution with the given dispersion.


Thus, the negative binomial distribution provides a very robust way of accounting for biological variation. Numerous packages, including edgeR and DEseq, implement this model to control for biological variation between replicates.

Finding Differential expression

One of the key goals of many RNA-seq experiments is to identify genes that are differentially expressed between conditions. In fact, all this background we've developed is key to understanding how to test for differential expression. Let's say that you have two experimental conditions, and you have performed RNA seq in both of them and gotten counts of N1 reads mapping to the gene in condition 1 and N2 reads mapping to the gene in condition 2. Now you want to test every gene and ask if it is differentially expressed between condition 1 and condition 2. How can you do that?

For now, let's consider the Poisson model and ignore the negative binomial, and let's think way way back to the first statistics class you ever took. First, let's assume that that the total number of mapped reads and the total expression levels were the same between the replicates so that we can avoid issues of normalization. Then the null hypothesis is that the expression level is the same between the two conditions; in other words, that the average of the Poisson distribution is the same between the two conditions. The alternative hypothesis is that the expression level is different between the two conditions; in other words, that the average of the Poisson distribution is different between the two conditions. To summarize, if C1 is the average of the Poisson distribution in condition 1 and C2 is the average of the Poisson distribution in condition 2,

  • Null hypothesis: C1 = C2,
  • Alternative hypothesis: C1 != C2.

How do we test this? We make use of a very powerful statistical test called a likelihood ratio test. The likelihood of the data is simply the probability of the observed data. In general, the best way to estimate the parameters of a model (e.g. the average value of a Poisson distribution) is to choose the parameters that make the observed data have the highest probability. It turns out that with a Poisson distribution, setting the average of the Poisson distribution equal to the sample mean gives the data the highest probability. Now what you want to do is ask if the data from your two experiments have a higher probability if there is ONE mean between both experiments (i.e. they both have the same expression level) vs. if there are separate means for each experiment. It turns out that the "right" way to do this is to take twice the log ratio of the probability of the data under the alternative and the null hypothesis,

In our case, we use the fact that the best guess for the average of a Poisson is the sample average, (N1 + N2)/2, to get

while in the alternative case the same averages are simply the observed counts,

Once we have computed the likelihood ratio, we use a strange fact that is not at all obvious: it will be distributed as a chi-square with 1 degree of freedom. You should remember the chi square distribution from testing for Hardy-Weinberg equilibrium in introductory genetics. This means, for example, that if your likelihood ratio is greater than 3.84, your likelihood ratio is significant at the .05 level.

Let's briefly try it out with N1=50 and N2=30. (You can do this quickly in python!)

import sys
import math
N1 = float(sys.argv[1])
N2 = float(sys.argv[2])
Navg = (N1+N2)/2
Pnull = ((Navg**N1)*math.exp(-Navg)/math.factorial(N1))*((Navg**N2)*math.exp(-Navg)/math.factorial(N2))
Palt = ((N1**N1)*math.exp(-N1)/math.factorial(N1))*((N2**N2)*math.exp(-N2)/math.factorial(N2))
LRT = 2*(math.log(Palt)-math.log(Pnull))
print LRT
The numbers below come from the data you will be using in one of the exercises. Each row is the coverage for one gene in the control and in the experiment. Just like in the example above, for each row, you must consider whether those two numbers are significantly different from each other.
#control        experiment
687     614
598     616
434     441
675     666
568     627
222     235
610     677
740     700
Prelude to Exercises -- Installing Samtools

Before we start on the exercises, let's download a smaller fasta file to use in our exercises.

Fastq file:

Download the above genome and the fastq file. Try using the command 'wget'--Linux users should have it installed. Mac users may or may not--if it doesn't work, just download the file manually like usual.

Also, let's install samtools--this is a program that allows us to transform the .sam files mentioned yesterday. Follow these instructions. If you're a linux user, the easiest way to install samtools is to use sudo apt-get install samtools.
Otherwise, use the following set of commands.
$ cd ~/PythonCourse/ProgramFiles/
$ tar -jxf samtools-0.1.19.tar.bz2
$ cd samtools-0.1.19/
$ make
$ echo 'PATH=$PATH:~/PythonCourse/ProgramFiles/samtools-0.1.19/ ' >> ~/.bash_profile
$ source ~/.bash_profile

Type samtools into the terminal after all those commands and hopefully no error occurs! Let us know if it doesn't work.

We will be using the following set of commands to create a pileup file from the fastq file and the genome fasta file. The pileup file gives base-pair information at each chromosomal position. The main thing we will be using in the pileup file is the coverage column.

First, we will index the random_genome.fa:
$ bowtie2-build random_genome.fa random_index
$ samtools faidx random_genome.fa

Use bowtie2 to make a .sam file from the .fastq file
$ bowtie2 -x random_index -U practice.fastq -S practice_mapped.sam
Then, the following commands will turn the .sam file into a .pileup file:
$ samtools view -Sb practice_mapped.sam > practice_mapped.bam
$ samtools sort practice_mapped.bam practice_mapped.sorted
$ samtools mpileup -f random_genome.fa practice_mapped.sorted.bam > practice_mapped.pileup

The .pileup file has the following format (name, position, base pair, coverage, readbase matching, and base pair quality--see the pileup link for more details):
Random_genome   46      T       1       ^K.     P
Random_genome   47      T       1       .       P
Random_genome   48      G       1       .       P
Random_genome   49      T       1       .       P
Random_genome   50      G       1       .       P
Random_genome   51      A       1       .       P
Random_genome   52      C       1       .       P
Random_genome   53      A       1       .       P
You will follow the above steps after making two .fastq files to generate .pileup files, which you will use to get the coverage.


1. Simulate a simple sequencing experiment with 5000 100 base pair reads. Repeat for 10,000 100 base pair reads.

A. Using the random_genome.fa, write a script to simulate reads. I'll outline the pseudocode here:
  1. Read the genome in (use your fasta parser function!)
  2. Generate the 5' ends of each of the reads (Hint: use numpy.random.random_integers--what is the difference between this function and numpy.random.randint?)
  3. For i in 1 through the number of reads:
    1. Get the sequence starting from that integer and going 100 base pairs in the 3' direction
    2. Print each sequence in fastq format (give everything a quality score of P)
Simulate two different sets of 100 bp long reads. First, simulate 5000 reads. Then, simulate 100,000 reads. Save these two sets into two different files.

B. Use bowtie2 to map them to random_genome.fa, and then use samtools to create the .pileup files.

C. Write a script to print how many sites have 1x coverage, 2x coverage, etc. Calculate your average coverage and expected coverage for each simulation and compare them to each other.

2. Likelihood ratio test for differential expression.
Download the 2 attached files, which contain the counts of genes from simulated RNAseq experiments. In each file, there are two columns, corresponding to the counts in an "experimental" and a "control" group. In one of the files, the experimental and control groups have no differential expression, while in the other one, some of the genes are differentially expressed.

Read the data into python using numpy.loadtxt (look it up!).

Using the Poisson likelihood ratio test, determine which genes are differentially expressed at the 5% level. For each file, how many are there? How many would you expect, even if there is no differential expression? What fraction of the hits in the file with real differential expression do you think are actually differentially expressed?

NOTE: Try to do all these calculations using numpy arrays. You should not use a single for loop!
NOTE 2: You will not be able to compute the likelihood functions and then take the logs; you will have to take the logs first. So take the log by hand and input that. Remember the rules of logs, where division is subtraction and multiplication is addition! To get python to compute log(n!), use "from scipy import special" at the top of your script and compute "special.gammaln(n+1)"

No DE:

3. Wrong model, wrong inference.
Download the attached file, which has the counts of genes from 2 simulated RNA-seq experiments. None of these genes are differentially expressed, but they are all simulated using a negative binomial model, instead of a Poisson. Using the Poisson likelihood ratio test, determine which genes are differentially expressed. How many do you get? How many do you expect?

NOTE: You should be able to use the script you wrote in Exercise 2!



1. Simulate a simple sequencing experiment with 5000 100 base pair reads. Repeat for 10,000 100 base pair reads.
To simulate fastq files, I used the following script:

##Exercise 1##
import sequence_tools as st
import numpy as np
import sys
##python fastagenometoreadin lengthread numreads newfilename
##getreads(sequence to get reads from, # reads wanted, length of read)
##returns list of reads
def getreads(sequence, numreads,readlength):
        myind = np.random.random_integers(0,len(sequence)-readlength,numreads)
        myreads = [sequence[i:(i+readlength)] for i in myind ]
        return myreads
##makefastq(list of reads, newfilename to write to, "idname to put after @")
##returns string, makes new fastq file with reads
def makefastq(myreads,newfilename,header):
        newfile = open(newfilename,'w')
        num = 0
        for read in myreads:
                newfile.write(''.join(['P' for i in read])+'\n')
                num += 1
        return "Done making "+newfilename
mygenome = st.read_fasta(sys.argv[1])
myid = mygenome.keys()[0]
mygenome = mygenome[mygenome.keys()[0]]
mysims = getreads(mygenome,int(sys.argv[3]),int(sys.argv[2]))
print makefastq(mysims,sys.argv[4],myid)
I used my fasta parser in the module.
Alternatively, you can use the biopython fasta parser
from Bio import SeqIO
mysims = {}
for record in SeqIO.parse(sys.argv[1],format="fasta"):
        mysims[] = getreads(str(record.seq),int(sys.argv[3]),int(sys.argv[2]))
print makefastq(mysims,sys.argv[4])

Then, I typed into terminal:
$ python random_genome.fa 100 5000 random5000.fastq
$ python random_genome.fa 100 10000 random10000.fastq
Using the two new fastq files, I went through the commands to map the fastq file back onto the random_genome and convert the .sam file to a .pileup file. Your reference genome should already be indexed for bowtie and samtools from the lecture.

$ bowtie2 -x random_index -U random##.fastq -S random##_mapped.sam
$ samtools view -Sb random##_mapped.sam > random##_mapped.bam
$ samtools sort random##_mapped.bam random##_mapped.sorted
$ samtools mpileup -f random_genome.fa random##_mapped.sorted.bam > random##_mapped.pileup

Now, I wrote a script to take the count of the coverage for every position in the .pileup file.
import sys, collections
from Bio import SeqIO
##python "pileup file" lengthread numberreads "fasta file--to get total sequence length for expected covg"
myfile = open(sys.argv[1],'r')
L = float(sys.argv[2])
N = float(sys.argv[3])
mycovg = [int(line.split()[3]) for line in myfile]
mycovg = collections.Counter(mycovg)
for record in SeqIO.parse(sys.argv[4],format='fasta'):
        G = len(str(record.seq))
##Get missing no coverage positions
mymissing = G-sum(mycovg.values())
mycovg[0] = mymissing
print "The histogram values are:"
for i in mycovg:
        print i, mycovg[i]
obscovg = sum([mycovg[i]*i for i in mycovg])/float(sum(mycovg.values()))
expcovg = N*L/G
print "The observed coverage is",obscovg,"and the expected coverage is",expcovg
Running the files:
$ python random5000_mapped.pileup 100 5000 random_genome.fa
The histogram values are:
0 626
1 3811
2 8981
3 13314
4 17127
5 17433
6 14073
7 10985
8 6939
9 3376
10 1960
11 884
12 348
13 119
14 24

The observed coverage is 5.0 and the expected coverage is 5.0

$ python random10000_mapped.pileup 100 10000 random_genome.fa
The histogram values are:
0 15
1 136
2 301
3 954
4 2017
5 4055
6 6243
7 8791
8 11096
9 12573
10 12059
11 10935
12 9113
13 7533
14 5288
15 3430
16 2175
17 1572
18 940
19 521
20 177
21 60
22 16

The observed coverage is 10.0 and the expected coverage is 10.0

2. Likelihood ratio test for differential expression.

My script is:
import numpy as np
from scipy import special
import sys
##python "diff/nodiff.txt file" 3.84(5%significance)
myfilename = sys.argv[1]
cutoff = float(sys.argv[2])
myfile = np.loadtxt(myfilename)
navg = (myfile[:,0]+myfile[:,1])/2
navg = np.transpose(np.array([navg,navg]))  #Make array with same values in two columns (mean)
##Do computation across each n1 and n2 (or the same navg)
##The log of each probability can be written as (with a covg of N): N*log(N)-log(N!)-N
##If you apply it to the array, it performs this computation on every number--then I take the sum across
##rows since I'm multiplying the two probabilities for each covg value in my two conditions.
lognull = myfile*np.log(navg)-special.gammaln(myfile+1)-navg
pnull = np.sum(lognull,axis=1)
logalt = myfile*np.log(myfile)-special.gammaln(myfile+1)-myfile
palt = np.sum(logalt,axis=1)
##Calculate the Likelihood Ratio
LRT = 2*(palt-pnull)
##Count the number of rows where LRT > cutoff (3.84)
numsiggenes =np.where(LRT>cutoff)[0].shape[0]
print "For a 5% cutoff (LRT>3.84), there are", numsiggenes, "genes that look significant."
##Multiply 0.05 by the total number of genes to get expected number of genes you'd expect to be outlier by chance.
expected = 0.05*myfile.shape[0]
print "The expected number of outliers is:", expected
print "The percent deviation from expected is:", str((numsiggenes-expected)/expected*100)+'%'
print "The fraction actually differentially expressed is:", (numsiggenes-expected)/numsiggenes
Then, putting in nodiff.txt, I get:
$ python nodiff.txt 3.84

For a 5% cutoff (LRT>3.84), there are 44 genes that look significant.
The expected number of outliers is: 50.0
The percent deviation from expected is: -12.0%
The fraction actually differentially expressed is: -0.136363636364

$ python diff.txt 3.84

For a 5% cutoff (LRT>3.84), there are 133 genes that look significant.
The expected number of outliers is: 50.0
The percent deviation from expected is: 166.0%
The fraction actually differentially expressed is: 0.624060150376

3. Wrong model, wrong inference.

Plugging in the new file into my script from exercise 2, I get:
$ python nodif.nbinom.txt 3.84

For a 5% cutoff (LRT>3.84), there are 160 genes that look significant.
The expected number of outliers is: 50.0
The percent deviation from expected is: 220.0%
The fraction actually differentially expressed is: 0.6875

Despite the fact that our data is supposed to not show much differentiation between genes, we get a large number of genes that have a significant difference. That's because we assumed the expected probability distribution was Poisson, when in fact there was much more variance and it was a negative binomial distribution! Thus, we would come to the wrong conclusion here!