Short read sequencing


The most important thing to understand about modern sequencing technology is that it produces short reads, typically between 50 to 150 base pairs. This is in contrast to Sanger sequencing, which could produce much longer reads (up to 1kb). And time goes on, technologies to read very long sequences are being developed (e.g. Pacific Biosciences) but they are not yet mainstream. So we'll focus on short reads.

The general idea behind short-read sequencing is that you take your long stretch of nucleic acids and chop it up:

short_read.png

These short fragments are easy to sequence in large batches and so what you get off the machine are "randomly" chopped up fragments of DNA.

What to do with short reads


The challenge with short read data is figuring out how it goes together. One option is de novo assembly. In this case, you chop up a bunch of copies of the same sequence and hope that the different fragments from different copies have some overlap, so you can figure out how to put them back together. We won't talk about this method today, although there are many approaches to de novo assembly. Many of these are based on a mathematical structure called a De Bruijn graph.

We will focus on aligning reads to a known reference genome. Note that the reference genome does not necessarily have to be a member of the same species, but it's helpful if the reference genome that you are mapping to is not too distantly related to our favorite organism.

Getting acquainted with the data


The data you get back from an Illumina sequencer comes in a format called fastq. Note the similarity to fasta: it's essentially a fasta file with a bit of extra information (specifically, the base quality). Each entry in a fastq file consists of four lines. Why four lines when it really makes sense to parse a line at a time? Who knows. An entry in a fastq file looks like this:
@HS2:202:D0YYKACXX:6:1106:16588:78868 1:N:0:GGCTAC
GTGTTGGCACTTCACGTGGACAGGGCGGGCAACCTGTCCCGAGTGACAGA
+
CBCFFFFFHHHHHIJJIHJJJJJJJJJJJGGHFFFFDEEEDDDBDCCCC3
The first line (with the "@" symbol) gives some information about the sequencer and a unique identifier for the read. The second line gives the actual sequence read. The third line can be a little variable: sometimes, after the "+" you'll have the same thing that followed the "@" in the first line, but in other cases (like the example here), you might not have anything. The fourth line are the so-called quality scores for each of the bases, and they bear a bit more discussion.

Sequencing machines don't read the sequence of DNA the same way we read the words on a page. Instead, they actually look at hundreds of pictures, each with millions of little colored dots, and try to guess the nucleotide at that position in the read based on the color. But it's only a guess! Thus, they try to provide some measure of the certainty of the base call. They are given by a simple formula:



where p is the probability that a base is called incorrectly (this number is somewhat mysterious). Thus the quality score divided by 10 gives the order of magnitude of the error probability: Q = 20 means that there is a 10^-2 = .01 probability of the base being called incorrectly, Q = 30 means that there is a 10^-3 = .001 probability that the base is called incorrectly, etc.

But wait! The quality score line has letters, not numbers! The reason for this is simple: 30, for example, takes two characters, but if we want a 1-to-1 correspondence between quality and a nucleotide, this won't work. So they apply one final transformation: first they add 33 to the quality value and then translate that into an ASCII character. Thus, to translate "C" (the quality of the first nucleotide) into a quality score, look up on the table the "Dec" (or DECimal) corresponding to C, and subtract 33. Thus, since C corresponds to 67, and 67-33=24, and the first nucleotide is relatively untrustworthy.

A quick note on how to read fastq files in python. The normal "for line in file:" type of loop won't work here, because each entry corresponds to four lines. So perhaps a better structure is an while loop where you break at the end of the file:
while True:
    id1 = fastq_file.readline().strip()
    if id1=="":
        break
    seq = fastq_file.readline().strip()
    id2 = fastq_file.readline().strip()
    qual = fastq_file.readline().strip()

Mapping by string matching


There are two things you need to start mapping to a reference genome:
  1. A bunch of short reads
  2. A reference genome
But what do you do with them? Think about what you would do if I gave you this reference sequence (in fasta format):
>the_best_chromosome_ever
TTAGTGTAAGTTCAGACCAATTCGTACTTCGTTCAGAACTCACATTTTAACAACAGAGGACACATGCCCTACCTCCATGATCTACTGACGTCCCTGAGGC
and this read:
@my_read
ATTCGTACTT
+
ZWTGBDSWEG
and asked you to align the read to the reference by hand. The first thing I would do is simply start at the beginning and see if it fits there:
ATTCGTACTT
TTAGTGTAAGTTCAGACCAATTCGTACTTCGTTCAGAACTCACATTTTAACAACAGAGGACACATGCCCTACCTCCATGATCTACTGACGTCCCTGAGGC
doesn't work, nor does
.ATTCGTACTT
TTAGTGTAAGTTCAGACCAATTCGTACTTCGTTCAGAACTCACATTTTAACAACAGAGGACACATGCCCTACCTCCATGATCTACTGACGTCCCTGAGGC
and so on, until I find that it fits somewhere:
...................ATTCGTACTT
TTAGTGTAAGTTCAGACCAATTCGTACTTCGTTCAGAACTCACATTTTAACAACAGAGGACACATGCCCTACCTCCATGATCTACTGACGTCCCTGAGGC
We can implement this in python using the .find method for strings and a try-except loop. For example, if the whole genome is a single chromosome,
my_genome = read_genome_fasta("genome.fa")
try:
    print my_read, my_genome.find(my_read)
except ValueError:
    print my_read, "unmapped"
However, this is a very labor intensive process! What if the sequence of the read had come from the very end of the chromosome? Clearly there must be a better way to do this.


Mapping by hashing


Remember how dictionaries can be used to look up the value associated with a particular key very quickly? We can use that to our advantage. Suppose that instead of checking through the entire genome every time, we instead put every k-mer (that is, sequence of k nucleotides) into a dictionary and for our read asked what position it had in that dictionary. But first we have to chose a good k to use. The first thought might be to make k equal to the length of your read; however, this is often unnecessary and dangerous, because later parts of the read might be more prone to errors. A good rule of thumb is to chose k so that 4^k is about equal to the genome size: this will make sure that most k-mers appear at most once in the genome. For example, if we have made a function called make_genome_dict that takes as arguments a genome and a k and returns a dictionary of all k-mers and their positions,
my_genome_dict = make_genome_dict(genome, 10)
if my_read in my_genome_dict:
    print my_read[0:10], my_genome_dict[my_read]
else:
    print my_read, "unmapped"
which should print
ATTCGTACTT 19
so it found that our read maps to position 19 (remember: 0 offset!) in the reference genome.One problem here is that there could be some issues with polymorphism: what if the second position in our read at a C instead of a T?


Mapping by indexing (a.k.a. what you'll do in real life)


The way that most mainstream read-mapping programs work nowadays is to do something even fancier than mapping by hashing, but instead build an index of the genome in a special way. One of the most common ways is called a "Burrows-Wheeler index", and is used by BWA (Burrows-Wheeler Aligner) as well as BoWTie (Burrows-Wheeler Transform, capitalization mine). Instead of thinking about how to build an aligner that works by indexing, we'll instead focus on using one of these. For this course, we'll use Bowtie2, a newer version of Bowtie, developed to more efficiently align longer reads (>50nts).

One of the most useful things is the Bowtie2 getting started page, which walks you through some basic usage. In short, there are 2 essential steps to mapping reads using Bowtie2:
1. Build an index of the reference genome
2. Map the reads, using the index
To build the index, you need to use the program bowtie2-build. This program takes very simple inputs:

$ bowtie2-build path/to/reference.fasta path/to/index

The results of a bowtie2-build command will be a series of files. If I am in a folder with the E. coli genome in a file called E_coli_genome.fasta, then I would build the index like so:

$ ls
E_coli_genome.fasta
$ head E_coli_genome.fasta
>Chromosome dna:chromosome chromosome:GCA_000005845.1:Chromosome:1:4639675:1
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTC
TGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGG
TCACTAAATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTAC
ACAACATCCATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGT
AACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAGCCCGCACCTGACAGTGCGGG
CTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAAGTTCGGCGGT
ACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC
AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTG
GCGATGATTGAAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAA
$ bowtie2-build E_coli_genome.fasta E_coli_index
<tons of output redacted>
$ ls
E_coli_genome.fasta E_coli_index.2.bt2 E_coli_index.4.bt2 E_coli_index.rev.2.bt2
E_coli_index.1.bt2 E_coli_index.3.bt2 E_coli_index.rev.1.bt2

All of those files that end in .bt2 are the indices that bowtie built. These files are not human-readable.

Finally, you need to map the reads using the index. A standard usage of bowtie to map the reads is

$ bowtie2 -x path/to/index -U path/to/reads.fastq -S path/for/mapped_reads.sam

Breaking that down, there are a few things:
  • -S tells Bowtie to output the reads in SAM format, which is a standard format for mapped reads. We'll talk about it later
  • --un path/for/unmapped_reads.fastq tells Bowtie where to put reads that it fails to map. It will simply puke their fastq entries to the path that is given after --un
  • path/to/index tells Bowtie where to find the index that you built using bowtie-build
  • reads.fastq is the file containing all your reads
  • > path/for/mapped_reads.sam tells Bowtie where to put all the reads that it maps.
There are a few extra options: for example, if you are working on a server or a computer with multiple cores you can put "-p" followed by a number to tell Bowtie how many cores to use. If you just type "bowtie2" on the command line you will get a list of options.

For example, in a directory with a bunch of E. coli reads,

$ ls
E_coli_reads.fastq
$ head E_coli_reads.fastq
@HS2:202:D0YYKACXX:6:1101:1464:2207 1:N:0:TTAGGC
ACTGAAAGCAAAATTTGCCGTAAGCAACGTTAACGGCCCAATCTCGCTGG
+
C@@FFFFFFHHHGJIJJIIIGIJIGJJIIJJJJJIJGIEIIIIJIJJJIB
@HS2:202:D0YYKACXX:6:1101:1594:2231 1:N:0:TTAGGC
CTGGAGTTGGGTTAAAAGCTGGCTGTGAGCATTAGAAATACGCTCAAGTT
+
B@CFFFFDHHHHHJIJJJJJJJJJJIJFHIGIJGICHIJJIIGIJJJIII
@HS2:202:D0YYKACXX:6:1101:1861:2229 1:N:0:TTATGC
CAGCTATCGCGATTGCAGTGGCACTGGCTGGGTTCGCTACCGTAGCGCAG
$ bowtie2 --un unmapped.fastq -x E_coli_index -U E_coli_reads.fastq -S E_coli_mapped_reads.sam
1000000 reads; of these:
1000000 (100.00%) were unpaired; of these:
20247 (2.02%) aligned 0 times
929604 (92.96%) aligned exactly 1 time
50149 (5.01%) aligned >1 times
97.98% overall alignment rate


Bowtie outputs some very important statistics: the fraction of aligned reads and the fraction of unaligned reads. Here we can see that we probably mapped the right reads to the right organism: almost 98% of reads aligned. Why did 2% of reads fail to align? It could be a number of reasons, including (but not limited to) extremely error-filled reads, contamination, or an excess of polymorphism compared to the reference.

Let's take a quick look inside the SAM file:

@HD VN:1.0 SO:unsorted
@SQ SN:Chromosome LN:4639675
@PG ID:bowtie2 PN:bowtie2 VN:2.1.0
HS2:202:D0YYKACXX:6:1101:2456:2149 16 Chromosome 4505497 1 50M * 0 0 TCAATGCAACACCCCTTTCAATTATCTCTTTCGGTGTTTTGAACTTCAGN JJJJJJJIGFBJJJJJJIJJJJJJIIJJJJJJJJJJJHHHHHFFFDD=1# AS:i:-1 XS:i:-1 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:49T0 YT:Z:UU
The lines tagged "@SQ" give the names of the chromosomes and their lengths. In this case, we have just the main E. coli chromosome. On lines 4 and higher, we have the information we're really interested in: where the reads mapped.

The first field is the read name, from the "@" line of the fastq file. The second field is an arcane bitwise flag field, which I will explain shortly. The third field is the name of the chromosome to which the read mapped: in E. coli this is a kind of useless but in organisms with more than one chromosome this is a very important thing to know! Then comes the position on the chromosome. It's important to note that this is always the position relative to the first position in the reference fasta file. The next few fields aren't terribly important, until you get to the sequence of the read and then the quality scores, followed by another set of relatively unimportant fields. If you are interested, you can check out the full specification of the SAM file format.

A few words on the bitwise flag. Basically, it is a compact way to express a bunch of things that are either true or false about the read; for example, whether the read aligned only once or not, whether the read is on the + strand or not, etc. Since something that is true or false can be expressed as either 1 (if true) or 0 (if false) you can represent the status of a bunch of these kinds of things as a binary number.

Because this isn't a course on binary numbers, we'll focus on one of the important things, which is whether the read mapped to the plus strand or the minus strand. Basically, and simply, for a read that maps to the minus strand, then when you take the bit-wise AND of the flag for that read and the number 16, it should be greater than 0. Don't worry about what that means, but in python it is easy to implement. For example, if the flag is "16" for a given read, then

>>> 16&16
16

while if the flag is, say, 4

>>> 16&4
0

Because 0 is always False and a number greater than 0 is always True, you can use a simple if statement if you need to do different things to the plus and minus strand.

if flag & 16:
    #do stuff that you would do if the read mapped to the minus strand
else:
    #do stuff that you would do if the read mapped to the plus strand.

Feel free to ask about what is actually going on here, but for this course you should only need to use it to this complexity.

And that is the basics of mapping reads!


Exercises

(half the class do #2, half do #3)

Download these data:
ftp://ftp.ensemblgenomes.org/pub/bacteria/release-19/fasta/bacteria_22_collection/escherichia_coli_str_k_12_substr_mg1655/dna/Escherichia_coli_str_k_12_substr_mg1655.GCA_000005845.1.19.dna.toplevel.fa.gz
https://www.dropbox.com/s/onosn5z1zgbh391/E_coli_reads.fastq.gz

1. Properties of reads
Before you even map any reads, it's important to understand some interesting properties about Illumina data:
  1. Compute the average quality score for each position in a read. Hint: you will need to use the function ord() to turn ASCII characters into decimal numbers
  2. Compute the average frequency of each of the 4 nucleotides, as well as the missing character "N" at each position in the read
Does what you see surprise you?

import sys
import collections
 
#open the fastq file
reads = open(sys.argv[1],"r")
 
#loop through the fastq file
initialized = 0
while True:
 #read the details of the reads
 id = reads.readline().strip()
 if id == "":
  break
 seq = reads.readline().strip()
 id2 = reads.readline().strip()
 qual = reads.readline().strip()
 #learn the length of the reads
 if initialized == 0:
  quality = []
  nucs = []
  for i in range(len(seq)):
   quality.append(0)
   nucs.append(collections.Counter())
   initialized = 1
 #get the nucleotides and the quality scores
 for i, nuc in enumerate(seq):
  nucs[i][nuc]+=1
  qual_score = ord(qual[i])-33
  quality[i] += qual_score
 
print "A\tC\tT\tG\tN\tqual"
for i in range(len(nucs)):
 total_count = sum(nucs[i].values())
 for nuc in ["A","C","G","T","N"]:
  print str(float(nucs[i][nuc])/total_count) + "\t",
 print float(quality[i])/total_count


2. Mapping by string searching
Build a basic mapping tool that will search for each of the reads in the E. coli genome and output the position where they match the E. coli genome. You will need to
  1. Read in the genome in fasta format (you should be able to do this with a module from last week)
  2. Search each chromosome in the genome for the read

Running this program will take a very long time. To debug it, you may want to use an even smaller subset of the data than I provided, by using the head command, for example. When you are running it on the total dataset, I suggest running it in the background while you do the next problem.

When you run the program, make sure to use the time command on the terminal to see how long it takes, e.g.

$ time python my_mapper.py

WARNING: make sure to search both the + and - strands!

import sys
 
complement = {"A": "T", "C": "G", "G":"C", "T":"A"}
 
def reverse_complement(seq):
 return(''.join(map(lambda x: complement[x], seq))[::-1])
 
def read_fasta(file):
 fasta_file = open(file,"r")
 sequences = {}
 for line in fasta_file:
  if line[0] == ">":
   cur_name = line.strip()[1:]
   sequences[cur_name]=[]
  else:
   sequences[cur_name].append(line.strip())
 for key in sequences:
  sequences[key] = ''.join(sequences[key])
 return sequences
 
fasta_file_path = sys.argv[1]
reads_path = sys.argv[2]
 
my_genome = read_fasta(fasta_file_path)
my_reverse_genome = {}
for chrom in my_genome:
 my_reverse_genome[chrom] = reverse_complement(my_genome[chrom])
sys.stderr.write("Done reading genome!\n")
 
reads = open(reads_path,"r")
 
read_num = 0
 
while True:
 read_num += 1
 if read_num % 10000 == 0: sys.stderr.write('%i\n'%read_num)
 id1 = reads.readline().strip()
 if id1 == "":
  break
 seq = reads.readline().strip()
 id2 = reads.readline().strip()
 qual = reads.readline().strip()
 #go over every chromosome and find if it's in there
 mapped = 0
 for chrom in my_genome:
  forward = my_genome[chrom].find(seq)
  if forward != -1:
   print seq, "+", chrom, forward
   mapped = 1
   break
  reverse = my_reverse_genome[chrom].find(seq)
  if reverse != -1:
   print seq, "-", chrom, reverse
   mapped = 1
   break
 
 if mapped == 0: print seq, ".", ".", "."


3. Mapping by hashing
Build a basic mapping tool that will hash the E. coli genome into a dictionary (of arbitrary size k-mers) and then use that dictionary to map the reads to the E. coli genome. Using k = 11, time the program. How long did it take? How does it compare to mapping by string searching? You will need to
  1. Read in the genome in fasta format
  2. Make a dictionary with every k-mer in the genome
  3. Search the dictionary for the first k nucleotides of every read

WARNING: make sure to hash both the + and - strands!

import sys
 
complement = {"A":"T","C":"G","G":"C","T":"A"}
 
def reverse_complement(seq):
 return(''.join(map(lambda x: complement[x], seq))[::-1])
 
def read_fasta(file):
 fasta_file = open(file,"r")
 sequences = {}
 for line in fasta_file:
  if line[0] == ">":
   cur_name = line.strip()[1:]
   sequences[cur_name]=[]
  else:
   sequences[cur_name].append(line.strip())
 for key in sequences:
  sequences[key] = ''.join(sequences[key])
 return sequences
 
def hash_genome(genome,k):
 genome_hash = {}
 #loop over chromosomes
 for chrom in genome:
  chrom_len = len(genome[chrom])
  genome_hash[chrom] = {}
  for i in range(chrom_len-k):
   if genome[chrom][i:(i+k)] in genome_hash:
    genome_hash[chrom][genome[chrom][i:(i+k)]].append(i)
   else:
    genome_hash[chrom][genome[chrom][i:(i+k)]]=[i]
 return genome_hash
 
fasta_file_path = sys.argv[1]
reads_path = sys.argv[2]
k = int(sys.argv[3])
 
my_genome = read_fasta(fasta_file_path)
my_reverse_genome = {}
for chrom in my_genome:
 my_reverse_genome[chrom] = reverse_complement(my_genome[chrom])
 
sys.stderr.write("Done reading genome!\n")
 
my_hashed_genome = hash_genome(my_genome,k)
my_hashed_reverse_genome = hash_genome(my_reverse_genome,k)
 
sys.stderr.write("Done hashing genome!\n")
 
reads = open(reads_path,"r")
 
while True:
 id1 = reads.readline().strip()
 if id1 == "":
  break
 seq = reads.readline().strip()
 id2 = reads.readline().strip()
 qual = reads.readline().strip()
 
 #go over every chromosome and find if it's in there...
 mapped = 0
 for chrom in my_hashed_genome:
  if seq[0:k] in my_hashed_genome[chrom]:
   print seq, "+", chrom, my_hashed_genome[chrom][seq[0:k]]
   mapped = 1
   break
  elif seq[0:k] in my_hashed_reverse_genome[chrom]:
   print seq, "-", chrom, my_hashed_reverse_genome[chrom][seq[0:k]]
   mapped = 1
   break
 if mapped == 0: print seq, ".", ".", "."


4. Mapping by indexing
First, install bowtie. Do this by following a similar set of commands to that you did to install blast.

Using bowtie2-build and bowtie2, map the reads to the E. coli genome. How long did it take? Count the number of reads that mapped to the plus strand vs. the minus strand. Is it what you expect?

Resulting file if you have trouble with bowtie:
https://www.dropbox.com/s/56q84wk0g3t54pm/E_coli_mapped_reads.sam

CHALLENGE: Download the linked set of reads (supposedly) coming from E. coli. What fraction of reads mapped? Do you think something went wrong? What could have gone wrong (hint: RNAseq experiments are prepared by humans...)? Figure out a way to test your hypothesis and test it!

https://www.dropbox.com/s/ogg5kqj9wjnvl9b/weird_reads.fastq.gz

$ /YOUR/PATH/PythonCourse/ProgramFiles/bowtie2-2.2.5/bowtie2-build Escherichia_coli_str_k_12_substr_mg1655.GCA_000005845.1.19.dna.toplevel.fa E_coli_index

$ /YOUR/PATH/PythonCourse/ProgramFiles/bowtie2-2.2.5/bowtie2 --un unmapped.fastq -x E_coli_index -U E_coli_reads.fastq -S E_coli_mapped_reads.sam

$ /YOUR/PATH/PythonCourse/ProgramFiles/bowtie2-2.2.5/bowtie2 --un unmapped2.fastq -x E_coli_index -U weird_reads.fastq -S E_coli_mapped_reads2.sam
1000000 reads; of these:
1000000 (100.00%) were unpaired; of these:
890739 (89.07%) aligned 0 times
84211 (8.42%) aligned exactly 1 time
25050 (2.50%) aligned >1 times
10.93% overall alignment rate

import sys
 
mapped_reads_path = sys.argv[1]
 
mapped_reads = open(mapped_reads_path,"r")
 
plus_count, minus_count = 0, 0
 
for line in mapped_reads:
 if line[0] == "@":
  continue
 line = line.strip().split("\t")
 if int(line[1]) & 16:
  minus_count += 1
 else:
  plus_count += 1
 
print "plus strand reads %s" % (plus_count)
print "minus strand reads %s" % (minus_count)