SNP calling: finding polymorphism


In our current project, we are not interested per se in finding where the sample sequence differs from the reference sequence. However, in many kinds of experiments and analyses, identifying variable sites is a key step in the analysis pipeline. For example, you might be conducting an experiment to find the genetic basis of a mutant phenotype, in which some samples have a mutant phenotype (let's call these guys cases) and other samples don't (let's call these guys controls). One option in this case is to do an association study, which looks for all the variable sites where one allele is enriched in the cases and the other allele is enriched in the controls. Another, increasingly common application of sequencing data is to look at sequence variation in natural populations. This kind of data can be extremely informative about the evolutionary and demographic history of a species.

Errors or SNPs?


If you were guaranteed that every read were 100% accurate, there would be no SNP calling problem. However, because some of the calls that map to a particular site may be errors, we need some way of figuring out whether or not the site is truly a SNP.

Let's first consider a haploid organism. This is both most relevant to this class, since the data we have is from E. coli, and an easy place to start. Let's take a look at a little bit of a reference sequence and then the reads that map to it. I'm going to write the data in a similar form to what's called a pileup file; you'll see a proper one later.


A AAAAACAAA
G CCCCCGCCC
C CCCAGCAGG

The first column is the reference allele while the second column summarizes the nucleotides from the reads. For example, in the first row, the reference allele is an A and you obtained 8 reads that had an A at that position and 1 read with a C at that position.

The first row is relatively easy: the reference allele is an A, and 8 out of 9 reads are A. Probably that C is an error, and we would think that this site does not have a SNP in this individual.

The second row is the opposite: the reference allele is G, but only 1 out of the 9 reads is a G. Instead, because every other read has a C, we might think that this site is a SNP and this sample has a C, instead of a G.

The last row is a hard case: the reference is a C, but only 4 out of the 9 reads are a C. However, unlike before, the other reads are roughly evenly split between 3 Gs and 2 As. What do you do in this case?

One approach is to adopt a statistical framework that attempts to estimate error rates and makes use of all the data. This is relatively complicated and we won't talk about it here. Another approach makes implicit use of prior knowledge you may have. First of all, in a haploid you expect only 1 allele at any site; so if there aren't enough reads with the same nucleotide, it might be wise to ignore that site. Second, assuming that your reference and your sample are not too diverged, you should expect SNPs to be relatively rare. Even between humans and chimp you expect only about 1 site in 100 to be different! So you probably want to put more weight on a site not being a SNP. Then, you can simply choose an arbitrary cutoff and declare a site a SNP or not.
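The cutoff approach can be sketched in a few lines of Python. The function below is a toy, not a real caller; the depth and fraction thresholds are arbitrary placeholders, chosen only so that the three example sites above come out the way we called them by eye:

```python
def call_haploid_site(ref, bases, min_depth=5, min_fraction=0.8):
    """Naive cutoff-based haploid call: return the consensus base if it is
    supported by enough reads, else None (site ignored)."""
    if len(bases) < min_depth:
        return None  # too little coverage to say anything
    best = max(set(bases), key=bases.count)
    if bases.count(best) / len(bases) < min_fraction:
        return None  # too much disagreement among the reads
    return best

# The three example sites from above:
print(call_haploid_site('A', 'AAAAACAAA'))  # 'A'  -> matches reference, not a SNP
print(call_haploid_site('G', 'CCCCCGCCC'))  # 'C'  -> differs from reference, a SNP
print(call_haploid_site('C', 'CCCAGCAGG'))  # None -> no confident call
```

Note that the function only reports a consensus base; deciding whether a site is a SNP still means comparing that base against the reference allele.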

A diploid organism can be much harder, because you have to deal with heterozygous sites. In a perfect world, your reads would come equally from both copies of the chromosome, but that isn't always the case. Consider the following sites:
G GCGCCGGCCC
T GGGTGGGTGG
C CCATGCGAT

The first site has 4 Gs and 6 Cs. This is reasonably close to 50% G and 50% C, exactly what you would expect from a heterozygous site. But what about the second site? It has only two Ts but eight Gs. Is that possibly a heterozygous site, or a site homozygous for the non-reference allele? The final site is once again a mess.

Again, a statistical framework can estimate error rates, sequencing bias, and model the process that assigns alleles to reads. However, another option is to again make arbitrary, but informed, cutoffs.
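One toy version of such cutoffs, with thresholds that are purely illustrative, applied to the three diploid sites above:

```python
def call_diploid_site(ref, bases, min_depth=8, hom_fraction=0.8):
    """Toy diploid genotype caller using count cutoffs.
    If the top allele dominates, call the site homozygous; if the top two
    alleles together dominate, call it heterozygous; otherwise make no call.
    The cutoffs are arbitrary, for illustration only."""
    if len(bases) < min_depth:
        return None
    counts = sorted(((bases.count(b), b) for b in set(bases)), reverse=True)
    top_n, top = counts[0]
    if top_n / len(bases) >= hom_fraction:
        return (top, top)                      # homozygous call
    second_n, second = counts[1]
    if (top_n + second_n) / len(bases) >= hom_fraction:
        return tuple(sorted((top, second)))    # heterozygous call
    return None                                # messy site, no call

print(call_diploid_site('G', 'GCGCCGGCCC'))  # ('C', 'G') -> heterozygous
print(call_diploid_site('T', 'GGGTGGGTGG'))  # ('G', 'G') -> homozygous non-reference
print(call_diploid_site('C', 'CCATGCGAT'))   # None      -> no confident call
```

Notice how the ambiguous second site (8 Gs, 2 Ts) lands exactly on the 0.8 cutoff: small changes to an arbitrary threshold flip the call, which is exactly why statistical methods are preferred for real data.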

Pre-packaged SNP calling


There are a number of software packages that will perform SNP calling for you. One of the simplest to use is samtools. Samtools uses a relatively simple command line that takes as input an indexed fasta file and a sorted bam file (we'll talk about those in a moment). The general framework is described in the samtools documentation.

The first thing you need to do to use samtools is index your fasta file using samtools faidx:

$ samtools faidx E_coli_genome.fasta
$ head E_coli_genome.fasta.fai
Chromosome 4639675 77 60 61

As you can see, running samtools faidx results in a file with the same name as the input fasta but with ".fai" tagged on the end of it. The faidx file simply lists, for each sequence, its name, its length, the byte offset in the file where the sequence begins, the number of bases per line, and the number of bytes per line. Using this information it is quite easy to get an arbitrary sub-sequence if you know the start and end positions of that subsequence.
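That lookup can be sketched directly from the .fai columns. A toy example (assuming unix-style one-byte line endings, as in the entry above where each 60-base line occupies 61 bytes):

```python
def fai_offset(offset, linebases, linewidth, pos):
    """Byte offset of the base at 1-based position `pos`, given the .fai
    fields for a sequence: `offset` is where the sequence starts in the
    file, `linebases` is bases per line, `linewidth` is bytes per line."""
    zero = pos - 1  # convert to 0-based
    return offset + (zero // linebases) * linewidth + (zero % linebases)

# Using the E. coli entry above (offset 77, 60 bases per 61-byte line):
print(fai_offset(77, 60, 61, 1))   # 77  -> the very first base
print(fai_offset(77, 60, 61, 61))  # 138 -> first base of the second line
```

With this offset in hand, a program can seek() straight to any position without reading the whole file, which is exactly what samtools does.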

The other task is to generate a sorted bam file with your reads. A bam file is simply a binary version of a sam file, and samtools plays more nicely with them. To generate a bam file from a sam file of mapped reads, we have to first convert the sam file into a bam file and then sort it. To convert the sam file into a bam file, use samtools view:

$ samtools view -Sb E_coli_mapped_reads.sam > E_coli_mapped_reads.bam
[samopen] SAM header is present: 2 sequences.

These bam files are NOT human-readable. The -Sb option tells samtools view that the input is in SAM format (-S) and that it should output in BAM format (-b). To sort the bam file, use samtools sort:

$ samtools sort E_coli_mapped_reads.bam E_coli_mapped_reads.sorted

which results in a file E_coli_mapped_reads.sorted.bam. The first argument is the input bam file and the second argument is the prefix for the output bam file (in other words, the output will be called prefix.bam).

Now we can finally call SNPs! We do this using a series of programs. The first one, samtools mpileup, will be used to generate a bcf file. However, first it's instructive to take a brief look at a pileup file:

$ samtools mpileup -f E_coli_genome.fasta E_coli_mapped_reads.sorted.bam > E_coli_mapped_reads.pileup
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000

The -f option simply means that the reference genome is given as a fasta file. Let's take a brief look inside E_coli_mapped_reads.pileup (note the below is an idealized example, not pulled from the alignment):

Chromosome 21150 G 17 c,,,,,,,,,,,,,,,^~, +JIIJJGGJJH@JCJHC
Chromosome 21151 T 17 ,,,,,,,,,,,,,,,,, AIJGIJBFJJE;J>JGF
Chromosome 21152 C 17 ,,,,,,,,,,,,,,,,, GGJJJJGHJJJDJEJHH
Chromosome 21153 A 17 ,,,,,,,,,,,,,,,,, EJJJJJIEJJJIJHJCH
Chromosome 21154 A 17 ,,,,,,,,,,,,,,,,, >IIIJJIJJIJEJDJFE
Chromosome 21155 T 17 ,,,,,,,,,,,,,,,,, FJIIJJJIJIJEJ@IGF

The first column is the chromosome, followed by the position, the reference allele and the coverage. The next two columns tell you about the reads at that position: "." signifies a read matching the reference on the forward strand, while "," signifies a match on the reverse strand; similarly, upper- and lower-case letters indicate a read with a non-reference allele on the forward or reverse strand, respectively. There are a few other characters (for example, "^" marks the beginning of a read), but this was just meant to be a short detour rather than a full exploration of the pileup format.
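Tallying the read-base column is mostly a matter of walking through the string character by character. A minimal sketch that handles only the symbols discussed here (matches, mismatches, read starts and ends), not indels or the rest of the full format:

```python
def count_bases(ref, read_bases):
    """Tally the read-base column of a pileup line into {base: count}.
    Handles matches (. and ,), mismatches (ACGT/acgt), read starts
    ('^' followed by a mapping-quality character) and read ends ('$')."""
    counts = {}
    i = 0
    while i < len(read_bases):
        c = read_bases[i]
        if c == '^':      # read start: skip '^' AND the quality character after it
            i += 2
            continue
        if c == '$':      # read end: no base associated with this character
            i += 1
            continue
        base = ref.upper() if c in '.,' else c.upper()
        counts[base] = counts.get(base, 0) + 1
        i += 1
    return counts

# The first pileup line above: one mismatching c plus sixteen matches
print(count_bases('G', 'c,,,,,,,,,,,,,,,^~,'))  # {'C': 1, 'G': 16}
```

The counts sum to 17, matching the coverage column of that line, which is a handy sanity check when parsing real pileup output.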

However, for actual SNP calling we need this to be in bcf format. This is done using an almost identical command line, the only difference being that the options should be "-Suf" instead of simply "-f". The "S" tells samtools to compute per-site strand bias values (strand bias is a phenomenon in which the genotype information obtained from reads on different strands disagrees), and the "u" tells it to output uncompressed BCF rather than a text pileup.

$ samtools mpileup -Suf E_coli_genome.fasta E_coli_mapped_reads.sorted.bam > E_coli_mapped_reads.bcf
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000

Now, we use bcftools view to do all the genotype calling. bcftools uses a statistical method to call SNPs.

$ bcftools view -vg E_coli_mapped_reads.bcf > E_coli.vcf
[afs] 0:279.924 1:4.392 2:3.684

The options -vg tell it to output only variable sites and to call genotypes. Calling genotypes is a bit weird for E. coli, since it's haploid, but this can be a good practice: sites that are called heterozygous might be indicative of low-quality sites or of a culture that is not clonal. The output is stored in a vcf file. "vcf" stands for variant call format, and it is a compact text format that stores only the variant information. Taking a look inside, first there is a long header:

##fileformat=VCFv4.1
##samtoolsVersion=0.1.17 (r973:277)
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square mapping quality of covering reads">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability of all samples being the same">
##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele frequency (assuming HWE)">
##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele count (no HWE assumption)">
##INFO=<ID=G3,Number=3,Type=Float,Description="ML estimate of genotype frequencies">
##INFO=<ID=HWE,Number=1,Type=Float,Description="Chi^2 based HWE test P-value based on G3">
##INFO=<ID=CLR,Number=1,Type=Integer,Description="Log ratio of genotype likelihoods with and without the constraint">
##INFO=<ID=UGT,Number=1,Type=String,Description="The most probable unconstrained genotype configuration in the trio">
##INFO=<ID=CGT,Number=1,Type=String,Description="The most probable constrained genotype configuration in the trio">
##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for strand bias, baseQ bias, mapQ bias and tail distance bias">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=PC2,Number=2,Type=Integer,Description="Phred probability of the nonRef allele frequency in group1 samples being larger (,smaller) than in group2.">
##INFO=<ID=PCHI2,Number=1,Type=Float,Description="Posterior weighted chi^2 P-value for testing the association between group1 and group2 samples.">
##INFO=<ID=QCHI2,Number=1,Type=Integer,Description="Phred scaled PCHI2.">
##INFO=<ID=PR,Number=1,Type=Integer,Description="# permutations yielding a smaller PCHI2.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods for RR,RA,AA genotypes (R=ref,A=alt)">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="# high-quality bases">
##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">
##FORMAT=<ID=PL,Number=-1,Type=Integer,Description="List of Phred-scaled genotype likelihoods, number of values is (#ALT+1)*(#ALT+2)/2">

and then some lines describing the variants:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT E_coli_mapped_reads.sorted.bam
Chromosome      2863059 .       T       C       3.55    .       DP=1;AF1=1;AC1=2;DP4=0,0,0,1;MQ=42;FQ=-30       GT:PL:SP:GQ     0/1:31,3,0:0:4

This looks complicated, but it's rather simple. First let's look at the line describing the variant. The first column is the chromosome, then the position. The next column is an ID for that SNP (useful if, for example, you have a list of known SNPs). Then you get the good stuff: the reference allele and the alternative allele. In this case, samtools thinks that while the reference is a T at position 2863059 in the E. coli chromosome, our sample actually has a C. The rest of the fields give you some information about the SNP. For example, the field that starts with "DP=1" is called the info field. If you want to know what all the entries there mean, look back up at the header. For example, this line

##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
means that "DP=1" indicates that only 1 read mapped to this position in the genome. The last column tells you the actual genotype call, as well as the Phred-scaled genotype likelihoods (essentially -10 log10 of the probability of the data given each possible genotype). So here samtools thinks that the E. coli is heterozygous for the alternative allele.
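Pulling these fields apart is simple string splitting. A minimal sketch, assuming a single-sample file like ours:

```python
def parse_vcf_line(line):
    """Split one VCF data line into its fixed columns plus a dict for INFO.
    Returns (chrom, pos, ref, alt, info_dict)."""
    fields = line.strip().split('\t')
    chrom, pos, _, ref, alt = fields[:5]   # column 3 is the SNP ID, ignored here
    info = {}
    for entry in fields[7].split(';'):     # INFO is semicolon-separated key=value
        key, _, value = entry.partition('=')
        info[key] = value                  # flags like INDEL get an empty value
    return chrom, int(pos), ref, alt, info

line = ("Chromosome\t2863059\t.\tT\tC\t3.55\t.\t"
        "DP=1;AF1=1;AC1=2;DP4=0,0,0,1;MQ=42;FQ=-30\t"
        "GT:PL:SP:GQ\t0/1:31,3,0:0:4")
chrom, pos, ref, alt, info = parse_vcf_line(line)
print(pos, ref, alt, info['DP'])  # 2863059 T C 1
```

Real VCF parsing (multiple samples, multi-allelic sites, the FORMAT column) has more corner cases; for serious work, a dedicated library is a better bet than hand-rolled splitting.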

Your vcf file contains all the information about variant positions in your reference genome, but doesn't usually contain any information about the genomic positions that did not vary. This may not matter if you are only concerned with variant positions - say if you are trying to track down the mutation underlying a new phenotype. However, there are many cases where you might want to know the most likely sequence of all the bases in the genome, not just the variant ones.

To generate a consensus sequence for your organism, you can use vcfutils.pl, a perl script distributed with samtools for working with vcf files. Since we want every position, not just the variant ones, we run bcftools view without the '-v' flag, so that it outputs every position in the genome:

$ bcftools view -cg E_coli_mapped_reads.bcf | vcfutils.pl vcf2fq > E_coli_genome_cns.fq

Population genomics

Analyzing genomic variation within and between populations is one of the most powerful uses of high-throughput sequencing. We are able to sequence more individuals to higher quality than ever before and so we can get a genome-wide understanding of variation within populations.

Imagine that you have found all the variable sites in a set of individuals. Although you have all the sequence for each sample, you really only care about those places where at least one sample differs from one other sample. Below is a cartoon of a small dataset with 3 chromosomes. Variant sites are marked with a red x; the rest of the chromosome is assumed to be the same between all individuals. If the variants on two chromosomes (roughly) line up vertically, those chromosomes have the same variant at that site.

pop_sequence.png
There are three summaries of population genomic variation that can act as a first pass at exploratory data analysis. Two of them are just single numbers; the third is a bit more complicated. The two simplest summaries are the number of segregating sites, S, and the average number of pairwise differences, pi. Let's walk through calculating these on the example data above.

S is exactly what it sounds like: simply count up the number of sites that have at least one chromosome with a variant,


variants1.png


so that S is equal to 6 in this case.

Calculating pi is a little bit harder. First, compare every pair of sequences and ask how many differences there are between them. Then, take the average of this number over all the comparisons you did,


variants2.png



so that in this case,
pi = (5 + 3 + 4)/3 = 4.
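Both summaries are easy to compute once you have the sequences as strings. A minimal sketch; the three sequences below are made up to reproduce the counts from the cartoon (S = 6, pairwise differences 5, 3 and 4), not taken from it:

```python
from itertools import combinations

def seg_sites(seqs):
    """S: number of sites where the aligned sequences are not all identical."""
    return sum(len(set(column)) > 1 for column in zip(*seqs))

def pairwise_pi(seqs):
    """pi: average number of differences over all unordered pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    diffs = [sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs]
    return sum(diffs) / len(pairs)

# Hypothetical 3-chromosome alignment matching the cartoon's summaries:
chroms = ["TTAAAAAA",
          "AATTTAAA",
          "AAAAATAA"]
print(seg_sites(chroms))    # 6
print(pairwise_pi(chroms))  # 4.0
```

With 3 sequences there are exactly 3 pairs, so pi here is (5 + 3 + 4)/3 = 4, just as computed above.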


Exercises



VCF:
https://www.dropbox.com/s/33bgaxx6fcsg6d9/CG_chr_21.good.vcf.gz
Individual list:
https://www.dropbox.com/s/1lw5pwok1zpx6j8/CGS_Diversity%2BPed_all54_indvs.txt
1. Download the above VCF file containing variants from human chromosome 21 in various human populations. Individuals are labeled with anonymous identifiers, so you will also need to download a list that tells you which population each individual is from. Then,
a. Compute pi and S within each population SEPARATELY. Note that because these are diploids, you can just select a random allele from each individual when calculating pi (you may want to look at the module "random" for this purpose, specifically the function random.randint).
b. Compute the average distance between a random Yoruban (YRI) and European individual (CEU). How does it compare to within population diversity?

2. Using samtools, find variants in the reads from yesterday. How many heterozygous sites are called? Why might there be "heterozygous sites" in a haploid genome?


Challenge:

The above exercises have used an older (but still useful!) variant calling pipeline. Newer and more complicated variant calling packages are constantly being released, and eventually it is worthwhile to switch to newer, shinier software.

The Genome Analysis Toolkit (aka GATK), put out by the Broad Institute, is one of the more popular 'modern' variant detectors. Download the software from the GATK download page (you may need to create an account to download the software).

The GATK package has many different programs, and a generally more complex pipeline than the above samtools/bcftools example worked during lecture. For our purposes, the tool we care most about is the actual variant caller itself. GATK has two - UnifiedGenotyper and HaplotypeCaller. For almost all purposes, HaplotypeCaller is the best choice.

a. Read through the documentation for GATK's HaplotypeCaller. Use HaplotypeCaller to generate a vcf file from the E_coli_mapped_reads.sorted.bam file. (Hint: you may have to index your BAM file using samtools) (Another Hint: What ploidy is E. coli?)

b. Any published bioinformatics software package will claim to be better than its predecessors (otherwise, why publish it?). It is always a good idea, when trying to decide whether switching to a different piece of software, to compare the output of the new software to the output of the old. Looking at the two vcf files (one from lecture using samtools/bcftools, the other from GATK's HaplotypeCaller), how many variants are shared? How many variants are unique to each vcf? Is switching worth it in this case?



Solutions:
1a)
#!/usr/bin/env python
 
##ex1.py vcf_file ind_file
 
 
import sys
import random
 
vcf = sys.argv[1]
inds = sys.argv[2]
 
 
#parse the individuals and create some useful dictionaries
f = open(inds,'r')
ind_pops = {}
pop_indices = {}
pop_S = {}
pop_pi = {}
pop_goodsites = {}
for line in f:
    split = line.strip().split("\t")
    ind_pops[split[1]] = split[0]
    if split[0] not in pop_indices:
        pop_indices[split[0]] = []
        pop_S[split[0]] = 0.0
        pop_pi[split[0]] = 0.0
        pop_goodsites[split[0]] = 0.0
 
f.close()
 
#parse the VCF
cur_line = -1
v = open(vcf,'r')
for line in v:
    if cur_line % 10000 == 0:
        sys.stderr.write("%i\n"%cur_line)
    #if cur_line > 10000:
    #    break
    cur_line += 1
    ##skip the header lines
    if line.startswith('##'):
        continue
 
    if line.startswith('#'):
        #figure out which indices correspond to which individuals
        split = line.strip().split("\t")
        raw_genos = split[9:]
        for i,ind in enumerate(raw_genos):
            ind_name = ind.split("-")[0]
            pop_indices[ind_pops[ind_name]].append(i+9)
        continue
 
    #compute pi and s by adding SNP by SNP
    split = line.strip().split("\t")
    #go over every population
    for pop in pop_indices:
        is_segregating = False
        #go over every pair of individuals
        for ind1 in pop_indices[pop]:
            for ind2 in pop_indices[pop]:
                #don't count if the two individuals are the same!
                if ind1 == ind2:
                    continue
                #get each nucleotide
                ind1genotype = split[ind1].split("/")
                ind2genotype = split[ind2].split("/")
                #print ind1genotype
                #print ind2genotype
                #determine if either has missing data
                #if any missing data in this population for this site, ignore!
                if "." in ind1genotype or "." in ind2genotype:
                    break
                #choose a random allele from each individual
                ind1allele = ind1genotype[random.randint(0,1)]
                ind2allele = ind2genotype[random.randint(0,1)]
                #determine if they are different
                if ind1allele != ind2allele:
                    #print splitLine[1], pop, ind1genotype, ind2genotype
                    #raw_input()
                    #count the site as segregating if haven't already
                    if not is_segregating:
                        is_segregating = True
                        pop_S[pop] += 1.0
                    #count as a pairwise difference
                    pop_pi[pop] += 1.0
                pop_goodsites[pop] += 1.0
 
#print the results
print "pop","pi","S", "Good Sites"
for pop in pop_pi:
    #number of individuals in THIS population (not the number of populations)
    num_ind = float(len(pop_indices[pop]))
    #the loops above visit every ordered pair, so divide by n*(n-1)
    print pop,pop_pi[pop]/(num_ind*(num_ind-1)),pop_S[pop],pop_goodsites[pop]
 
pop pi S Good Sites
YRI 54432.8909091 165945.0 21077245.0
CHB 6532.61818182 77655.0 3665330.0
ASW 14911.1818182 128424.0 5810458.0
TSI 7103.47272727 79275.0 3652579.0
MKK 8635.58181818 105167.0 3632253.0
MXL 11622.2181818 94685.0 5861955.0
LWK 8947.81818182 113438.0 3507147.0
CEU 40541.1272727 111613.0 21131624.0
JPT 6762.78181818 79081.0 3641208.0
PUR 1208.01818182 47109.0 615354.0
GIH 7361.30909091 87273.0 3688186.0

1b)

#!/usr/bin/env python
 
import sys
 
import random
 
vcf = sys.argv[1]
inds = sys.argv[2]
 
 
#parse the individuals and create some useful dictionaries
f = open(inds,'r')
ind_pops = {}
pop_indices = {}
pop_S = {}
pop_pi = {}
pop_goodsites = {}
for line in f:
    split = line.strip().split("\t")
    ind_pops[split[1]] = split[0]
    if split[0] not in pop_indices:
        pop_indices[split[0]] = []
        pop_S[split[0]] = 0.0
        pop_pi[split[0]] = 0.0
        pop_goodsites[split[0]] = 0.0
 
f.close()
#create the counters
pi_between = 0.0
good_sites = 0.0
 
#parse the VCF
cur_line = -1
v = open(vcf,'r')
for line in v:
    if cur_line%10000 == 0:
        sys.stderr.write("%i\n"%cur_line)
    cur_line += 1
    ##skip the header lines
    if line.startswith('##'):
        continue
 
    if line.startswith('#'):
        #figure out which indices correspond to which individuals
        split = line.strip().split("\t")
        raw_genos = split[9:]
        for i,ind in enumerate(raw_genos):
            ind_name = ind.split("-")[0]
            pop_indices[ind_pops[ind_name]].append(i+9)
        continue
 
    #compute pi between by adding line by line
    split = line.strip().split("\t")
    for ind1 in pop_indices["YRI"]:
        for ind2 in pop_indices["CEU"]:
            ind1genotype = split[ind1].split("/")
            ind2genotype = split[ind2].split("/")
            #check if any missing data
            if "." in ind1genotype or "." in ind2genotype:
                break
            #choose a random allele
            ind1allele = ind1genotype[random.randint(0,1)]
            ind2allele = ind2genotype[random.randint(0,1)]
            #determine if they are different
            if ind1allele != ind2allele:
                pi_between += 1.0
            good_sites += 1.0
print "pi", "goodsites"
print pi_between/(9.0*9.0), good_sites
 
pi goodsites
41916.6666667 23596420.0


2)
$ grep -c '0/1' E_coli.vcf
89

Challenge:

Note: This proved to be more of a challenge than I thought at first, as completing this challenge required use of Picard tools, a collection of Java programs that work on SAM/BAM/VCF files.

First, create a dictionary of the reference fasta file (similar to indexing with samtools faidx)
$ java -jar /home/james/Programs/picard-tools-1.138/picard.jar CreateSequenceDictionary R=E_coli_genome.fasta O=E_coli_genome.dict

GATK only takes BAM files that have read groups - which allow tracking of reads if they are pooled
$ java -jar /home/james/Programs/picard-tools-1.138/picard.jar AddOrReplaceReadGroups I=E_coli_mapped_reads.sorted.bam O=E_coli_mapped_reads.sorted_rg.bam ID=E_Coli LB=E_Coli PL=ILLUMINA PU=E_Coli SM=E_Coli

GATK (and many other bioinformatics packages) require their BAM files to be indexed
$ samtools index E_coli_mapped_reads.sorted_rg.bam

Now we can finally call SNPs with the HaplotypeCaller. HaplotypeCaller allows adjustment of the ploidy of the organism, instead of assuming everything is a diploid
$ java -jar /home/james/Programs/GATK/GenomeAnalysisTK.jar -T HaplotypeCaller -R E_coli_genome.fasta -I E_coli_mapped_reads.sorted_rg.bam --sample_ploidy 1 -o E_coli_GATK.vcf

Let's compare the results of bcftools and GATK:
##input: a vcf file
##output: a set containing all positions in the vcf file
##set is a set of tuples in the form (chrome,pos)
def load_VCF_pos(invcf):
 
    f = open(invcf,'r')
 
    position_set = set()
 
    for line in f:
        if line.startswith('#'):
            continue
 
        line = line.strip()
        split = line.split('\t')
        chrome = split[0]
        pos = split[1]
 
        position_set.add( (chrome,pos))
 
    return position_set
 
bcf_set = load_VCF_pos('E_coli.vcf')
GATK_set = load_VCF_pos('E_coli_GATK.vcf')
 
print "BCF Count: " , len(bcf_set)
print "GATK Count: ", len(GATK_set)
 
print "Intersection: " ,len(GATK_set.intersection(bcf_set))
 
 
BCF Count: 133
GATK Count: 3
Intersection: 2

With the default settings, it seems like GATK is much stricter about calling SNPs. All the SNPs called by GATK seem to have more reads supporting them than those called by bcftools, at least with the default cutoffs for each program. Notably, GATK doesn't call any heterozygous SNPs, which is what we should expect when running with a ploidy of 1 on a haploid organism.