Parsing Tabular Data with Pandas & Using Biopython


Topics

  • Tabular data with Pandas
  • Introduction to Biopython
  • Using the SeqIO module
  • Parsing GFF Files with BCBio-gff


Parsing Tabular Data with Pandas



One of the things that gets boring really fast is reading, writing, and parsing tables of data, especially in a way that preserves the information we care about. Recently, a really nice package has caught on for doing exactly that, called pandas (short for PANel DATA). Pandas uses numpy internally, so it's really fast, but it also has lots of nice features for dealing with tables that may contain several types of data.

Pandas, like numpy and scipy and many of the modules we'll be using in this course, comes with the Anaconda python package. Let's look at an example (which I'll do in IPython, in line with the exploratory nature of poking at someone else's data). Let's revisit the Ecoli_genes.tab file from Session 4.2. Use mv to move that file from where it is to your current directory. Then:

$ ipython

import pandas as pd
 
operonCalls = pd.read_table("operonCalls.tab")
print type(operonCalls)
 
operonCalls
 
# operonCalls is a pandas data type called a DataFrame
# which is modeled after an equivalent data type
# in the language R

Pandas will automatically be able to tell the column and row names (if there are any), and parse the data accordingly. This is so much faster than having to parse the file out manually using a for loop, for example.

# Get the row #, column# of the data frame
operonCalls.shape
 
# View the first 5 lines
operonCalls.head()
print
operonCalls.head(10)
print
 
# View the last 5 lines
operonCalls.tail()
print
operonCalls.tail(10)
print
 
#To get an array of column names
operonCalls.columns.values
#This gives you an array of row names or row numbers
operonCalls.index.values
 
operonCalls['Name1']  # One way to access a column
operonCalls.Name1 # Another way to look at a column
 
operonCalls.ix[0]  # Using .ix is the only good way to look at a row
 
# You can then ierate over items in each row/column
 
def do_something(row):
    #This function does nothing
    pass
 
for index in operonCalls.index:
    do_something(operonCalls.ix[index])
 
# or
 
for row_num, row in operonCalls.iterrows():
    do_something(row)
 

Another great thing about Pandas is that it can tell what type of object is in each column (provided they all look like the same thing), without you having to tell it what the object is. For example:

operonCalls['Gene1']
# This column is numbers, so that's what it's stored as
# dtype = int64
 
operonCalls['Name1']
# This one is strings, so that easily pops out too!
# dtype = object
 
operonCalls["pOp"]
# This one is floats, dtype = float64
 
# Because it can recognize numerical arrays,
# You can perform basic operations on the entire column array
operonCalls.pOp+10
operonCalls.pOp.mean()
sum(operonCalls.pOp)
operonCalls.pOp.min()
operonCalls.pOp.max()
 
# You can convert the pandas types into plain old numpy arrays
np.array(operonCalls.pOp)
 

Let's say that there is a specific gene name you want to search for though. You know what column it's in, but not what row.

operonCalls.Name1=='yaaH'
# This is the simplest way to do a search
# It returns boolean values that evaluate the truth of
# Name1 =='yaaH' at each row position
 
# You can also look for all rows where pOp is
# greater than or equal to 0.57
operonCalls.pOp >= 0.57


Biopython


One of the things we're teaching you in this course is how to use existing biology-centric modules in python so you don't have to re-invent the wheel! (Although, sometimes, re-inventing the wheel can be a useful teaching exercise...). Today, we'll be exploring the Biopython module. Biopython is a package with a collection of useful tools for bioinformatics. It can do things like parse sequence files in a variety of different formats (.fasta, .fastq, clustal alignment files, ExPASy files, Swiss Prot, etc), provide code to manipulate sequences (for example, perform transcription, translation, or reverse-complement), and interface with many useful databases on NCBI (Entrez, Pubmed, Blast).

To install Biopython on a Mac, you can type:

$ pip install biopython

If you have Linux, you can type:

$ sudo apt-get install python-biopython

Type in your password and biopython should automatically install. For more installation instructions (should the above not work), please see installing Biopython.

Now, let's start ipython and import the Biopython module:

$ ipython
import Bio

If that worked without throwing an Import error, then congrats, you successfully installed BioPython! We are going to walk through a few examples adapted from the Biopython manual.

Using the SeqIO Module


One of the more useful sub-modules is SeqIO, which does input/output for sequences (and, I'm sorry to say, probably does it better than the FASTA parsers we've written thus far). SeqIO has a special sequence "class" that associates various attributes with sequence objects (some are methods, and some are variables). Let's look at an example parsing the cerevisiae_genome.fasta file from the first day of class. Make a directory in your ~/PythonCourse for today (S6.2) and use cp to copy the cerevisiae_genome.fasta into this folder. Then simply type (or cpaste):

# Bio Example 1
 
from Bio import SeqIO
 
# SeqIO.parse will parse a fasta file
# It returns an iterator object of sequence records
# each with its own attributes
 
fastaFile = 'cerevisiae_genome.fasta'
for record in SeqIO.parse(fastaFile,format="fasta"):
    print record
    print
    print
 
 

So you can see that each of these record objects holds quite a bit of information. Each record object also has special methods associated with it that you can use to pull specific bits of data about each sequence record. For example:

# Bio Example 2
 
fastaFile = 'cerevisiae_genome.fasta'
for record in SeqIO.parse(fastaFile,format="fasta"):
    print "This is the name of the sequence record: ",record.name
    # The .seq method references a sequence object
    # That is indexed and can therefore be sliced
    print "these are the first 50 bases:"
    print record.seq[0:50]
    print type(record.seq)
    print
 

Other methods include .description, .id, .name, etc, that can all be parsed and formatted just like you would any string in Python. For a complete list of record methods in ipython, just hit tab after "record." You'll also notice that the sequences are of a special "Bio.Seq.Seq" class and not traditional python strings. That is (for most purposes) not an issue--you can still apply many string methods to the sequence:

# Bio Example 3
 
fastaFile = 'cerevisiae_genome.fasta'
for record in SeqIO.parse(fastaFile,format="fasta"):
    print "This is the name of the sequence record: ",record.name
    # string methods can be applied to
    # seq objects
    sequence = record.seq[0:100]
    print sequence
    print
    print sequence.lower()
    print
    print 'atgtag'+sequence
 

And while we're on the subject of parsing files, let's talk about another common file type that you might deal with a lot if you are working with many genome sequences: the GFF/GFF3 file. This file format is a tab-delimited file that describes every genomic feature in a given genome, and many genomics programs require one as input. These files can be a pain to parse. Luckily, some generous folks have written a python GFF parser called BCBio-gff. It has not been fully integrated into Biopython yet, so it is still a separate module.

To download this module, Mac users type:

$ pip install bcbio-gff

For Linux users, type:

$ cd
$ git clone git:github.com/chapmanb/bcbb.git
$ cd bcbb/gff
$ python setup.py build
$ sudo python setup.py install

When installation is done, cd back into your ~/PythonCourse/S6.2 directory.

If you'd like more information on all the cool things bcbio-gff can do, you can find the manual here. Now, let's start parsing a sample GFF! Download and save the S. cerevisiae GFF file to your S6.2 directory: http://www.yeastgenome.org/download-data/sequence. If you less into this file, you'll see that it's rich in information about the S. cerevisiae genome, arguably one of the most well-understood eukaryotic genomes in biology.

# Bio Example 4
 
from BCBio import GFF
 
gff = 'saccharomyces_cerevisiae.gff'
gffFile = open(gff)
for rec in GFF.parse(gffFile):
    print rec
    print
    print
 
gffFile.close()
 
In this case, each rec variable represents an entire chromosome, and within each chromosome, there are hundreds of features (genes, ARSs, telomere_repeats, etc). To parse out the specific feature information we want, we have to apply more methods. Let's say you're a yeast biologist that works on silencing (cough cough not me cough), and you're only interested in the silenced parts of the genome:

# Bio Example 5
 
gff = 'saccharomyces_cerevisiae.gff'
gffFile = open(gff)
for rec in GFF.parse(gffFile):
    for item in rec.features:
        if item.type=='silent_mating_type_cassette_array':
            print item.id
            print rec.id
            print item.location
gffFile.close()
 

You can even write your own appropriately-formatted GFF files with bcbio-gff!




Exercises


1) More fun with SeqIO!

a) Get the help information on SeqIO.write and the .reverse_complement method, and figure out how to write out the reverse-complement of all of the yeast genome to a new file.

b) Write a function that will convert a one-letter amino acid sequence (e.g. A = alanine, K = lysine, etc) into a three-letter amino acid sequence (alanine is Ala, lysine is Lys). You may find the following dictionary helpful:

threecode = {'A':'Ala', 'B':'Asx', 'C':'Cys', 'D':'Asp',
                 'E':'Glu', 'F':'Phe', 'G':'Gly', 'H':'His',
                 'I':'Ile', 'K':'Lys', 'L':'Leu', 'M':'Met',
                 'N':'Asn', 'P':'Pro', 'Q':'Gln', 'R':'Arg',
                 'S':'Ser', 'T':'Thr', 'V':'Val', 'W':'Trp',
                 'Y':'Tyr', 'Z':'Glx', 'X':'Xaa', '*':'Ter',
                 'U':'Sel', 'O':'Pyl', 'J':'Xle',
                 }

c) Now, import Bio.SeqUtils, and look at SeqUtils.seq3. How similar is this to what you wrote in part b?

2) Combining SeqIO and GFF parsing


a) Using the GFF parser to get genomic features and SeqIO to parse the cerevisiae_genome.fasta, write a script that will take all coding sequences for all genes on the first chromosome, translate them to protein, and write these protein sequences out to another file titled "protein_seqs.fasta."

b) Write a function using the GFF parser that takes as input the name of the GFF file, chromosome name, start position, and stop position, and returns a list of all genomic features in that interval.

3) Practice with pandas

Re-write the solution to exercise 4.2 incorporating pandas to parse the tab delimited files.




Solutions


1) More fun with SeqIO!
## 1 ##
## 1a ##
 
# Read about SeqIO.write by typing
# "SeqIO.write?" into ipython
# rec.seq.reverse_complement? to read about
# how to reverse-complement a sequence
 
from Bio import SeqIO
 
# Read into the Scer genome file
fasta='cerevisiae_genome.fasta'
scer_genome = SeqIO.parse(fasta,"fasta")
 
# Open a new filehandle
# For the reverse-complement output
outFile='revComp_cerevisiae_genome.fasta'
outFasta = open(outFile,'w')
 
for rec in scer_genome:
    chromosome = rec.id
    desc = rec.description+" reverse complement"
    sequence = rec.seq
    revComp = sequence.reverse_complement()
    revCompseq = SeqIO.SeqRecord(revComp,id=chromosome,description=desc)
    SeqIO.write(revCompseq,outFasta,"fasta")
 
outFasta.close()
 
## 1b ##
 
 
def threecode(proteinSeq):
    threecode = {'A':'Ala', 'B':'Asx', 'C':'Cys', 'D':'Asp',
                 'E':'Glu', 'F':'Phe', 'G':'Gly', 'H':'His',
                 'I':'Ile', 'K':'Lys', 'L':'Leu', 'M':'Met',
                 'N':'Asn', 'P':'Pro', 'Q':'Gln', 'R':'Arg',
                 'S':'Ser', 'T':'Thr', 'V':'Val', 'W':'Trp',
                 'Y':'Tyr', 'Z':'Glx', 'X':'Xaa', '*':'Ter',
                 'U':'Sel', 'O':'Pyl', 'J':'Xle',
                 }
    threeLetterSeq = ""
    for aa in proteinSeq:
        threeLetterSeq+=threecode[aa]
    return threeLetterSeq
 
## 1c ##
 
import Bio.SeqUtils as bioS
# to read up on SeqUtils.seq3, type
# bioS.seq3? in ipython

2) Combining SeqIO and GFF parsing

## 2 ##
## 2a ##
# A seq object in SeqIO has the .translate method
# Look it up to read how it works
 
# Step one: add a conditional if rec.id has string 'chromosomeI'
# Use GFF methods to get list of features on that chromosome that type == CDS
# If it does, save that sequence.
# Use the translate method on that sequence to translate it
# Create a "SeqRecord" object for the sequence
# Write that seq record out to protein_seqs.fasta
 
from BCBio import GFF
 
gff = 'saccharomyces_cerevisiae.gff'
gffFile = open(gff)
 
outFile = open('protein_seqs.fasta','w')
for record in GFF.parse(gffFile):
    if record.id=='chrI':
        parentSeq = record.seq
        Features = record.features
        for feature in Features:
            if feature.type=='gene':
                name = feature.id
                featureSeq = feature.extract(parentSeq)
                ptnSeq = str(featureSeq.translate())
                outFile.write('>'+name+'\n')
                outFile.write(ptnSeq+'\n')
            else:
                pass
    else:
        pass
outFile.close()
 
######
## ALTERNATIVE SOLUTION ##
######
 
from Bio import SeqIO, SeqUtils, Seq
from BCBio import GFF
##gff_getinfo(gff file name, chromosome you want, ex. 'I') -- returns a list of lists, where each sublist includes the start and end positions, and the gene name
def gff_getinfo(filename,chromosome):
gffFile = open(filename)
for rec in GFF.parse(gffFile):
if rec.id != "chr"+chromosome: continue ##selects only the chromosome requested
else: genepos = [[int(item.location.start),int(item.location.end),item.id] for item in rec.features if item.type=='gene'] ##gets list of start and end positions, and the gene name from each line that's a gene
gffFile.close()
return genepos
 
##callseqs(fastafile, list returned from gff_getinfo(), newfilename of protein fasta file) -- returns "Done!" and makes new protein fasta file
def callseqs(filename,myposlist, newfile):
for record in SeqIO.parse(filename,format='fasta'):  ##loop through fasta file until grabbed chr specified, then break
if record.description.split('chromosome=')[1].split(']')[0] == 'I':
mychr = record ##mychr becomes the Seq for the chromosome we want
break
##loop through positions from gff file, grabbing region in mychr, translating it, adding a record with the gene name, and write to fasta
for ind,i in enumerate(myposlist):
myprot =mychr.seq[i[0]:i[1]].translate()
myprotSeq = SeqIO.SeqRecord(myprot,id=i[2], description="| start "+ str(i[0]) + " end " + str(i[1]))
SeqIO.write(myprotSeq,newfile,"fasta")
return "Done!"
 
mypos =  gff_getinfo('saccharomyces_cerevisiae.gff','I')
print mypos
myseqs = callseqs('cerevisiae_genome.fasta',mypos,open("protein_seqs.fasta",'w'))
print myseqs
 
 
## 2b ##
 
from BCBio import GFF
 
def getFeatures(gffFile,chromosome,start,stop):
    # First open and read into the gff file
    featureList=[]
    gff = open(gffFile,'r')
    for record in GFF.parse(gff):
        if record.id==chromosome:
            for feature in record.features:
                loc = feature.location
                featureStart = loc.start
                featureEnd = loc.end
                if (featureStart>=start) and (featureEnd<=stop):
                    featureList.append(feature.id)
                else:
                    pass
        else:
            pass
    gff.close()
    return featureList 
 
print getFeatures("saccharomyces_cerevisiae.gff","chrV",1000,20000)
 


3) Practice with pandas

from collections import defaultdict
import pandas as pd
 
 
def get_operon_pairs(operon_filename):
    # operons = open(operon_filename)
    # operons.readline()
    # # This skips the header line
 
    operons = pd.read_table(operon_filename)
 
    calls = defaultdict(bool)
    # # If no call, it is not an operon
    # calls[('rrmJ', 'ftsH')] = True
    # calls[('ftsH', 'rrmJ')] = True
 
    for row_num, row in operons:
        #line = line.strip()
        #elements = line.split('\t')
        #gene1 = elements[4]
        #gene2 = elements[5]
        #is_operon = (elements[6] == "TRUE")
        gene1 = row['SysName1']
        gene2 = row['SysName2']
        is_operon = row['bOp']
        calls[(gene1, gene2)] = is_operon
        calls[(gene2, gene1)] = is_operon
 
    return calls
 
 
operons = get_operon_pairs("operonCalls.txt")
# gene_info = open('geneInfo.txt')
# gene_info.readline()
# # Skip the header line
gene_info = pd.read_table('geneInfo.txt')
 
current_name = ''
current_start = -1
current_stop = -1
current_strand = '.'
current_operon_names = []
 
GFF_base = '\t'.join(['Chromosome', 'MicrobesOnline', 'operon',
                      '%d', # low end position
                      '%d', # high end position
                      '.', '%s', # strand
                      '.', 'operonName "%s";\n'])
outfh = open('operons2.gff', 'w')
 
genes_thus_far = set()
 
#for line in gene_info:
for row_num, row in gene_info.iterrows():
    #next_line = line.strip()
 
    if 'CRISPR' in row['name'] or 'insE' in row['name'] or 'insC' in row['name']:
        continue
 
    # next_elements = line.split('\t')
    # next_start = int(next_elements[4])
    # next_stop = int(next_elements[5])
    # next_strand = next_elements[6]
    # next_name = next_elements[8]
    next_start = row['start']
    next_stop = row['stop']
    next_strand = row['strand']
    next_name = row['sysName']
    next_hname = row['name']
 
    if not operons[(current_name, next_name)]:
        if current_name:
        # current_name starts out empty, so on the first valid line, this
        #  won't write out to the file
            GFF_line = GFF_base % (min(current_start, current_stop),
                                   max(current_start, current_stop),
                                   current_strand,
                                   '-'.join(current_operon_names))
            outfh.write(GFF_line)
        # current_operon_names = [next_name]
        current_operon_names = [next_hname]
        current_start = next_start
        current_stop = next_stop
        current_strand = next_strand
 
    else: # Genes are in the same operon
        if current_strand != next_strand:
            print line
            # This is just a consistency check.
        if current_strand == "+":
            current_start = min(current_start, next_start)
            current_stop = max(current_stop, next_stop)
            # current_operon_names.append(next_name)
            current_operon_names.append(next_hname)
        elif current_strand == "-":
            current_start = max(current_start, next_start)
            current_stop = min(current_stop, next_stop)
            # current_operon_names.insert(0, next_name)
            current_operon_names.insert(0, next_hname)
 
    current_strand = next_strand
    current_name = next_name
 
 
# Clear out the last one, which has no "next"
GFF_line = GFF_base % (min(current_start, current_stop),
                       max(current_start, current_stop),
                       current_strand,
                       '-'.join(current_operon_names))
outfh.write(GFF_line)
outfh.close()