Plotting Part II: Boxplots and Graphing Shapes Using Python




Informative Interlude


Before we jump into this afternoon's very short lecture, there is an important functionality of matplotlib we should touch on. Many of you are probably familiar with xkcd, the science web comic.

Turns out there is a built-in function within matplotlib that will turn your plots into xkcd comics! To explore this functionality, re-start ipython and copy and paste in the code below. More details here:
http://matplotlib.org/xkcd/examples/showcase/xkcd.html

#!/usr/bin/env python
 
# http://matplotlib.org/xkcd/examples/showcase/xkcd.html
# close ipython and restart it
# $ quit()
# $ ipython
 
# Then, copy & paste the following:
 
import matplotlib
matplotlib.use('QT4Agg')
 
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
 
plt.xkcd()
 
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
plt.xticks([])
plt.yticks([])
ax.set_ylim([-30, 10])
 
data = np.ones(100)
data[70:] -= np.arange(30)
 
plt.annotate(
    'THE DAY I REALIZED\nI COULD COOK BACON\nWHENEVER I WANTED',
    xy=(70, 1), arrowprops=dict(arrowstyle='->'), xytext=(15, -10))
 
plt.plot(data)
 
plt.xlabel('time')
plt.ylabel('my overall health')
 
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.bar([-0.125, 1.0-0.125], [0, 100], 0.25)
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
ax.xaxis.set_ticks_position('bottom')
ax.set_xticks([0, 1])
ax.set_xlim([-0.5, 1.5])
ax.set_ylim([0, 110])
ax.set_xticklabels(['CONFIRMED BY\nEXPERIMENT', 'REFUTED BY\nEXPERIMENT'])
plt.yticks([])
 
plt.title("CLAIMS OF SUPERNATURAL POWERS")
 
plt.show()
 
 
# Now, turn your own plot into
# an XKCD comic!
 
fpkms = pd.read_table("FPKM_table.tab")
 
# To find what row "HMLALPHA2" is in
for num,row in fpkms.iterrows():
    if row["Gene"]=="HMLALPHA2":
        rowNumber = num
    else:
        pass
 
print rowNumber
 
# Get values for all columns for this row
# Excluding Gene
expressionValues = fpkms.ix[1617][1::]
 
# Convert to Numpy Array
expArray = np.array(expressionValues)
print expArray
 
# Get length of the array to pass to plt.bar
l = len(expArray)
 
# Plot
plt.bar(range(1,l+1),expArray,align="center")
plt.xticks(range(1,l+1),fpkms.columns.values[1::])
plt.show()
 
 


This afternoon, we'll be finishing out the second week of the bootcamp by touching on two more plotting capabilities in python: boxplots and graphing shapes. Let's begin!

Boxplots


Boxplots are a great way of graphing the distribution of measurements for a given sample, whether that be FPKMs, quality scores, or any other count value. Let's explore how to make box-plots in python using an example adapted from the boxplot demo in the matplotlib gallery.

from pylab import *
 
# Remember that when we are in pylab mode
# numpy, scipy and matplotlib are automatically loaded
# so we don't have to put the module abbreviations
# in front of functions
 
# fake up some data
spread= rand(50) * 100
center = ones(25) * 50
flier_high = rand(10) * 100 + 100
flier_low = rand(10) * -100
data =concatenate((spread, center, flier_high, flier_low), 0)
 
# basic plot
boxplot(data)
show()
 
# To make a notched boxplot
boxplot(data,1)
show()
 
# To change outlier point symbols
boxplot(data,1,'gD')
show()
 
# horizontal boxes
boxplot(data,0,'rs',0)
show()
 
One final note: to plot more than one boxplot, pass an entire array of data columns to plt.boxplot. It will automatically plot a boxplot for each column.

Graphing Shapes


The last graphing functionality we'll talk about is how to graph shapes in python. As a biologist, you might be graphing genomic features along a horizontal axis to highlight ORF positions, for example. There are many ways of graphing shapes within matplotlib (for more details, click here). Today we'll be talking about one such way: graphing shapes with the pylab module. Two potentially more biologically relevant shapes are rectangles (to represent non-coding genomic features) and arrows (to represent ORFs), although you can draw almost any shape conceivable using python (check out the matplotlib gallery to see more!).

import pylab as p
import matplotlib.pyplot as plt
 
# Drawing Rectangles
# Use p.Rectangle to graph a simple rectangle
# Specify the coordinates of the lower-left corner
# with a tuple as the first argument
# 2nd argument = width, 3rd argument = height
 
rectangle = p.Rectangle((1,2),4,5,color='red')
 
# Shapes are added to figures in a two-step process
# First, grab the current axis info using .gca()
axesInfo = p.gca()
 
# Second, add the rectangle using "add_patch"
axesInfo.add_patch(rectangle)
 
# Make sure the x & y limits include the coordinates for
# the patch!!
 
plt.ylim(0,10)
plt.xlim(0,10)
plt.show()
 
 
# To draw arrows
# First two numbers indicate start coordinates of the arrow
# 3rd number is how far in the x-direction the arrow goes (x-distance)
# 4th number is how far in the y-direction the arrow goes (y-distance)
# width = width of arrow stem
# head_width = width of arrow head
 
# Horizontal arrow
arrow = p.arrow(2,3,5,0,length_includes_head=True,head_length=2,head_width=2,color="green",width=1)
plt.ylim(0,10)
plt.xlim(0,10)
plt.xticks(range(10))
plt.show()
 
# Arrow pointing to another coordinate
arrow = p.arrow(2,3,5,8,length_includes_head=True,head_length=2,head_width=2,color="green",width=1)
plt.ylim(0,15)
plt.xlim(0,10)
plt.xticks(range(10))
plt.show()
 
# To flip the direction of the arrow
# make the x-distance a negative number
 
arrow = p.arrow(2,3,-5,0,length_includes_head=True,head_length=2,head_width=2,color="green",width=1)
plt.ylim(0,10)
plt.xlim(-5,5)
plt.xticks([-5,-4,-3,-2,-1]+range(5))
plt.show()
 
# Changing the arrow width to be the same width as
# the arrow head gives a different looking arrow
 
arrow = p.arrow(2,3,5,0,length_includes_head=True,head_length=2,head_width=2,color="green",width=2)
plt.ylim(0,10)
plt.xlim(0,10)
plt.xticks(range(10))
plt.show()
 
 




Exercises


1) Boxplots

Create a graph that plots the distribution of quality scores for each read position using the .fastq file ("E_coli_reads.fastq") from Exercise #1, Session 7.2 in the form of boxplots. If you want, you can use Biopython to extract quality scores for each read. The x-axis should be "read position," and the y-axis should show a box-plot distribution of quality scores for that position.

Figure out how to exclude the flier points (points outside of the box) from plotting. Then, save the figure in a format of your choice.

2) Graphing genome features

Below is a dictionary of genomic features in the S. cerevisiae genome on chromosome 10, coordinates 5000-10000. Each feature is a key and values are a tuple of four elements: the start position, stop position, orientation (1 for plus, 0 for minus), and whether or not the feature is an ORF (if ORF, True; if not, False).

Loop through the dictionary and plot each feature on a graph spanning 0-5000 points on the x-axis. If the feature is an ORF, plot it as a purple arrow; if the feature is NOT an ORF, plot it as a gray rectangle patch. Remember to plot ORFs in the appropriate orientation and to adjust the feature location coordinates for plotting on an x-axis that starts at 0!!!

featureDict =  {'TEL10L_X_element':(7004,7367,1,False),
'ARS1002':(7544,7943,1,False),
'YJL223C':(8775,9138,0,True),
'YJL222W-B':(9345,9483,1,True),
'YJL222W-A':(9551,9679,1,True)}


Solutions


1) Boxplots

 
## 1 ##
 
from Bio import SeqIO
import matplotlib.pyplot as plt
import numpy as np
 
# Parse .fastq file
ecoli_reads = SeqIO.parse("E_coli_reads.fastq","fastq")
 
qualityScores=[]
 
for read in ecoli_reads:
    quality = read.letter_annotations["phred_quality"]
    qualityScores.append(quality)
 
# Convert list of lists to array
 
qualityArray = np.array(qualityScores)
print qualityArray
 
# Plot array
plt.boxplot(qualityArray,1,showfliers=False)
plt.savefig("boxplot_ex1.png")
 

2) Graphing genome features


## 2 ##
import matplotlib.pyplot as plt
import pylab as p
 
featureDict = {'TEL10L_X_element':(7004,7367,1,False),'ARS1002':(7544,7943,1,False),'YJL223C':(8775,9138,0,True),'YJL222W-B':(9345,9483,1,True),'YJL222W-A':(9551,9679,1,True)}
 
plt.figure()
 
for feature in featureDict:
    start,stop,orientation,orf = featureDict[feature]
    # adjust start,stop coordinates
    start,stop = start-5000,stop-5000
    # check to see whether or not feature
    # is ORF
    if orf:
        # plot arrow
        orf_length = stop-start+1
        # check orientation
        if orientation:
            pass
        else:
            orf_length=orf_length*-1
        p.arrow(start,3,orf_length,0,length_includes_head=True,head_length=50,head_width=3,width=1.5,color='purple')
    else:
        # plot rectangle
        rectangleWidth=stop-start+1
        rectangle = p.Rectangle((stop,2),rectangleWidth,1.5,color="gray")
        axesInfo = p.gca()
        axesInfo.add_patch(rectangle) 
 
plt.ylim(0,10)
plt.xlim(0,5000)
plt.show()