Plotting Part I: How to make fancy figures


Okay, there's a lot more to this than we can realistically cover in one lecture, but there are lots of things you can do to make really cool figures in Python. We're going to be using matplotlib, which is a plotting library that took a lot of the plotting functionality from the popular MATLAB software, re-wrote it in Python, and (in my opinion) made it about 10 times saner and easier to use.

Nevertheless, we'll try to cover the basics, and give you a good enough understanding of how everything is set up so you can look at the extensive gallery of examples and figure out how to make similar plots with your own data.

Basic plotting

There are two main ways we could go through this plotting lecture. One is by good ole' fashioned scripting in aquamacs, and the other is through the "pylab" mode of IPython. IPython in pylab mode is meant to be set up pretty close to the MATLAB environment, in terms of plotting and doing other math heavy stuff. To use the pylab mode of ipython, type:

$ ipython --pylab

Pylab will automatically load matplotlib and will automatically show the graphs that you plot. But if you are scripting this in aquamacs, you actually want to import the MATLAB plotting library. For today, we'll work primarily out of ipython and import the modules separately. This is just my personal preference; if you prefer to work out of one or the other, go for it!



Informative Interlude: Differences between IPython with and without --pylab


If you start IPython without the --pylab flag, and you discover that you want to plot things, you have a couple options. By far the best of these is to use the %pylab magic word, which will load all the pylab related things that you need. Next best is to use from pylab import *, which will load all the plotting functions, but they won't work quite properly for interactive plotting. Instead, every time you want to see what you've plotted, you'd need to use the show() function, which will block anything else you do until you close the window. If you want to have a program that does its plotting, then saves the figures out, then you can do something like the above, where you have

import numpy as np
import matplotlib.pyplot as plt
 
# Plotting code here
# ...
# ...
 

In a perfect world, someone ought to be able to run a script and have almost all their figures for a paper just pop out, with only relatively minor tweaking. Then, if that code were available too, it should be possible for a reasonably savvy reviewer to see that all the data is on the up-and-up.



Now let's start plotting! Let's say we have some data that's approximately a line, but there's some noise in it. Let's plot it:

#!/usr/bin/env python
 
import numpy as np
import matplotlib.pyplot as plt
 
x = np.arange(0,100)
y = 0.5 * x + 5 + 10*np.random.uniform(0,3,len(x))
plt.plot(x,y)
plt.show()
 

That was pretty easy! Now, let's say we don't want to have lines connecting each data point, but instead just a marker. We can look at the documentation for plot like we would anything else in IPython:

In [6]: plot?
 
Type: function
Base Class: <type 'function'>
String Form:<function plot at 0x6312770>
Namespace: Interactive
File: /Library/Frameworks/.../site-packages/matplotlib/pyplot.py
Definition: plot(*args, **kwargs)
Docstring:
Plot lines and/or markers to the
:class:`~matplotlib.axes.Axes`. *args* is a variable length
argument, allowing for multiple *x*, *y* pairs with an
optional format string. For example, each of the following is
legal::
 plot(x, y) # plot x and y using default line style and color
 plot(x, y, 'bo') # plot x and y using blue circle markers
 plot(y) # plot y using x as index array 0..N-1
 plot(y, 'r+') # ditto, but with red plusses
 
If *x* and/or *y* is 2-dimensional, then the corresponding columns
will be plotted.
 
An arbitrary number of *x*, *y*, *fmt* groups can be
specified, as in::
 a.plot(x1, y1, 'g^', x2, y2, 'g-')
Return value is a list of lines that were added.
The following format string characters are accepted to control
the line style or marker:
================ ===============================
character description
================ ===============================
``'-'`` solid line style
``'--'`` dashed line style
``'-.'`` dash-dot line style
``':'`` dotted line style
``'.'`` point marker
``','`` pixel marker
``'o'`` circle marker
``'v'`` triangle_down marker
``'^'`` triangle_up marker
...

You can see that there's a lot of different things you can do for something as simple as plotting... Markers, colors, lines. If you keep reading, you can even incorporate labels for the lines. Let's try this code, now, and see what it looks like:

x = np.arange(0,100)
y = 0.5 * x + 5 + 10*np.random.uniform(0,3,len(x))
plt.plot(x,y, 'bo')
plt.show()
 
from scipy import stats
 
r_slope, r_int, r_rval, r_pval, r_stderr = stats.linregress(x, y)
plt.plot(x, x * r_slope + r_int, 'g-.', label='Dash-dotted Line')
plt.show()


Hey, what about the labels? To incorporate these, use the legend() function.

## To add label to the graph, call the legend function and add the kwarg "label" to the plot function.
plt.plot(x, x * r_slope + r_int, 'g-.',label='Dash-dotted Line')
plt.legend()
plt.show()
 

Or, if we decide that we don't like the labels that we gave it before, we can put a list of labels into legend(). This case, we're plotting two lines, and each line will take the label corresponding to the string in the position in the list by the order in which it's plotted:

x = np.arange(0,100)
y = 0.5 * x + 5 + 10*np.random.uniform(0,3,len(x))
r_slope, r_int, r_rval, r_pval, r_stderr = stats.linregress(x, y)
 
plt.plot(x, y, x * r_slope + r_int, 'g-.',label='Dash-dotted Line')
plt.legend(["Noisy data", "Linear regression"])
plt.show()
 

Here are some other tweaks in the call to the legend() function:

plt.legend(["Noisy data", "Linear regression"], loc='lower right', numpoints=1,fancybox=True, shadow=True)
 
If you want to look at the full list of options you can set with the legend() function, try plt.legend? in ipython.

Ok, so let's say you've spent all this time and you're reasonably satisfied with the figure you've created. To save the figure into a file, use the savefig function:

plt.savefig('filename.png',format='png')
#OR
plt.savefig('filename.pdf',format='pdf')
 

Pretty cool! You can load your data, graph it in the way that you want, and then save that figure, ready to go, or import into Illustrator or any other image editor of your choice for further editing.

Making our own plotting functions


You know how in papers, they will sometimes have a kind of fancy figure, and then they'll have things in the same style, but for a bunch of different ways of slicing and dicing their data? It's really pretty effective scientific story-telling. It allows them to connect all those figures together conceptually, and readers only have to look for the relevant differences.

The thing is, if you're going to actually make those figures, it can be annoying to tweak the plots in the same way every time. Fortunately, we've spent almost two weeks now taking boring things that a person could do and making the computer automate them.

Creating a scatter plot


Yesterday, we had you run cuffquant, cuffnorm, and cuffdiff on your RNA-seq samples. Cuffnorm returned an output file that contained a list of each transcript and its FPKM value across the different treatments. Let's say you are interested in what the correlation is between FPKM values between samples (to compare replicates, let's say, or even different treatment conditions). Let's create a function that will take the cuffnorm output and will make a scatter plot of the FPKM values in one sample vs another.

As sample data, I've included a table of FPKMs from an RNA-seq experiment I did a few years ago (you can download it here). You can less into the file and see that it is a tab-delimited file where each row is a gene, and each column represents the FPKM of that gene across different mutant conditions.

#!/usr/bin/env python
 
import matplotlib.pyplot as plt
import pandas as pd
 
# Function to parse and plot the data
def scatterFPKMs(table,sample1,sample2):
    # Parse table
    tableFile = pd.read_table(table)
    FPKMs_Sample1 = tableFile[sample1]
    FPKMs_Sample2 = tableFile[sample2]
    plt.scatter(FPKMs_Sample1,FPKMs_Sample2)
    plt.show()
 
 
scatterFPKMs("FPKM_table.tab",'WT','sir2')
 

Looks good! Except all the interesting business seems to be going on in the lower left hand corner. Next, we'll talk about some tweaks we can make to be able to view things in a clearer way. But first...


Informative Interlude: How Matplotlib is laid out


As long as we're grabbing information from the axes, it's worth spending a few moments talking about how Matplotlib is organized. We're going to use the code from the Log plots example to make a pretty looking set of pixels:

FigureAxes.png


The window that this is being plotted in corresponds to a Figure. This is everything inside the window (but not the tools on the bottom), and when you want to save an image to the disk (so you can include it in your manuscript), this is what actually gets saved. Figures control things like the size of the image if you print it out, and the resolution (for on screen, something like 72 dpi is fine, but if you're printing, you want it to be more like 300).

A Figure can contain zero or more sets of Axes, which are the subplots. In this case, we have four. A set of Axes is usually what you'll want to be trying to modify. Axes have properties like x and y limits, a set of major (and sometimes minor) ticks for the x and y axes, an optional legend, and lots more data. Furthermore, each axis has its own plotting commands, which are very similar to the top-level commands, but lets you be certain that they're going within the same Axes. This is particularly important if you have several subplots going on, each of which could conceivably receive the plotting data.

This is one of those places where we can't spend all our time talking about every single feature available, but the inline documentation is pretty good, and the gallery of examples is also really helpful if you kinda know what you want to do.




# Let's change the function so that the axes are in log-scale
 
import matplotlib.pyplot as plt
import pandas as pd
 
 
# Function to parse and plot the data
def scatterFPKMs(table,sample1,sample2):
    # Parse table
    tableFile = pd.read_table(table)
    FPKMs_Sample1 = tableFile[sample1]
    FPKMs_Sample2 = tableFile[sample2]
    plt.scatter(FPKMs_Sample1,FPKMs_Sample2)
    # To set more parameters, we can use the "get current axes" or gca() function
     # plt.gca? to get more info
     # or frame."tab" to see all possiblities
     # For example, try frame.axes.errorbar?
    frame = plt.gca()
    frame.axes.set_yscale('log')
    frame.axes.set_xscale('log')
    # Set limits on x-axis. Because the axis is log-scale, the min must be > 0
    plt.xlim(1,1000000)
    # Set limits on y-axis
    plt.ylim(1,1000000)
    plt.show()
 
scatterFPKMs("FPKM_table.tab",'WT','sir2')
 


Plotting a histogram of all RPKMs/FPKMs


Another thing that would be good to know is the overall distribution of expression values across all genes in the genome. Let's create a function that will take a sample name from this FPKM table and plot a histogram of the FPKM values in that sample:

#!/usr/bin/env python
 
import matplotlib.pyplot as plt
import pandas as pd
 
# Function to parse and plot the data
# in this case, we are grabbing one sample
 
def histogramFPKMs(table,sample1):
    # Parse table
    table = pd.read_table(table)
    colNames=table.columns.values
    if sample1 in colNames:
        FPKMs = table[sample1]
    else:
        return 'Could not find',sample1,'!. Please try again!'
    plt.hist(FPKMs,bins=30,range=(0,1000))
    plt.show()
 
histogramFPKMs("FPKM_table.tab","WT")
 


Finally, you can even set axis ticks and you can draw partially transparent graphs. To do the first, you can again use the gca() function to set a frame variable, then use the methods below to set x-ticks. To make a graph transparent (to enable overlapping graphs, for example), you can pass in the keyword argument "alpha" to hist(), then the bars are drawn partially transparent.

import pandas as pd
import matplotlib.pyplot as plt
 
fpkms = pd.read_table("FPKM_table.tab")
wt = fpkms["WT"]
 
# Let's add 100 to each FPKM value
# to offset these values from the others
 
wt100 = wt+100
 
sir3 = fpkms["sir3"]
 
# Plot each one (notice what plt.histogram() outputs)
plt.hist(wt100,bins=range(0,1000,10),range=(0,1000),alpha=0.5)
plt.hist(sir3,bins=range(0,1000,10),range=(0,1000),alpha=0.5)
plt.show()
 
 

By the way, notice what the output of histogram is?

Alpha can be any number between 0 and 1, where 0 is fully transparent (in which case why are you even drawing it?), and 1 is fully opaque (this is the default). Most of the matplotlib functions know how to deal with this alpha property, so that can sometimes be useful when your plots start getting visually busy.

Bar Plots


Sometimes, despite the mountains of data you have, you just want to look at a simple bar plot of expression values for a single gene across multiple conditions. To make bar graphs, matplotlib has a simple function called plt.bar that can do just that. Let's say we just want to plot the expression value of a gene called "HMLALPHA2" in our FPKM_table.tab from before:

import pandas as pd
import matplotlib.pyplot as plt
 
fpkms = pd.read_table("FPKM_table.tab")
 
# To find what row "HMLALPHA2" is in
for num,row in fpkms.iterrows():
    if row["Gene"]=="HMLALPHA2":
        rowNumber = num
    else:
        pass
 
print rowNumber
 
# Get values for all columns for this row
# Excluding Gene
expressionValues = fpkms.ix[1617][1::]
 
# Convert to Numpy Array
expArray = np.array(expressionValues)
print expArray
 
# Get length of the array to pass to plt.bar
l = len(expArray)
 
# Plot
plt.bar(range(1,l+1),expArray,align="center")
plt.xticks(range(1,l+1),fpkms.columns.values[1::])
plt.show()
 

One final note: the matplotlib documentation can be immensely helpful in aiding you to make a plot that fits your science needs. Not only is there a gallery of example plots, but there are also demo scripts that give you the code for how to make these plots. These demo scripts can easily be adapted for your own plotting purposes. No need to try and remember all of these functions and methods! Plus, unless you make the same kind of plot many many times, it would be virtually impossible to remember all of the different methods and functions that this plotting library contains.

Happy plotting!


Exercises


Exercise 1: Plotting %A, %T, %G, %C, and %Ns across read position.

(a) For Exercise 1 on Tuesday, you calculated the frequency of each of the 4 nucleotides and "N" by read position for the "E_coli_reads.fastq.gz" file. This metric can tell you if there is any bias in the %GC content of sequenced reads based on position. Now that you have the frequencies, plot the % As, %Ts, %Gs, %Cs, and %Ns by read position as a line graph. Your final graph should show read position on the x-axis (1 - 50), and percent (%) on the y axis. The figure should contain five lines: one for %As, one for %Ts, one for %Cs, one for %Gs, and one for %Ns.

(b) Add a figure title to your graph and labels for the x-axis ("Read Position") and y-axis "% Nucleotide". Save the figure in the .png format. Hint: look up plt.figure and plt.xlabel in ipython.

Exercise 2: Scatter plot practice

We've learned a lot about how to make graphs and how to manipulate certain parameters to our liking. But what if we want multiple panels in our graph? Let's say we want to compare the scatter plots of more than one pair-wise comparison in the same figure. Use "plt.subplot?" to read about the subplot function and how it works.

Next, make a single figure that contains three separate scatter plots (each in different colors) that compares the FPKMs for the following samples (use the "MP_genes.fpkm_table" file below to get FPKM values):
1. MP_control_0 versus MP_2H_0
2. MP_control_0 versus MP_4H_0
3. MP_2H_0 versus MP_4H_0



Remember to set the x and y axes to log scale and make both axes in range (1,1e6) (as in example from lecture)! Add titles to each subplot (Hint: see plt.title()). Save this plot as an image file in the format of your choice. To see all the different possible formats, try plt.savefig? in ipython.

Exercise 3: Bar plots

Here is a link to a table that has FPKM values for one gene across wild type plus five mutant conditions, with three replicates for each measurement: data table. You want to make a simple bar plot of this gene's expression across all genotypes, with error bars.

(a) Google how to calculate the standard error of the mean using scipy.stats and apply the appropriate method to get the standard error of the mean (sem) for the gene's expression in each genetic condition.

(b) Plot the mean measurement of the gene across all mutant conditions as a bar plot and add error bars representing the standard error of the mean you calculated in part (a). Hint: read the documentation of plt.bar to figure out how to add error bars to your bar plot.




Solutions


Exercise 1: Plotting %A, %T, %G, %C, and %Ns across read position.

## 1 ##
## 1a ##
 
import numpy as np
from Bio import SeqIO
 
# Create an array of zeros 50 x 5
# Each row represents a position
# Each column represents: A, T, G, C, N
 
nucleotideArray = np.zeros((50,5))
 
# Read into .fastq file
ecoli_fastq = "E_coli_reads.fastq"
 
indexDict = {'A':0,'T':1,'G':2,'C':3,'N':4}
 
number_of_reads=0
 
for rec in SeqIO.parse(ecoli_fastq,"fastq"):
    read = rec.seq
    number_of_reads+=1
    for pos,base in enumerate(read):
        nucleotideArray[pos,indexDict[base]]+=1
 
# Divide this entire array by number_of_reads
# and multiply by 100 to get percentage
 
percentArray = (nucleotideArray/float(1e6))*100
labelNames = []
# Now plot percent for each base
# by slicing down each column
for base in indexDict:
    column = indexDict[base]
    label = "% "+base
    x = range(50)
    plt.plot(x,percentArray[:,column],'-o',linewidth=3)
    labelNames.append(label)
 
plt.legend(labelNames)
plt.show()
 
## 1b ##
 
# Replace the code above
# Starting at percentArray
# with the following:
 
# Initialize a new figure and
# Give it a title by passing a string
# to plt.figure()
 
plt.figure("Percent Nucleotide By Read Position")
 
percentArray = (nucleotideArray/float(1e6))*100
labelNames = []
# Now plot percent for each base
# by slicing down each column
for base in indexDict:
    column = indexDict[base]
    label = "% "+base
    x = range(50)
    plt.plot(x,percentArray[:,column],'-o',linewidth=3)
    labelNames.append(label)
 
# Set x & y-axis labels
plt.xlabel("Read Position")
plt.ylabel("% Nucleotide")
plt.legend(labelNames)
 
# Save the figure
plt.savefig("Ex10-1b.png")

Exercise 2: Scatter plot practice

## 2 ##
import pandas as pd
import matplotlib.pyplot as plt
 
# Read into table to get FPKMs
fpkmTable = pd.read_table("MP_genes.fpkm_table")
 
comparisons = [("MP_control_0","MP_2H_0"),("MP_control_0","MP_4H_0"),("MP_2H_0","MP_4H_0")]
colorList = ['purple','green','yellow']
 
# Iterate over comparisons and plot
for ind,pair in enumerate(comparisons):
    cond1_fpkm = fpkmTable[pair[0]]
    cond2_fpkm = fpkmTable[pair[1]]
    # Use the .add_subplot method on the fig object to add a subplot.
    # This returns an axes instance just like the above plt.subplot(n,n,n)
    plt.subplot(1,3,ind+1)
    plt.scatter(cond1_fpkm,cond2_fpkm,color=colorList[ind],edgecolor="none")
    plt.ylim(1,1000000)
    plt.xlim(1,1000000)
     # Use ax.set_title to add a title to the current axes instance
    plt.title(pair[0]+' vs '+pair[1])
    plt.yscale('log')
    plt.xscale('log')
 
plt.show()

Exercise 3: Bar plots

## 3 ##
## 3a ##
 
import pandas as pd
import scipy.stats as sp
import matplotlib.pyplot as plt
 
# Read data table
dataTable = pd.read_table("dataTable.tab")
 
# Turn data into float array
# to make calculations easier
 
data = np.zeros((6,3))
 
for idx,item in enumerate(dataTable.columns.values):
    if item=="Genotype":
        pass
    else:
        dataI = idx-1
        values = dataTable[item]
        data[:,dataI]+=values
 
# Calculate Standard Error of the Mean
# across rows, or axis = 1
 
semArray = sp.sem(data,1)
print semArray
 
 
## 3b ##
 
# Calculate the mean across rows
avgArray = np.mean(data,1)
 
# plot
plt.bar(range(1,7),avgArray,align="center",yerr=semArray)
plt.show()