Interactive Python: Running code as we go along


So that's it. You're done. We've taught you all the core programming concepts that you'll need for the next week and a half, as well as for most of your life as a person who programs. Every language worth using will have some version of integers, floats, strings, lists, if/else, loops of some kind, and functions.


...But we aren't letting you loose on the world quite yet. Like Harry Potter, you all have gained a 'magical' ability - though yours is more about reading and writing code and less about broomsticks and fighting He-Who-Must-Not-Be-Named. However, Harry had to spend 6 years refining his magical skills - you only have a bit more than a week left! In the time we have left, we'll show you a whole bunch of modules and functions that make the jobs you need to get done faster and easier.

By now, you've been writing code for 3 days, and it should be pretty apparent that it's easy to make Python crash. Whether through syntax errors (which make Python give up before it even runs your code), index or key errors, or a host of other errors you'll run into, there's just a lot that can go wrong. Oddly enough, having your program crash is actually the good kind of error! Far more insidious is when your code happily runs to completion but does something subtly different from what you meant.
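To make that concrete, here's a tiny made-up example of the insidious kind (not from any lesson code): a line that runs without any error but quietly gives a wrong answer, because sorting strings compares them character by character rather than numerically.

```python
# A list of read lengths that got stored as strings:
read_lengths = ['9', '10', '2']

# Runs fine, no crash -- but '10' sorts before '2', which sorts before '9'
wrong = sorted(read_lengths)
print(wrong)   # ['10', '2', '9']

# Telling sorted() to compare the values as integers fixes it
right = sorted(read_lengths, key=int)
print(right)   # ['2', '9', '10']
```

No error message will ever point you at the first call; only a test (or a suspicious eyeball) will catch it.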


A program ("fill the cauldron") gone wrong!



Tomorrow morning, Mel is going to go through a couple "Strategic Initiatives" to help us get bugs out of our code and keep it bug-free, but one of them bears repeating now: "Test Early, Test Often".


One of the ways that we can fight against this is to have tests for our code. (There's even a style of programming, called Test Driven Development, where you write the test first, then write the code to pass the test.) Writing tests for your code is hard, but it's also a worthwhile skill to have.

Just like in bench science, it's always a good idea to have both positive and negative controls.
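In testing terms, a positive control is an input you know should pass, and a negative control is one you know should fail. A minimal sketch, using a made-up is_dna checker rather than anything from this lesson:

```python
def is_dna(seq):
    ##a hypothetical example: True if seq is non-empty and contains only A, C, G, or T
    return len(seq) > 0 and all(base in 'ACGT' for base in seq)

# positive control: an input that should succeed
assert is_dna('GATTACA') == True

# negative controls: inputs that should fail
assert is_dna('GATTAXA') == False   # invalid character
assert is_dna('') == False          # empty sequence
```

If the positive control fails, your code is broken; if a negative control passes, your code is too permissive, which is often the harder bug to notice.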

One of the ways to make your tests as compact and easy to understand as possible is to have the functions you're testing also be small and logically compact. In a highly contrived example, let's imagine a function that checks whether a list has the proper number of entries for a GTF file (9), and if it does, finds something that looks like an Ensembl gene identifier and returns it.

##input: a list of strings gtf_list (from a line of a GTF file)
##output: the string ensemble id from gtf_list
def get_ensemble_id(gtf_list):
 
    if len(gtf_list) == 9:
 
        ##ids should all start with ENS
        ensID_start = gtf_list[8].find('ENS')
        ##the id ends with a ", so find the next " after the start
        ensID_end = gtf_list[8][ensID_start:].find('"') + ensID_start
        return gtf_list[8][ensID_start:ensID_end]



Notice the header above the function - I have specified the type and name of each input to the function get_ensemble_id (in this case, there is only one input, the list of strings gtf_list). I've also described the type of output from this function - a string that is the Ensembl identifier for the line. This input/output header is by no means required. However, I find it useful when looking through old code, as in one glance you know what to feed a function and what it spits out.
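Incidentally, Python has a built-in convention for the same idea: a docstring, a string placed right under the def line, which help() (and, as we'll see later, IPython's ?) will display. Here's the same function with its header moved into a docstring; the body is unchanged from above:

```python
def get_ensemble_id(gtf_list):
    """input: a list of strings gtf_list (from a line of a GTF file)
    output: the string ensemble id from gtf_list"""
    if len(gtf_list) == 9:
        ##ids should all start with ENS
        ensID_start = gtf_list[8].find('ENS')
        ##the id ends with a ", so find the next " after the start
        ensID_end = gtf_list[8][ensID_start:].find('"') + ensID_start
        return gtf_list[8][ensID_start:ensID_end]

# the docstring is stored on the function itself; help() shows exactly this
print(get_ensemble_id.__doc__)
```

Either style works; the docstring has the advantage that help(get_ensemble_id) can find it for you later.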

Now logically, this function has two separate ideas, each of which relies on a bit of domain-specific knowledge. For instance, both the 8 and the 9 are what programmers call "magic numbers". While these aren't necessarily bad, someone else who's looking at your code might wonder why 9, and not 10, or 8. And the 8 in the next couple lines might be there because it's 1 less than 9 (and therefore the largest valid index), or it could just be a separate number that happens to be close to 9. Basically, for a 5 line function, it will be pretty confusing for the next person who looks at it - and that next person might be you in a couple of months, which is more than long enough to forget why you wrote something the way you did.

What if we compare it to this set of functions:

##input: a list of strings gtf_list (from a line of a GTF file)
##output: the ensemble id from gtf_list
def get_ensemble_id(gtf_list):
 
    if is_valid_gtf(gtf_list):
        return find_ensID(gtf_list)
 
##input: a list of strings gtf_list (from a line of a GTF file)
##output: a boolean, True if the list has 9 elements
def is_valid_gtf(gtf_list):
    return len(gtf_list) == 9
    ## GTF files have 9 fields
    ##http://uswest.ensembl.org/info/website/upload/gff.html
 
##input: a list of strings gtf_list (from a line of a GTF file)
##output: the ensemble id from gtf_list
def find_ensID(gtf_list):
 
    annotation = gtf_list[8]
    ##ids should all start with ENS
    ensID_start = annotation.find('ENS')
    ##the id ends with a ", so find the next " after the start
    ensID_end = annotation[ensID_start:].find('"') + ensID_start
    return annotation[ensID_start:ensID_end]



This has a couple of distinct advantages. First, by breaking it up, we can re-use each of the functions later. Second, we can test each function separately, and if it turns out that there's a bug in one of them, we can fix it once, and everywhere it gets reused, it'll be fixed there. Finally, as a wise man once said, there are two ways to write software: one is to write code so simple that there are obviously no bugs, and the other is to write code so complex that there are no obvious bugs. Breaking things up into sub-functions makes the former much easier than the latter. It also makes it easier to try different inputs to each function and see what results come back.
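For instance, with the pieces separated, a handful of asserts can exercise each one on its own - a quick sketch, with the two helper functions copied from above:

```python
##input: a list of strings gtf_list (from a line of a GTF file)
##output: a boolean, True if the list has 9 elements
def is_valid_gtf(gtf_list):
    return len(gtf_list) == 9

##input: a list of strings gtf_list (from a line of a GTF file)
##output: the ensemble id from gtf_list
def find_ensID(gtf_list):
    annotation = gtf_list[8]
    ensID_start = annotation.find('ENS')
    ensID_end = annotation[ensID_start:].find('"') + ensID_start
    return annotation[ensID_start:ensID_end]

# test the validator on its own, with both good and bad input
assert is_valid_gtf(list(range(9))) == True
assert is_valid_gtf([1, 2, 'cat']) == False

# test the finder with a last field we construct by hand
assert find_ensID([''] * 8 + ['gene_id "ENSGACT00000012032";']) == 'ENSGACT00000012032'
```

If the combined get_ensemble_id ever misbehaves, asserts like these tell you immediately which half to blame.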

Now, we can do this by making a separate file that tests the output to make sure it gives what we want, but another quick and dirty way to do it is to use an interactive interpreter.

The basic interpreter


In the last 3 days, has anyone tried typing just
$ python

on the command line? If you haven't, go ahead and try it now. Rather than complaining about the fact that there's no file for it to run, it prints some stuff out, and gives us a totally different command line:

Enthought Python Distribution -- www.enthought.com
Version: 7.3-1 (32-bit)

Python 2.7.3 |EPD 7.3-1 (32-bit)| (default, Apr 12 2012, 11:28:34)
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "credits", "demo" or "enthought" for more information.
>>>

The >>> is a prompt to enter commands, like in the shell, but now we can enter any Python code that we want. From now on, I'll use >>> to indicate anything typed into an interactive interpreter, as opposed to into a file. Let's start by making a variable:

>>> people = ['Aisha','Mel','James']



Nothing happened... or did it?
>>> print people
['Aisha', 'Mel', 'James']
>>> people
['Aisha', 'Mel', 'James']
 


As it turns out, we so often want to see the value of something from the interactive interpreter that we can just type the name of the variable (or many other expressions without an assignment in them), and the interpreter will print them out for us.

>>> people*3
['Aisha', 'Mel', 'James', 'Aisha', 'Mel', 'James', 'Aisha', 'Mel', 'James']
>>> people  # is left unchanged
['Aisha', 'Mel', 'James']

One key thing you might want to do from here is get help about what else you can do:

>>> help()
 
Welcome to Python 2.7!  This is the online help utility.
 
If this is your first time using Python, you should definitely check out
the tutorial on the Internet at http://docs.python.org/tutorial/.
 
Enter the name of any keyword, or topic to get help on writing Python
programs and using Python.  To quit this help utility and return to the
interpreter, just type "quit".
 
To get a list of available modules, keywords, or topics, type "keywords",
or "topics".

As the help utility has just informed us, we can type the name of any module, keyword, or topic at its prompt. Let's use it to get the low-down on the other very useful interpreter function, dir():

help> dir
Help on built-in function dir in module __builtin__:
 
dir(...)
dir([object]) -> list of strings
 
If called without an argument, return the names in the current scope.
Else, return an alphabetized list of names comprising (some of) the attributes
of the given object, and of attributes reachable from it.
If the object supplies a method named __dir__, it will be used; otherwise
the default dir() logic is used and returns:
for a module object: the module's attributes.
for a class object:  its attributes, and recursively the attributes
of its bases.
for any other object: its attributes, its class's attributes, and
recursively the attributes of its class's base classes.
(END)

This shows us that dir() can tell us about objects in our namespace - which, in non-jargon, means that dir() can tell us about things like variables and modules and functions, and which ones we have access to at whatever point we're at in our program. To get out of the help mode and back into the main interpreter, we just keep typing "q" (for quit) until we get back to our familiar >>>

help> q
 
You are now leaving help and returning to the Python interpreter.
If you want to ask for help on a particular object directly from the
interpreter, you can type "help(object)".  Executing "help('string')"
has the same effect as typing a particular string at the help> prompt.

This gives us some useful information about how better to use help in the future.

Now, what does dir() do by itself?

>>> dir()
['__builtins__', '__doc__', '__name__', '__package__', 'people']

We should recognize our list variable, "people", from before. So, what does dir() have to say about it?

>>> dir(people)
['__add__', '__class__', '__contains__', '__delattr__', '__delitem__',
'__delslice__', '__doc__', '__eq__','__ge__', '__getattribute__',
'__getitem__', '__getslice__', '__gt__', '__hash__', '__iadd__',
'__imul__', '__init__', '__iter__', '__le__', '__len__', '__lt__',
'__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__',
'__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__',
'__setslice__', '__str__', 'append', 'count', 'extend', 'index',
'insert', 'pop', 'remove', 'reverse', 'sort']

This command shows us ALL the things that belong to our list variable, including the methods that we can call to manipulate it, such as .append(), etc. It also shows us a whole bunch of internal objects that belong to our list variable (which don't belong to us, and we don't get to use them -- at least not in this class).
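If the double-underscore internals are cluttering the view, a one-line list comprehension filters them out - this is plain Python, so it works in a script just as well as at the >>> prompt:

```python
people = ['Aisha', 'Mel', 'James']

# keep only the names that don't start with an underscore
public_methods = [name for name in dir(people) if not name.startswith('_')]
print(public_methods)
```

That leaves just the familiar list methods like append, sort, and pop (the exact list varies a little between Python versions).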

So, hopefully you can see how to use this power for good: dir() can tell you all the methods and functions available to an object. For those of us that prefer to work in the interpreter, we find ourselves only rarely consulting exterior documentation. We can use dir() to find out what is available to an object, and then we can use help() to figure out what things are:

>>> help(people.append)
 
Help on built-in function append:
 
append(...)
L.append(object) -- append object to end
(END)

Which very succinctly tells us what we've already learned: people.append() will add whatever object we put in the parentheses to the end of the people list.



Using IPython, an enhanced Python interpreter



The default Python interpreter is a very basic interface. There are things that you'd like to be able to do, such as enter more than one line of code at a time, or perhaps look back through your history of commands. The enhanced IPython interpreter provides these perks and more.

You can start off by launching IPython from the command line:

$ ipython

Enthought Python Distribution -- www.enthought.com
 
Python 2.7.3 |EPD 7.3-1 (32-bit)| (default, Apr 12 2012, 11:28:34)
Type "copyright", "credits" or "license" for more information.
 
IPython 0.12.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
 
In [1]:

IPython launches, tells us what version we're running, and gives us a couple of tips. Just typing a question mark and pressing enter will get us a comprehensive introduction to IPython, which I'll leave to you to peruse in your leisure time. You'll notice that the IPython prompt numbers each command you enter, starting with In [1]:. The %quickref guide gives us a succinct description of the fanciness of IPython. For starters, you have instant access to system commands like ls and cd.

In [2]: ls
cerevisiae_genome.fasta  gambling.py.bak  GTF_parse.py    histfile.py  operons.gff      operonToGFF.py.bak  yeast.nt
gambling.py             genes.ark        GTF_parse.py.bak  operons.ark   operonToGFF.py  test.out


A number of magic functions, all of which start with a percent sign %, provide functionality that the regular interpreter cannot. One you'll use regularly is %history (%hist for short), which will show you the lines of code you have previously entered in this session:

In [5]: %hist -nt
1: get_ipython().magic(u'quickref')
2: get_ipython().show_usage()
3: help
4: get_ipython().system(u'ls -F --color ')
5: get_ipython().magic(u'hist -nt')
 

The -n flag means "show line numbers", which is actually usually more of a hindrance, so the default is to do it without. The -t flag means to translate what you wrote into what IPython is feeding directly into Python. This generally applies only to the "magic" commands that you type in.

In [6]: mystr = 'alohomora'
 
In [7]: for i in mystr:
   ...:     print i
   ...:
a
l
o
h
o
m
o
r
a
 
In [8]: %hist
%quickref
?
help
ls
%hist -nt
mystr = 'alohomora'
for i in mystr:
    print i
%hist


See how the regular Python code got left alone? Now, one of the primary use cases for IPython is to try out code when you're not sure how it's going to work, and then, once you're mostly confident that what you have is pretty good, add it to a .py file that you'll run over and over. If this is the first time you're saving into a given file, IPython makes it even easier!

In [9]: %save histfile.py 6-7
The following commands were written to file `histfile.py`:
mystr = 'alohomora'
for i in mystr:
    print i


And now any time you want to run that script, you can do it from IPython:

In [10]: run histfile  # The .py is optional when running within ipython
a
l
o
h
o
m
o
r
a



One of the nice things about this approach is that all the variables from the global namespace of your script also get added to the interpreter's namespace, so you can then use a script to generate or load in some data, play around with it in IPython, then go back and update your script. It's also possible to edit pre-existing files from within IPython!


In [3]: %edit histfile


%edit tells IPython to open a text editor (default is vi) to allow you to modify code in a saved file.

# coding: utf-8
mystr = 'alohomora'
newstr = "This doesn't get printed"
for i in mystr:
    print i
 


After you finish editing, exiting the text editor will bring you back to IPython

In [3]: %edit histfile
Editing... done. Executing edited code...
a
l
o
h
o
m
o
r
a
 
In [4]: print newstr
This doesn't get printed


Note that edit assumes you want to run the code immediately afterwards. If this isn't the case, you can call edit -x.

Okay, so there are lots of nice little magic tricks in IPython, and you can use the %quickref guide to explore them more on your own. Meanwhile, lemme show you a couple of other handy tricks that don't involve magic functions.

Remember how in the basic Python interpreter we could use dir() to find out what objects belonged to a given object? Well, in IPython, all we have to do to get the same information is type the object's name and a period, then press the Tab key.

In [16]: mystr.
mystr.capitalize  mystr.isalnum     mystr.lstrip      mystr.splitlines
mystr.center      mystr.isalpha     mystr.partition   mystr.startswith
mystr.count       mystr.isdigit     mystr.replace     mystr.strip
mystr.decode      mystr.islower     mystr.rfind       mystr.swapcase
mystr.encode      mystr.isspace     mystr.rindex      mystr.title
mystr.endswith    mystr.istitle     mystr.rjust       mystr.translate
mystr.expandtabs  mystr.isupper     mystr.rpartition  mystr.upper
mystr.find        mystr.join        mystr.rsplit      mystr.zfill
mystr.format      mystr.ljust       mystr.rstrip
mystr.index       mystr.lower       mystr.split


This shows us all the things we can use or modify that belong to our variable mystr, from our for loop. If you know what letter(s) the method you want starts with, you can type them and tab-completion narrows the list:

In [16]: mystr.l
mystr.ljust   mystr.lower   mystr.lstrip
 
In [16]: mystr.l

If you give enough of the name that it's unique, you can just type that and hit tab, which will save you time.

And what if you don't know what one of those functions or methods does? Try this:
In [16]: mystr.endswith?
Type:       builtin_function_or_method
Base Class: <type 'builtin_function_or_method'>
String Form:<built-in method endswith of str object at 0x10d3780>
Namespace:  Interactive
Docstring:
S.endswith(suffix[, start[, end]]) -> bool
 
Return True if S ends with the specified suffix, False otherwise.
With optional start, test S beginning at that position.
With optional end, stop comparing S at that position.
suffix can also be a tuple of strings to try.
 
In [17]:
 


One question mark at the end of the line brings up a help dialog for the object, including the docstring for a function or method.
This is nearly the same as doing help(mystr.endswith), but with fewer keystrokes - especially if you'd otherwise have to scroll back to insert "help(" at the beginning after typing mystr.endswith.

If you call it with two question marks, and the source is available, it will display that as well! A lot of Python is written in Python, so one way to learn about things is to see how other people have written their modules.

What if there's a large block of code you want to paste in? You may have noticed, as we were typing in the for loop, that IPython automatically gave us the standard 4 spaces to indicate the block. To get around this when pasting, use the %cpaste magic command, which lets you paste in blocks of code from elsewhere without IPython trying to add indentation. To stop pasting, enter a line containing only two hyphens ('--').

In [8]: cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:##input: a list of strings gtf_list (from a line of a GTF file)
:##output: the ensemble id from gtf_list
:def get_ensemble_id(gtf_list):
:
:    if is_valid_gtf(gtf_list):
:        return find_ensID(gtf_list)
:
:##input: a list of strings gtf_list (from a line of a GTF file)
:##output: a boolean, True if the list has 9 elements
:def is_valid_gtf(gtf_list):
:    return len(gtf_list) == 9
:    ## GTF files have 9 fields
:    ##http://uswest.ensembl.org/info/website/upload/gff.html
:
:##input: a list of strings gtf_list (from a line of a GTF file)
:##output: the ensemble id from gtf_list
:def find_ensID(gtf_list):
:
:    annotation = gtf_list[8]
:    ##ids should all start with ENS
:    ensID_start = annotation.find('ENS')
:    ##the id ends with a ", so find the next " after the start
:    ensID_end = annotation[ensID_start:].find('"') + ensID_start
:    return annotation[ensID_start:ensID_end]
:--
 
In [9]: is_valid_gtf([1,2,'cat'])
Out[9]: False
 
In [10]: is_valid_gtf(range(9))
Out[10]: True



Now, let's see how well that code we wrote before stands up:

In [16]: gtfline = 'chrI\tgasAcu1_ensGene\tCDS\t8387389\t8387447\t0.000000\t-\t2\tgene_id "ENSGACT00000012032"; transcript_id "ENSGACT00000012032";'
 
In [17]: is_valid_gtf(gtfline.split())
Out[17]: False
 

Not so great for a first test - I was a bit confused at first why my positive control wasn't working. What's going on here?

Let's look at the input we're giving to is_valid_gtf:

In [20]: gtfline.split()
Out[20]:
['chrI',
 'gasAcu1_ensGene',
 'CDS',
 '8387389',
 '8387447',
 '0.000000',
 '-',
 '2',
 'gene_id',
 '"ENSGACT00000012032";',
 'transcript_id',
 '"ENSGACT00000012032";']
 


Remember that split() with no argument doesn't just split a string at the \t character - it splits on spaces and any other whitespace too. While this normally isn't a problem, the last column of our GTF line contains spaces, causing split() to break it up. We can fix this by telling split() to split only on \t characters.
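The difference is easy to see in isolation (using a shortened, made-up GTF line):

```python
line = 'chrI\tCDS\tgene_id "ENSGACT00000012032";'

# with no argument, split() breaks on ANY whitespace, including the
# space inside the last field -- four pieces instead of three
print(line.split())

# split('\t') breaks only at tabs, so the last field stays intact
print(line.split('\t'))
```

The first call returns ['chrI', 'CDS', 'gene_id', '"ENSGACT00000012032";'], while the second keeps the annotation field whole.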

In [18]: is_valid_gtf(gtfline.split('\t'))
Out[18]: True
 
In [19]: find_ensID(gtfline.split('\t'))
Out[19]: 'ENSGACT00000012032'



Much better! This is a good example of why you should test your code early and often - this was an easy bug to track down at this stage, but might have been trickier to find if we just tried to run the wrapper function get_ensemble_id.

In [22]: get_ensemble_id(gtfline.split())
 
 





Since gtfline.split() is not a valid GTF line, nothing is returned (the function falls off the end and returns None, which IPython doesn't display). This is trickier to debug - something could have gone wrong anywhere in the code we wrote, and it's hard to pin down where without further tests.


However, as we already found the bug, let's test this with the proper input:

In [23]: get_ensemble_id(gtfline.split('\t'))
Out[23]: 'ENSGACT00000012032'



Profiling


One thing that's pretty valuable, and that IPython lets us do easily, is "profiling" our code - a way to see where all the time is being spent. Now, I've said earlier that it's better to do something slowly and simply than (potentially) super-fast and complicated, but if you find that you're doing something a lot, it may be worthwhile to spend some time figuring out how to make it faster (especially if that comes at a low complexity cost).


As an example, let's look at our fasta parser:

def fastaParser(filename):
    current_gene = ""
    genes = {}
    fh = open(filename, 'r')
 
    for line in fh:
        line = line.strip()
        if line.startswith('>'):
            current_gene = line[1:]
            genes[current_gene] = ''
        else:
            join(genes, current_gene, line)
 
    return genes
 
 
def join(gene_dict, gene, line):
    gene_dict[gene] += line
 
 
yeast_genome = fastaParser('cerevisiae_genome.fasta')
 
print "There are", len(yeast_genome), "chromosomes in yeast"
 

Now, this is probably okay for short little genes, but what happens if we try to parse the yeast genome, which has lots of long sequences?


When we run this, it takes a really long time (something like 30 seconds on my computer). Now, the yeast genome isn't that big (not to knock those mighty single-celled fermenters, but 12 megabases is nothing next to the fly's 130 Mb, a human's 3 Gb, or Paris japonica, a canopy plant with a whopping 150 Gb genome), so it would be nice if we could find a faster way to run it. To do that, though, we'd need to know where to focus our efforts. From within IPython, we run the code with the '-p' flag:


In [1]: run -p yeast_genome.py
 
         55921 function calls in 139.252 seconds
 
   Ordered by: internal time
 
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   151946  134.229    0.001  134.229    0.001 cerevisiae_genome.py:17(join)
        1    3.315    3.315  139.249  139.249 cerevisiae_genome.py:1(fastaParser)
   151963    1.036    0.000    1.036    0.000 {method 'startswith' of 'str' objects}
   151963    0.669    0.000    0.669    0.000 {method 'strip' of 'str' objects}
        1    0.002    0.002  139.251  139.251 {execfile}
        1    0.000    0.000  139.249  139.249 cerevisiae_genome.py:1(<module>)
        1    0.000    0.000  139.252  139.252 interactiveshell.py:2257(safe_execfile)
        2    0.000    0.000    0.000    0.000 {open}
        1    0.000    0.000    0.000    0.000 posixpath.py:312(normpath)
 
...
 
 



So I broke up the fastaParser function in a somewhat odd way, but that's because I happened to know where it would be spending almost all of its time: the join function. (By the way, if you're profiling your own code and you suspect a basic operation that won't show up on its own (something like +), feel free to break it out into its own function.)

What's happening here is that since a string is an immutable type, whenever you + two strings together, Python has to make a brand-new one. This is generally okay for short strings, but if you're doing it a lot of times, with increasingly longer strings, it gets really slow (in fact, the total work grows with the square of the number of things you're adding together).
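To see the difference for yourself, here's a small, self-contained comparison of the two strategies using the standard timeit module. (One caveat: some Python versions special-case simple string + when nothing else holds the string, so the measured gap varies; the dictionary-based pattern in our parser doesn't get that break.)

```python
import timeit

pieces = ['ACGT'] * 5000

def concat_with_plus(parts):
    result = ''
    for p in parts:
        result = result + p  # each + builds a brand-new, longer string
    return result

def concat_with_join(parts):
    return ''.join(parts)    # one pass over the list, one final string

# both strategies produce exactly the same string...
assert concat_with_plus(pieces) == concat_with_join(pieces)

# ...but their running times scale very differently as the list grows
print(timeit.timeit(lambda: concat_with_plus(pieces), number=20))
print(timeit.timeit(lambda: concat_with_join(pieces), number=20))
```

Try bumping 5000 up by a factor of 10 and watch the first timing grow much faster than the second.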

A different way to do this would be:


def fastaParser2(filename):
    current_gene = ""
    genes = {}
    fh = open(filename, 'r')
 
    for line in fh:
        line = line.strip()
        if line.startswith('>'):
            current_gene = line[1:]
            genes[current_gene] = []
        else:
            join2(genes, current_gene, line)
 
    for gene in genes:
        genes[gene] = ''.join(genes[gene])
 
    return genes
 
 
def join2(gene_dict, gene, line):
    gene_dict[gene].append(line)
 
 
yeast_genome = fastaParser2('cerevisiae_genome.fasta')
 
print "There are", len(yeast_genome), "chromosomes in yeast"

So what have I done here? Instead of building up the string as we go along, I append each line to a list. Lists are optimized so that appending stays fast no matter how long the list gets. Then, once everything is in a list, we go through all of the keys and use the string "join" method (which is different from the join function we wrote) to combine all the items in each list, putting an empty string between them.

607884 function calls in 4.021 seconds
 
   Ordered by: internal time
 
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.732    1.732    4.019    4.019 cerevisiae_genome2.py:1(fastaParser2)
   151946    0.863    0.000    1.216    0.000 cerevisiae_genome2.py:20(join2)
   151963    0.535    0.000    0.535    0.000 {method 'startswith' of 'str' objects}
   151963    0.499    0.000    0.499    0.000 {method 'strip' of 'str' objects}
   151953    0.353    0.000    0.353    0.000 {method 'append' of 'list' objects}
       17    0.037    0.002    0.037    0.002 {method 'join' of 'str' objects}
        1    0.001    0.001    4.020    4.020 {execfile}
        1    0.000    0.000    4.019    4.019 cerevisiae_genome2.py:1(<module>)
        1    0.000    0.000    4.021    4.021 interactiveshell.py:2257(safe_execfile)
        2    0.000    0.000    0.000    0.000 {open}
        1    0.000    0.000    0.000    0.000 posixpath.py:312(normpath)
        1    0.000    0.000    0.000    0.000 syspathcontext.py:55(__enter__)
        1    0.000    0.000    0.000    0.000 posixpath.py:118(dirname)
 



So now we see that the time spent in our own join function has dropped by a factor of well over a hundred, and we spend almost no time at all on the extra string joining at the end. Plus, the code is not really that much more complicated.

A general piece of advice is not to try to optimize your code until *after* you've profiled it. Otherwise, you might spend a lot of time making something largely irrelevant twice as fast, but only improve your total speed by a very tiny amount.
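You don't need IPython to profile, either - %run -p is built on the standard library's profiling machinery, which you can drive directly from an ordinary script. A sketch, written in modern Python syntax, profiling a deliberately slow toy function (slow_concat is made up for illustration):

```python
import cProfile
import io
import pstats

def slow_concat(n):
    # deliberately quadratic: repeatedly rebuilds an ever-longer string
    s = ''
    for i in range(n):
        s = s + str(i)
    return s

profiler = cProfile.Profile()
profiler.enable()
slow_concat(20000)
profiler.disable()

# sort by time spent inside each function (like "Ordered by: internal time")
# and show the top 5 lines of the report
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('tottime').print_stats(5)
print(stream.getvalue())
```

The report has the same columns (ncalls, tottime, cumtime, and so on) as the %run -p output above.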

Exercises

Exercise 1 should be considered "required". Work on it tonight if you're unable to finish it during the time provided. Nevertheless, it may be helpful to do some of the other exercises first if you want to get a better handle on some of the things we've talked about.

1. One of the things we'll need next week, during the Project portion of the class, is a GFF file with all of the E. coli operons - or, as EcoCyc might call them, "transcription units". They use a different name because, in their minds, an operon must have more than one gene in it. Instead, we're interested in everything that's likely to appear on a single transcript, whether that's one gene or several.

In the middle of the last decade, the Arkin lab put together a set of predictions about which genes in the E. coli genome (as well as in lots of other bacterial genomes) were likely to be on the same transcript, and thus fit a rough definition of "operon". Their predictions were based on:
  • The distance between them in nucleotides
  • Whether the genes are conserved near each other in other genomes, based on MicrobesOnline Ortholog Groups
  • The correlation of their expression patterns, if gene expression data is available
  • Whether they both belong to a narrow GO category
  • Whether they share a COG functional category

While their E. coli data is online, it's in a format that is not standard and not terribly convenient. What we'd like to do here is take the tab-delimited version of the table, as well as the tab-delimited list of all genes that's also provided, and make a GFF file that has one operon per line.

A GFF is also tab delimited, and has the following fields:
  1. <seqname> This is the name of the reference sequence. If we had multiple chromosomes, it could change, but for E. coli, it will just be "Chromosome"
  2. <source> This is some information on where this annotation came from. If you're combining the results of multiple programs, it might change, but for us, it can just be "MicrobesOnline"
  3. <feature> This is the type of feature. A GFF file can have multiple kinds of features, like genes, exons, introns, tRNAs, and more all in the same file.
  4. <start> An integer from 1 to the size of the reference, inclusive. Note that, unlike Python, sequences start counting from 1.
  5. <end> An integer from <start> to the size of the reference. By convention, even if the feature is on the - strand, this coordinate will be greater than or equal to <start>. Often you'll have to swap the two by hand when your feature is on the minus strand. I'm not terribly pleased by this compromise, but it does seem to be standard, and you'll likely break other software if you go against it.
  6. <score> Often times, this is just a "." (without the quotes), meaning no score was assigned. Extra credit if you want to take the pOp column from the table and combine them in a sensible way.
  7. <strand> This will be either "+" or "-" (without the quotes). Some feature types won't have a direction, and will be ".", but really very few.
  8. <frame> This will be "." for our purposes. The most common case you'll see a number here is for exons. The first one will start at 0, but if other exons get spliced in the middle of a codon, it might be 1 or 2.
  9. [attributes] This is a semicolon separated list of whatever else you want to put in. The format is something like: Name1 "Value1"; Name2 "Value2". For our purposes, let's have one attribute, named "operonName", with the value a hyphen separated list of genes in that operon. For instance, we should have an operon named: "lacZ-lacY-lacA". (This is not the usual way of naming things, but easier, and clear enough. For bonus points, figure out how to get the canonical "lacZYA")

As for the tab separated operon calls file, here are the rules I can glean by spending some time looking at it:
  1. If two adjacent genes are in the same direction, determine whether they are part of the same operon. If they are, print TRUE in the bop column, otherwise, print FALSE. For reasons I won't speculate on, the cutoff seems to be if the probability is 0.570 or greater.
  2. If two adjacent genes are in opposite directions, then they are clearly not on the same operon, so no line is necessary.
  3. Genes on the minus strand look like they can appear in either the transcriptional order (like the trp operon) or in order along the '+' strand (like the lac operon)

For reasons that are, like the cutoff of 0.570, incomprehensible to me, the gene rlmE is referred to by its synonym rrmJ in the genome information file, and as rlmE in the operon prediction file. Feel free to edit one of them in the data file, or to have a special case at some point in your program.

It's worth bearing in mind that depending on whether you try to collate the names into the operon calls, or the operon calls into the names, this can be a lot harder or a lot easier.

The first few lines of my operons.gff file looks like this:

Chromosome    MicrobesOnline    operon    190    255    .    +    .    operonName "thrL";
Chromosome    MicrobesOnline    operon    337    5020    .    +    .    operonName "thrA-thrB-thrC";
Chromosome    MicrobesOnline    operon    5234    5530    .    +    .    operonName "yaaX";
Chromosome    MicrobesOnline    operon    5683    6459    .    -    .    operonName "yaaA";
Chromosome    MicrobesOnline    operon    6529    7959    .    -    .    operonName "yaaJ";
Chromosome    MicrobesOnline    operon    8238    9191    .    +    .    operonName "talB";
Chromosome    MicrobesOnline    operon    9306    9893    .    +    .    operonName "mog";
Chromosome    MicrobesOnline    operon    9928    10494    .    -    .    operonName "yaaH";
Chromosome    MicrobesOnline    operon    10643    11786    .    -    .    operonName "yaaW-yaaI";
Chromosome    MicrobesOnline    operon    12163    15298    .    +    .    operonName "dnaK-dnaJ";
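A quick way to sanity-check lines like these (the real file is tab-separated, even though it's shown with spaces here) is to unpack the nine fields - just a sketch:

```python
line = 'Chromosome\tMicrobesOnline\toperon\t337\t5020\t.\t+\t.\toperonName "thrA-thrB-thrC";\n'

# GFF is tab-delimited, with exactly nine columns per line
(seqname, source, feature, start, end,
 score, strand, frame, attributes) = line.rstrip("\n").split("\t")
start, end = int(start), int(end)

# Pull the operonName value back out of the attributes column
operon = attributes.split('"')[1]
print(operon)  # thrA-thrB-thrC
```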

This is a bigger problem than anything we have assigned so far - we're asking you to read in data from two different files, merge the data, and write out a file in a completely different format. I don't want to break this problem down into parts, because there are many 'right' ways and many more 'mostly right' ways to do this problem, and learning how to break down a programming problem is a valuable skill in and of itself.

However, if you are feeling stuck, here is how I broke the problem down. Again, there are many ways to approach this problem, so go with whatever makes the most sense in your mind.

a) Read in the operon information from the given file, and assign each gene to an operon. If a gene isn't in an operon with another gene, how do you want to store this?

b) Read through the gene information file from MicrobesOnline, and store the start, end, and strand of each gene.

c) Write the GFF file, complete with operon information, using the information gathered in a) and b). This can be done while stepping through the file in b), though that is a bit trickier.

2) From IPython, type %magic to read about more magic functions that we did not discuss. Usually from inside of IPython, if you type ls you will get the directory listing of the current working directory. Change this behavior so that it displays the long directory listing (i.e. like using ls -l in the terminal).



Solutions

Exercise 1 - Solution A (previous year's solution)

#! /usr/bin/env python
 
##input: a string operon_path which points to the operon file
##output: a dictionary in the form operons[geneA] = [geneA,geneB,...geneX] if geneA is in an operon
##if the gene isn't in an operon, it isn't in the dictionary
def get_operons(operon_path):
 
 
    ##operons will have one entry per gene in the operon
    genes_in_operons = {}
 
    operon_file = open(operon_path,'r')
 
    ##the first line is a header, skip it
    operon_file.readline()
 
    for line in operon_file:
 
        split_line = line.split('\t')
 
        ##skip the line if it doesn't look to be properly formatted
        if len(split_line) < 7:
            continue
 
        ##the names are located in the 5th/6th columns
        gene1 = split_line[4]
        gene2 = split_line[5]
 
        ##either TRUE or FALSE is at split_line[6], so this is True if it's TRUE
        in_operon = split_line[6] == 'TRUE'
 
 
        ##if the two genes are in an operon
        if in_operon:
 
            ##if the first gene is already in an operon
            if gene1 in genes_in_operons:
                ##we want to update all the other genes already in this operon
                ##as we are going to change this value, use a copy instead
                old_genes = genes_in_operons[gene1][:]
                ##for each gene previously in the operon (which includes gene1) add gene2 to the list of genes
                for gene in old_genes:
                    genes_in_operons[gene].append(gene2)
 
                ##set gene2 to be a copy of gene1
                genes_in_operons[gene2] = genes_in_operons[gene1][:]
 
            ##if the gene isn't previously in an operon, initialize the operon
            else:
                genes_in_operons[gene1] = [gene1,gene2]
                genes_in_operons[gene2] = [gene1,gene2]
 
    operon_file.close()

    return genes_in_operons
 
 
 
 
 
##input: a dictionary operons, with operon names as keys and a list of genes in the operon as values
##       a string gene_path that points to a file containing the genomic coordinates for each gene
##      also takes a string gff_path, to which the operon GFF file will be written
##output: a GFF file consisting of one entry per transcriptional unit, written to gff_path
def write_operon_gff(operons,gene_path,gff_path):
 
    gff_file = open(gff_path,'w')
 
    ##this string is the basis for each line
    gff_base_string = '\t'.join(['Chromosome', 'MicrobesOnline', 'operon',
                      '%d', # low end position
                      '%d', # high end position
                      '.', '%s', # strand
                      '.', 'operonName "%s";\n'])
 
    ##open the gene_file
    gene_file = open(gene_path,'r')
 
    ##get rid of the header line
    gene_file.readline()
 
 
    ##initialize a current_start,current_end,current_strand to store operon values as the loop progresses
    cur_start = 0
    cur_end = 0
    cur_strand = ''
    cur_name = ''
    cur_operon = []
 
    ##loop through each line in the gene locations file
    for line in gene_file:
 
        split_line = line.split('\t')
 
        ##if the line is malformed, skip it
        if len(split_line) < 16:
            continue
 
        ##read and store the data from each line
        next_name = split_line[8]
        next_strand = split_line[6]
        next_start = int(split_line[4])
        next_end = int(split_line[5])
 
 
        ##in GFF terms, start always comes before end, so if on the - strand flip start and end
        if next_strand == '-':
            next_start,next_end = next_end,next_start
 
        ##correcting a synonym
        if next_name == 'rrmJ':
            next_name = 'rlmE'
 
        next_operon = [next_name]
 
        ##if the next gene is in an operon, get the name
        if next_name in operons:
            next_operon = operons[next_name]
 
        ##if the current gene and the next gene are in the same operon
        if next_operon == cur_operon:
            cur_end = next_end
 
 
        else:
            ##this starts out empty (False), so won't write the first time
            if cur_name:
                ##format the gff line, then print it
                gff_line = gff_base_string % (cur_start,cur_end,cur_strand,'-'.join(cur_operon))
                gff_file.write(gff_line)
 
            ##set the current values to the next values
            cur_start = next_start
            cur_end = next_end
            cur_strand = next_strand
            cur_operon = next_operon
 
        cur_name = next_name
        cur_strand = next_strand
 
 
    ##write out the final operon (the loop only stores it; it never gets written inside the loop)
    gff_line = gff_base_string % (cur_start,cur_end,cur_strand,'-'.join(cur_operon))
    gff_file.write(gff_line)
    gff_file.close()
    gene_file.close()
 
 
write_operon_gff(get_operons('operons.ark'), 'genes.ark', 'operons.gff')
 


Exercise 1 - Solution B (Fernando's solution)

#! /usr/bin/env python
 
##input: a string operon_path which points to the operon file
##output: a list of lists containing each operon's genes
##genes that aren't part of a multi-gene operon still appear, as single-gene lists
def get_operons(operon_path):
 
 
    ##operons will have one entry per gene in the operon
    all_operons = []
 
    operon_file = open(operon_path,'r')
 
    ##the first line is a header, skip it
    operon_file.readline()
 
    for line in operon_file:
 
        split_line = line.split('\t')
 
        ##skip the line if it doesn't look to be properly formatted
        if len(split_line) < 7:
            continue
 
        ##the names are located in the 5th/6th columns
        gene1 = split_line[4]
        gene2 = split_line[5]
 
        ##either TRUE or FALSE is at split_line[6], so this is True if its TRUE
        in_operon = split_line[6] == 'TRUE'
 
        # If the list is empty
        if all_operons == []:
 
            ##if the two genes are in an operon, append them together as a list
            if in_operon:
                all_operons.append([gene1,gene2])
 
            # Else, append them as single-gene lists, separately
            else:
                all_operons.append([gene1])
                all_operons.append([gene2])
 
        # If the list is not empty:
        else:
 
            # If the two genes are in an operon
            if in_operon:
 
                # If we're continuing from a previous operon, just add the last gene to that operon
                if all_operons[-1][-1] == gene1:
                    all_operons[-1].append(gene2)
 
                # If we're starting a new operon, add the two genes together as a list
                else:
                    all_operons.append([gene1,gene2])
 
            else:
 
                # If we're continuing from a previous operon, gene1 is already in it, so only need to add gene2 as new operon
                if all_operons[-1][-1] == gene1:
                    all_operons.append([gene2])
 
                # If neither of the two genes are in the list, append them separately
                else:
                    all_operons.append([gene1])
                    all_operons.append([gene2])
 
    # Close operon file
    operon_file.close()
 
    return all_operons
 
 
# Input: name of gene name file
# Output: dictionary with gene names as keys, and list of information as values
def collect_gene_info(gene_name_file):
 
    # Iterate over gene name file
    infile = open(gene_name_file,"r")
 
    # Dictionary with gene information
    gene_dict = {}
 
    # Get rid of first line
    infile.readline()
 
    # Iterate over lines in file
    for line in infile:
 
        # Split line into list of columns
        fields = line.strip("\n").split("\t")
 
        ##if the line is malformed, skip it
        if len(fields) < 16:
            continue
 
        # Gene name is column 8 (0-indexed)
        gene_name = fields[8]
 
        ##correcting a synonym
        if gene_name == 'rrmJ':
            gene_name = 'rlmE'
 
        # Gene start is column 4
        start = fields[4]
 
        # Gene end is column 5
        end = fields[5]
 
        # Gene strand is column 6
        strand = fields[6]
 
        # Invert positions if we're in negative strand - only important if orientation of transcription matters
        if strand == "-":
            start,end = end,start
 
        # Create dictionary entry
        gene_dict[gene_name] = [int(start),int(end),strand]
 
    # Close gene file
    infile.close()
 
    # Return dictionary with gene information
    return gene_dict
 
 
# Input: list of lists containing operons, dictionary of gene information, name of output file
# Output: Operon gff file
def write_operon_gff(all_operons, gene_info_dic, gtf_output_file):
 
    # Open output file
    outfile = open(gtf_output_file,"w")
 
    # Iterate over each operon
    for operon in all_operons:
 
        # Name is formed by concatenating all genes via dash
        operon_name = "-".join(operon)
 
        # Some entries are empty, skip them - found this out via error message
        if operon[0] == '' or operon[-1] == '':
            continue
 
        # Get start of first gene
        allstart = [gene_info_dic[x][0] for x in operon]
        operon_start = str(min(allstart))
 
        # Get end of last gene
        allend = [gene_info_dic[x][1] for x in operon]
        operon_end = str(max(allend))
 
        # Strand should be the same for all genes in operon, so we just get the one from the first one
        operon_strand = gene_info_dic[operon[0]][2]
 
        # Create final line
        final_line = "\t".join(['Chromosome', 'MicrobesOnline', 'operon',operon_start,operon_end,".",operon_strand,".","operonName \""+operon_name+"\";\n"])
 
        # Write line to file
        outfile.write(final_line)
 
    outfile.close()
 
 
write_operon_gff(get_operons('operons.ark'), collect_gene_info('genes.ark'), 'operons.gff')
 


Exercise 1 - Solution C (Avi's solution)

https://github.com/flamholz/qb3_python_2015/blob/master/day4/day4_afternoon.ipynb



Exercise 2

%alias ls ls -l -a

(IPython's %alias magic takes the alias name followed by the command - no equals sign or quotes, unlike the shell's alias builtin.)