Beyond learning to code: Maintaining and Writing programs


Introduction

This week, we've shown you a pretty large fraction of the core Python language. With enough patience, you could read through most of the Python documentation on your own and write code to do whatever you want. However, just as there's more to being a scientist than learning how to pipette (important a skill though that may be), there's more to writing software than learning the syntax of a language. This afternoon, I'll introduce you to a couple of important skills that will serve you no matter what language you ultimately decide to program in.

The Project


Historically, the second week of the course has been dedicated to a specific project. Not only does this allow us to naturally talk about more science-specialized aspects of programming, but it also gives people in the class an opportunity to see how larger programs are structured. In years past, we've tried both coming up with our own analyses of published data and attempting to replicate the results of a moderately computation-heavy experimental paper. Both of these have ended up being a lot of work on the instructors' part for not a whole lot of payoff: mostly uninteresting results, and when replicating someone else's paper, a lot of inconsistencies with no clear origin. So this year we're trying something new again:

Next week, we'll be going over some RNA-seq data that we've collected specifically for the course. These are experiments that one of the previous instructors (Mike) wanted to do anyways. We'd like to stress that this is pre-publication data, so we'd appreciate you not sharing the data beyond this course. On the other hand, it's worth noting that a lot of the analysis we'll be going through next week is exactly the analysis we wanted to do on the data. You now know enough Python to do real science (although we'll be using some more modules that make things easier).

Now, here's the project: In bacterial transcription, there are two major ways that transcripts are terminated. In the first, intrinsic termination, a hairpin forms in the elongating RNA and destabilizes the elongation complex. These hairpins can be located using RNA secondary structure predictors. The second major mechanism for termination is factor-dependent. Approximately half of the factor-dependent termination sites depend on the protein Rho. Rho is a hexameric ATPase that binds to the elongating RNA and disrupts translocation of the elongation complex. Rho binds to a pyrimidine-rich (C/T/U) region, but no binding motif has been identified.

Some genomics work has been done on E. coli treated with bicyclomycin (BCM), which inhibits Rho. In particular, there are expression microarray and ChIP-chip studies, but each of these has distinct flaws that limit our ability to draw the conclusions we'd like. The microarray study was performed using a pre-designed Affymetrix chip that is focused on transcription of the genes themselves, rather than the UTRs. While there are some conclusions to be drawn from this, it misses the most interesting part of the effect of Rho inhibition. The ChIP-chip study attempts to identify Rho-dependent genes, but due to the lack of a good antibody for Rho, it instead looks at RNAP binding and uses that as a proxy for Rho binding. Furthermore, microarray studies in general have problems with linearity: a spot that is twice as bright on the array does not necessarily mean that there's twice as much RNA.

What we decided to do was the simplest thing that could possibly work: Look at the transcriptome in three different concentrations of BCM, at two different time points after treatment. Then, we do RNAseq on each of those samples.

Homework


First, it's critically important that you understand what RNAseq is. If you're not familiar with Illumina high throughput sequencing, you might try reading this page from OHSU (Oregon Health & Science University): http://www.ohsu.edu/xd/research/research-cores/mpssr/project-design/mpssr_sequencing_technology.cfm . The technology we used is the Illumina HiSeq, which is fundamentally similar to the GA IIx, but with about 10 times more reads per lane. For more on RNAseq in particular, this review is pretty straightforward: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2949280/

Source code control

Good record-keeping is of the utmost importance in science, and it turns out to be really, really helpful in programming too. Sometimes, the "improvements" you make to a large piece of software actually break something that you weren't thinking about, so it's nice to have a record of what you did when, and easily be able to go back to previous versions. Alternatively, you could have multiple different versions of the same piece of software floating around lab, and if you're not careful, it's easy to make one set of modifications to one version, and a different set of modifications to another version.

This problem isn't specific to scientists, and the software engineering community has come up with a number of different software tools to help keep track of the changes that get made to source code. These are called Version Control Systems, and today I'll be showing you a brief overview of one, called Git.

git init

The first thing you'll want to do is initialize a new repository. A repository is just the term for a collection of files that Git will keep track of. From the command line, the way to do this is straightforward.
$ git init
Initialized empty Git repository in /Users/pcombs/Documents/PythonCourse2011/.git/
What it's saying is that it's created a directory called ".git" inside of the PythonCourse2011 directory. By convention, Unix-derived operating systems (like Linux and Mac OS X) hide things that begin with a . by default, though it is possible to get them to show up using ls -a instead of just ls. For the most part, though, you won't need to muck about in .git directly anyways, so don't worry too much about it.

Also, in the event that you already had a git repository set up in the current folder, doing git init won't overwrite it; it just "reinitializes". I haven't figured out what this means, exactly, aside from a slightly different message showing up. But basically, you don't need to worry if you think you might already have a repository there: you can init one anyways and it won't break anything. In fact, all but a very few git commands are safe, and won't destroy data.

git add

So now that we have our shiny new repository, what do we do with it? Git will only keep track of things that we tell it to track, and the way we do that is by using git add. I'm first going to make a really simple file, and then git add it.

hello.py
print "Hello, world!"

$ git add hello.py


git commit

At this point, we're almost tracking hello.py, but not quite! Git uses a two-step process: first, you add files to a staging area (called the index); then, once you've added the files you want to the index, you commit them to the repository. A commit (computer scientists aren't the best at grammar, so what used to be a verb is now a noun) is a snapshot of your code at a particular time. Each commit has an associated message. Traditionally, the first line is a brief, one-line summary of the changes you've made, followed by a blank line and then as long or as short an explanation of the changes as you'd like. This is like your lab notebook, so be as verbose as you need to explain why you did what you did.

$ git commit -m "Beginning of project"

So now let's make some more changes to our code:
hello.py
print "Hello, world!"
x = 3
print x

status and diff

Now let's say we made those changes last night right before going home, and we don't remember if we added them to the index and/or committed them. There are a couple of commands you can use to check on them:

$ git status
# On branch master
# Changes not staged for commit:
#   (use "git add <file>..." to update what will be committed)
#   (use "git checkout -- <file>..." to discard changes in working directory)
#
#    modified:   hello.py
#
no changes added to commit (use "git add" and/or "git commit -a")

In this case, we see that we've modified hello.py, but we haven't added it to the index. If we want to find out what changes we've made, exactly, we can do:
$ git diff
diff --git a/hello.py b/hello.py
index 4351743..2aae829 100644
--- a/hello.py
+++ b/hello.py
@@ -1 +1,3 @@
 print "Hello, world!"
+x = 3
+print x
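
The output above is in what's called unified diff format: lines starting with + were added, lines starting with - were removed, and lines starting with a space are unchanged context. As a rough illustration, the same hunk can be reproduced from the two versions of hello.py with Python's standard difflib module. This is not Git's own diff code, and it doesn't produce the "diff --git" and "index" header lines.

```python
import difflib

# The two versions of hello.py from the example above, as lists of lines.
old_lines = ['print "Hello, world!"\n']
new_lines = ['print "Hello, world!"\n', 'x = 3\n', 'print x\n']

# unified_diff yields the two file header lines, then the lines of each hunk.
diff_lines = list(difflib.unified_diff(
    old_lines, new_lines,
    fromfile='a/hello.py', tofile='b/hello.py'))

print(''.join(diff_lines))
```

The "@@ -1 +1,3 @@" line says that the hunk covers line 1 of the old file and lines 1 through 3 of the new one.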

Now by default, git diff will tell you the difference between what is in the working directory and what is in the index. That is, it's the changes that we could *add*. If you want to find out what changes we've already added but not yet committed, you can give it the --staged flag, so:

$ git add hello.py
$ git diff
(nothing gets displayed, so there aren't any more changes we can add)
$ git diff --staged
diff --git a/hello.py b/hello.py
index 4351743..2aae829 100644
--- a/hello.py
+++ b/hello.py
@@ -1 +1,3 @@
 print "Hello, world!"
+x = 3
+print x

$ git commit -m "Second modification"
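
If it helps to keep the three places straight (working directory, index, repository), here is a toy model of them in Python. It is only a sketch: the ToyRepo class and its method names are made up for this example, and real Git stores its data very differently. The print() calls take a single argument so that they behave the same under Python 2 and Python 3.

```python
class ToyRepo(object):
    """A made-up, much simplified stand-in for a Git repository."""

    def __init__(self):
        self.working = {}   # filename -> contents in the working directory
        self.index = {}     # the staging area: what the next commit will hold
        self.head = {}      # the snapshot stored by the most recent commit
        self.log = []       # commit messages, oldest first

    def add(self, filename):
        # Like "git add": copy the file's current contents into the index.
        self.index[filename] = self.working[filename]

    def commit(self, message):
        # Like "git commit": snapshot the index (not the working directory).
        self.head = dict(self.index)
        self.log.append(message)

    def unstaged(self):
        # Files plain "git diff" would report: working copy differs from index.
        return sorted(name for name in self.working
                      if self.working[name] != self.index.get(name))

    def staged(self):
        # Files "git diff --staged" would report: index differs from last commit.
        return sorted(name for name in self.index
                      if self.index[name] != self.head.get(name))


repo = ToyRepo()
repo.working['hello.py'] = 'print "Hello, world!"\n'
repo.add('hello.py')
repo.commit('Beginning of project')

# Edit the file, but don't add it yet.
repo.working['hello.py'] += 'x = 3\nprint x\n'
before_add = (repo.unstaged(), repo.staged())

# Add it: the change moves from "could be added" to "staged".
repo.add('hello.py')
after_add = (repo.unstaged(), repo.staged())

repo.commit('Second modification')
after_commit = (repo.unstaged(), repo.staged())

print(before_add)
print(after_add)
print(after_commit)
```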

Branches


So now let's say your lab mate (let's call him "Aaron") comes up to you after you show off the results of your program in lab meeting and says, "That program's really cool, but to use it for my project, I'd want to print out 3 squared instead of 3." Now, your project relies on plain 3, so you'd need to either
  1. Print out both 3 and 3 squared and rely on the user to figure out which one to use. That might work in this case, but maybe Aaron asked for modifications that aren't compatible with that approach.
  2. Copy the whole folder full of code elsewhere, and then make the change there. The problem with that approach is that if you discover a bug in the original program, you have to fix it in both places, which won't necessarily be trivial or obvious, and then you're never quite sure whether you've actually made the fix in both places, and ...
  3. Make a new branch of the repository. The code is allowed to diverge, but by storing the two branches in the same repository, you can keep track of the changes, and merge the changes from one to the other.

$ git branch xsquared
$ git checkout xsquared

hello.py
print "Hello, world!"
x = 3
print x**2

$ git add hello.py
$ git commit -m "Prints x^2 instead of x"

Now we have both branches of code running in parallel to each other, and we can make changes in one without affecting the other. If you're ever not sure what branch you're on, you can do:

$ git branch

Note the lack of a name for the branch. With no arguments, git branch lists every branch and marks the current one with an asterisk:
  master
* xsquared
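
One way to picture what just happened: a branch is only a name attached to a particular commit, and committing moves the current branch's name forward. The sketch below models that with two dictionaries. The commit ids 'c1', 'c2', 'c3' and the list_branches function are made up for illustration; this is not how you would talk to Git from Python.

```python
# Toy model: each commit remembers its parent, and a branch is only a
# name attached to one commit.
parents = {'c1': None, 'c2': 'c1'}   # our first two commits on master
branches = {'master': 'c2'}
current = 'master'

# git branch xsquared: a new name pointing at the same commit
branches['xsquared'] = branches[current]

# git checkout xsquared: change which branch we're on
current = 'xsquared'

# git commit: the new commit's parent is the old tip, and only the
# current branch moves forward
parents['c3'] = branches[current]
branches[current] = 'c3'


def list_branches(branches, current):
    # Mimic the layout of "git branch": an asterisk marks the current branch.
    lines = []
    for name in sorted(branches):
        marker = '* ' if name == current else '  '
        lines.append(marker + name)
    return lines


print('\n'.join(list_branches(branches, current)))
```

Notice that master still points at 'c2': nothing that happens on xsquared touches it.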

Merging


As we work some more, we realize perhaps that something is wrong. Our program isn't nearly excited enough. That's an easy change, though:

$ git checkout master
hello.py -- on branch master
print "Hello, world!!!!"
x = 3
print x
$ git add hello.py
$ git commit -m "Getting excited"

We are really excited and want to make this change apply to both branches, though, so it would be nice to have some way to merge the changes into the xsquared branch.
$ git checkout xsquared # First, we switch back over to xsquared
$ git merge master # We say what branch we want to merge the changes from.
Auto-merging hello.py
Merge made by recursive.
 hello.py |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

Now, when we take a look at the code, we see that the program has automatically done the Right Thing™, and made the changes it was supposed to.

hello.py -- on branch xsquared
print "Hello, world!!!!"
x = 3
print x**2

Advanced Merging


Sometimes, though, it's not possible for git to know what changes to make, and sometimes it does guess wrong. Let's work through an example where that happens.

Let's say that in two different branches, we make a change to the same line of code:

$ git checkout master
hello.py -- on branch master
print "Hello, world!!!! Let's print x"
x = 3
print x
$ git add hello.py
$ git commit -m "More descriptive message on master"

$ git checkout xsquared
hello.py -- on branch xsquared
print "Hello, world!!!! Let's print x**2."
x = 3
print x**2

print "Goodbye, cruel world..."

Now we add this in with two separate commits, one for the introductory message, and one for the sign-off message:

$ git add -p

The -p flag lets us do things piecewise. At each "Stage this hunk?" prompt, the program lets us choose among several options; typing ? prints the list.

diff --git a/hello.py b/hello.py
index e9692b1..57e3f45 100644
--- a/hello.py
+++ b/hello.py
@@ -1,3 +1,6 @@
-print "Hello, world!!!!"
+print "Hello, world!!!! Let's print x**2."
 x = 3
 print x**2
+
+print "Goodbye, cruel world..."
+
Stage this hunk [y,n,q,a,d,/,s,e,?]? ?
 
 
 
y - stage this hunk
n - do not stage this hunk
q - quit; do not stage this hunk nor any of the remaining ones
a - stage this hunk and all later hunks in the file
d - do not stage this hunk nor any of the later hunks in the file
g - select a hunk to go to
/ - search for a hunk matching the given regex
j - leave this hunk undecided, see next undecided hunk
J - leave this hunk undecided, see next hunk
k - leave this hunk undecided, see previous undecided hunk
K - leave this hunk undecided, see previous hunk
s - split the current hunk into smaller hunks
e - manually edit the current hunk
? - print help
@@ -1,3 +1,6 @@
-print "Hello, world!!!!"
+print "Hello, world!!!! Let's print x**2."
 x = 3
 print x**2
+
+print "Goodbye, cruel world..."
+
Stage this hunk [y,n,q,a,d,/,s,e,?]? s
 
 
Split into 2 hunks.
@@ -1,3 +1,3 @@
-print "Hello, world!!!!"
+print "Hello, world!!!! Let's print x**2."
 x = 3
 print x**2
Stage this hunk [y,n,q,a,d,/,j,J,g,e,?]? y
 
 
 
@@ -2,2 +2,5 @@
 x = 3
 print x**2
+
+print "Goodbye, cruel world..."
+
Stage this hunk [y,n,q,a,d,/,K,g,e,?]? n
$ git commit -m "More descriptive intro message on xsquared"
$ git diff
diff --git a/hello.py b/hello.py
index e5df48d..57e3f45 100644
--- a/hello.py
+++ b/hello.py
@@ -1,3 +1,6 @@
 print "Hello, world!!!! Let's print x**2."
 x = 3
 print x**2
+
+print "Goodbye, cruel world..."
+
$ git add hello.py
$ git commit -m "Added sign-off message"

So now we have some code (the sign-off message) from xsquared that we want to merge back into the master branch.

$ git checkout master
$ git merge xsquared
Auto-merging hello.py
CONFLICT (content): Merge conflict in hello.py
Automatic merge failed; fix conflicts and then commit the result.

So let's take a look at the difference between the code now and our last commit:
$ git diff
diff --cc hello.py
index 0a78149,57e3f45..0000000
--- a/hello.py
+++ b/hello.py
@@@ -1,3 -1,6 +1,10 @@@
++<<<<<<< HEAD
 +print "Hello, world!!!! Let's print x"
++=======
+ print "Hello, world!!!! Let's print x**2."
++>>>>>>> xsquared
  x = 3
- print x
+ print x**2
+
+ print "Goodbye, cruel world..."
+

So we see a few things here:
  • The first line has two different versions. Because the same line was changed, it has no way to know what the Right Thing™ is, so it just gives us both options and makes us manually make the change.
  • It's been a little overzealous with the changes, and turned the "print x" from the master branch into "print x**2" (which was the version from the xsquared branch). This is easy to fix by hand.
  • It added in the sign-off message. That we'll just leave there.

Once we make those changes, we can add them to the index and then commit them.
$ git add hello.py
$ git commit -m "Resolved merge"
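
To see why Git behaved the way it did, it can help to think of a merge as a three-way comparison: the version on each branch against their most recent common ancestor. Below is a deliberately simplified sketch of that idea. The function toy_merge is made up for this example and only handles files with the same number of lines (so the sign-off lines are left out); Git's real merge is far more capable. It also assumes the common ancestor is the "Getting excited" version of hello.py, since that is the last point where we merged master into xsquared.

```python
def toy_merge(base, ours, theirs):
    """Merge three equal-length lists of lines, one line at a time.

    Returns the merged lines (None where there is a conflict) and the
    1-based line numbers of the conflicts.
    """
    merged = []
    conflicts = []
    for number, (b, o, t) in enumerate(zip(base, ours, theirs), 1):
        if o == t:
            merged.append(o)        # both sides agree
        elif o == b:
            merged.append(t)        # only their side changed this line
        elif t == b:
            merged.append(o)        # only our side changed this line
        else:
            merged.append(None)     # both changed it differently: conflict
            conflicts.append(number)
    return merged, conflicts


base = ['print "Hello, world!!!!"', 'x = 3', 'print x']
ours = ['print "Hello, world!!!! Let\'s print x"', 'x = 3', 'print x']
theirs = ['print "Hello, world!!!! Let\'s print x**2."', 'x = 3', 'print x**2']

merged, conflicts = toy_merge(base, ours, theirs)
print(conflicts)
print(merged)
```

Line 1 changed on both branches, so it is reported as a conflict. Line 3 changed only on xsquared (master still matches the common ancestor there), which is why the merge quietly took "print x**2".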

By the way, this style of having an "experimental" branch and a "master" branch can be a good way to go about things. That way, you always have a branch that works, but you still have a place to add in new features and whatnot.

Collaboration

Even if you're going to be the only person touching your code, some kind of version control will likely be helpful, but if you're going to be working on it with other people, it's nearly essential. Git was designed by Linus Torvalds to help with the development of Linux, which has hundreds of individual contributors. (He also named it after himself: "I'm an egotistical bastard, and I name all my projects after myself. First Linux, now git.")

Teaching you how to do this is outside the scope of this course, but Git is able to deal with it. Unlike some other Version Control Systems, Git is distributed, meaning that there is no central copy that everyone agrees on. Each copy of a repository is just as valid as any other, and they can be merged at will. If you do find yourself collaborating with someone else (and maybe even if you don't), I'd encourage you to look at GitHub, a Git-based code server. In the free level, all your repositories are openly displayed (though only you can modify them, unless you give other users permission), but there are also relatively cheap options for having closed-source repositories, if you're concerned about getting scooped on something. It's also possible to set up a Git server on a central lab server, but setting that up is way outside the scope of this course.

There's also a really nice visual guide to what lots of the most common git commands do: http://marklodato.github.com/visual-git-guide/index-en.html I'd encourage you to check it out if you ever get confused by git (don't worry, sometimes it happens to me too!)

Stubbing and the 'pass' statement


When we write complicated code, we need to decompose it into simpler parts. This is an intuitive concept, and one that we've touched on before. Stubbing is writing what your program should be doing, without actually getting around to filling in the details. It's like writing an outline of a paper.

Let's say that we want to make a program that gambles online and makes money for you so that you are free to pursue the standard academic career path of postdoctoral positions ad infinitum. I would start stubbing such a program with some big picture ideas: the program has to log on to a gambling server, keep track of your balance, and play until you win or go home.

gambling.py

#! /usr/bin/env python
 
import sys
 
import internet
import games
 
account_name = sys.argv[1]
password = sys.argv[2]
 
 
balance, session_info = internet.logOnToIllegalGamblingServer(account_name,password)
 
while balance > 0:
 
    balance = games.playGame('slots',balance,session_info)
 
    if balance > 1000000000:
 
        print 'Congratulations you are a billionaire!'
        internet.logOffFromIllegalGamblingServer()
        sys.exit()
 
print 'Darn'
internet.logOffFromIllegalGamblingServer()
 
 




internet and games are not built into Python: we'll have to write the functions in them ourselves eventually. However, now we have a better idea of what these functions should look like, and what they need to do. For now, let's use a new concept to fill these functions in and allow us to run and test this code: the statement pass. It's pretty simple: it does nothing. Although this sounds somewhat pointless, in this case it allows you to write little function stubs without Python (or your text editor) complaining. (Keep in mind that a function whose body is only pass returns None, so the stubs will need placeholder return values before gambling.py can run all the way through.) However, it pops up in other places as well, usually as a shortcut where you mean to write more code later. This could be in an if or else block, or in an except block while handling an exception. Each of those cases requires something after the colon for it to be valid Python, and pass is a valid way to put in something that does nothing. We won't cover those applications here, but keep them in mind while you're programming: we encourage you to try it out.
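
Here is a small sketch of pass in a few of those places. The names log_on_stub, some_account, and some_password are made up for the example, and the print() calls take a single argument so they behave the same under Python 2 and Python 3.

```python
# A stub: the body is only pass, so calling it does nothing and returns None.
def log_on_stub(account_name, password):
    pass


result = log_on_stub('some_account', 'some_password')
print(result is None)

# Because the stub returns None, unpacking two values from it fails.
try:
    balance, session_info = log_on_stub('some_account', 'some_password')
    unpack_worked = True
except TypeError:
    unpack_worked = False
print(unpack_worked)

# pass also works as the body of an except block (or an if or else block).
total = 0
for text in ['3', 'seven', '11']:
    try:
        number = int(text)
    except ValueError:
        pass    # not a number: deliberately skip it
    else:
        total = total + number
print(total)
```

One way around the unpacking problem is to have a stub temporarily return placeholder values of the right shape, such as return 0, None, until the real code is written.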

internet.py

#! /usr/bin/env python
 
 
##input: two strings, account_name and password
##output: the balance and session info for the associated account
def logOnToIllegalGamblingServer(account_name,password):
    pass
 
##logs the user off from the server
def logOffFromIllegalGamblingServer():
    pass
 



games.py

#! /usr/bin/env python
 
 
##input: a string game_name, an int balance, and the session information
##output: the new balance after playing a round of a game
def playGame(game_name,balance,session_info):
 
    pass
 


Now that we have the basic logical flow of our program, let's expand our stubbing in games to help us plan our slot machine game.

#! /usr/bin/env python
 
 
##input: a string game_name, an int balance, and the session information
##output: the new balance after playing a round of a game
def playGame(game_name,balance,session_info):
 
    if game_name == 'slots':
 
        balance = playSlots(balance,session_info)
 
 
    else:
        print "Game %s not found, please pick another" % game_name
 
 
    return balance
 
##input: an int balance and the session info
##output: the new balance after playing a round of slots
def playSlots(balance,session_info):
 
    current_bet = getBet(balance)
    balance = balance - current_bet
 
    current_result = pullLever()
 
    winnings = calcWinnings(current_result,current_bet)
 
    balance = balance + winnings
 
    return balance
 
##input: an int balance
##output: a valid bet given the current balance
##prompts the user to make a bet
def getBet(balance):
    pass
 
##output: a list of the results of a single slot machine game
def pullLever():
    pass
 
##input: a list of results from a slot machine game, and an int bet
##output: an int of the total winnings (if any) earned from the game
def calcWinnings(current_result,current_bet):
    pass
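
Once the outline is in place, the stubs can be filled in one at a time and tested on their own. As one possible next step, here is a sketch of pullLever and calcWinnings. The symbols on the reels, the number of reels, and the payout rule are all assumptions invented for this example; the comments that describe each stub above are the only real specification.

```python
import random

# Assumed for this sketch: three reels, each showing one of these symbols.
SYMBOLS = ['cherry', 'lemon', 'bell', 'seven']
NUM_REELS = 3


##output: a list of the results of a single slot machine game
def pullLever():
    return [random.choice(SYMBOLS) for reel in range(NUM_REELS)]


##input: a list of results from a slot machine game, and an int bet
##output: an int of the total winnings (if any) earned from the game
def calcWinnings(current_result, current_bet):
    # Assumed payout rule: every reel matching pays 10 times the bet,
    # and anything else pays nothing.
    if len(set(current_result)) == 1:
        return 10 * current_bet
    return 0


spin = pullLever()
print(spin)
print(calcWinnings(['bell', 'bell', 'bell'], 5))
print(calcWinnings(['bell', 'lemon', 'bell'], 5))
```

Because calcWinnings takes the spin as an argument instead of calling pullLever itself, it can be checked with hand-picked results, with no randomness involved.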