Functions and Modules


Topics:

  • Function definitions
  • Function documentation
  • Functions within functions
  • Modules
  • Modules of Interest: sys, math, collections
  • Making your own libraries

Introduction


This afternoon we'll concentrate on our last fundamental programming concept for the course. To date, we've been writing all of our program logic in the main body of our scripts, and we've seen how built-in Python functions like raw_input() operate on variables and their values. In this session, we'll learn how to write functions of our own, how to document them properly for ourselves and other users, and how to collect them into modules to build our own local libraries.

If you properly leverage a well-designed function, writing the main logic of your programs becomes almost-too-easy. Instead of writing out meticulous logical statements and loops for every task, you just call forth your previously-crafted logic, which you've vested in well-made functions.

Functions


Functions are the basic means to manage complexity in your programs, allowing you to avoid nesting and repeating large chunks of code that could otherwise make your tasks unmanageable. They allow you to bundle code with a defined input and output into single lines, and you should use them frequently from now on.

Make a new file called functions.py and copy the following code in:
#!/usr/bin/env python
 
# define the function
def hello(name):
 greeting = "Hello %s!" % (name)
 return greeting
 
# use the function
functionInput = 'Zaphod Beeblebrox'
functionOutput = hello(functionInput)
print functionOutput
 
The output is:

Hello Zaphod Beeblebrox!

To define a function, you use the keyword def. Then comes the function name, in this case hello, with parentheses containing any input arguments the function might need. In this case, we need a name to form a proper greeting, so we're giving the hello() function a variable argument called name. After that, the function does its thing, executing the indented block of code immediately below. In this case, it creates a greeting "Hello <name>!". The last thing that it does is return that greeting to the rest of the program.

Technically speaking, a function does not need to explicitly return something, although you'll rarely write one that doesn't. If you don't return something explicitly, Python will nevertheless return the special object None. None is logically false (for if statements), and printing it displays the word None (None is not the empty string). It's easy to forget to return a value, so this is a good first thing to check when your functions don't work as expected.

#!/usr/bin/env python
 
# define the function
def hello(name):
 greeting = "Hello %s!" % (name)
 ##return greeting
 
# use the function
functionInput = 'Zaphod Beeblebrox'
functionOutput = hello(functionInput)
print functionOutput

The output is:

None
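
The properties of None described above can be checked directly. This is a minimal sketch: forgot_return is a made-up name for illustration, and print is written with parentheses around a single value so that the lines behave the same under Python 2 and Python 3.

```python
# forgot_return is a hypothetical function used only for this illustration
def forgot_return(name):
    greeting = "Hello %s!" % name   # built, but never returned

result = forgot_return('Zaphod Beeblebrox')

print(result is None)    # the function handed back the special object None
print(result == '')      # None is not the same thing as the empty string

if not result:           # None counts as false in an if statement
    print("Nothing came back. Did you forget 'return'?")
```

Testing with "is None" is the usual way to detect a missing return value, since an empty string or the number 0 would also count as false in an if statement.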

Note that the variable names are different on the inside and the outside of the function: I give it functionInput, although it takes name, and it returns greeting, although that return value is fed into functionOutput. I did this on purpose, as I want to emphasize that the function only knows to expect something, which it internally refers to as name, and then to give something else back. In fact, there is some insulation against the outside world, as you can see in this example:

#!/usr/bin/env python
 
def hello(name):
 greeting = "Hello %s!" % (name)
 testVariable = """The hotel room is a mess, there's a chicken hangin'
                   out, somebody's baby is in the closet, there's a
                   tiger in the bathroom that Mike Tyson wants back, Stu
                   lost a tooth and eloped, and Doug is missing."""
 print 'Inside of the function:', testVariable
 return greeting
 
testVariable = "What happens in Vegas stays in Vegas."
grt = hello("Stu Price")
print 'Outside of the function:', testVariable

The output is:

Inside of the function: The hotel room is a mess, there's a chicken hangin'
out, somebody's baby is in the closet, there's a
tiger in the bathroom that Mike Tyson wants back, Stu
lost a tooth and eloped, and Doug is missing.
Outside of the function: What happens in Vegas stays in Vegas.

Even though the epic story of a bachelor party gone horrifically awry was assigned to a variable called testVariable inside the function, nothing happened to that variable outside the function. Variables created inside a function occupy their own namespace in memory distinct from variables outside of the function, and so reusing names between the two can be done without you having to keep track of it. (Refer to this article about namespace for more information.) That means you can use functions written by other people without having to keep track of what variables those functions are using internally. Just like a sleazy town in Nevada, what happens in the function stays in the function. (An important exception lies with lists and dictionaries, which you will examine in the exercises.)

What happens if you try to print testVariable outside of the function and you don't assign anything to it?
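
One way to explore that question without crashing your script is sketched below. It relies on try/except, which this section does not cover, and the names make_story, story, and leaked are invented for the example.

```python
def make_story():
    # 'story' is created inside the function, so it lives in the
    # function's own namespace
    story = "What happens in the function stays in the function."
    return len(story)

length = make_story()

try:
    print(story)          # no variable called 'story' exists out here
except NameError as err:
    leaked = False
    print("NameError: %s" % err)
else:
    leaked = True

print(leaked)
```

Without the try/except wrapper, the same NameError would stop the script at the print line.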

Let's have another example, returning to a more pressing subject:
#!/usr/bin/env python
 
def whichFood(balance):
    if balance < 10:
        return 'ramen'
    elif balance < 100:
        return 'good ramen'
    elif balance < 200:
        return 'better ramen'
    else:
        return 'ramen that is truly profound in its goodness'
 
print whichFood(14)
 
The output is:

good ramen

Here we've made a slightly more complicated function-- it contains some control statements, and there is more than one way for it to return. We also never explicitly create an input variable (as we did with functionInput in the first example), and we don't store the output to a variable either (as we did with functionOutput).

Here are a few more examples of the syntax used with functions:

#!/usr/bin/env python
 
# functions can do their thing without taking input or returning output
 
def useless():
    print 'What was the point of that?'
    print
 
useless()
 
def countToTen():
    for i in range(10):
        print i
 
countToTen()
print
 
print "Call function within function"
def calluseless():
    print "Let's use the function useless()"
    useless()
 
calluseless()
 
 
The output is:

What was the point of that?

0
1
2
3
4
5
6
7
8
9

Call function within function
Let's use the function useless()
What was the point of that?

Notice that what you print inside the function gets printed if you call on the function, even if you don't return anything. However, it won't print anything inside the function unless you call on the function. Finally, you can call on functions from inside functions!

We've shown examples with one input variable and one return value, but functions can accept zero, one, or multiple input variables. Likewise, functions don't need to return anything to the program, but they can also return multiple values.

Here's an example with multiple input variables and multiple output variables.
#!/usr/bin/env python
 
# functions can also take multiple items in and return multiple items out
 
def doLaundry(amtDetergent, dirtyClothes):
    cleanClothes = []
    for load in dirtyClothes:
        amtDetergent -= 1
        cleanClothes.append(load)
    return (amtDetergent, cleanClothes)
 
amtTide = 5
print "Starting amount of Tide:",amtTide
print "Let's do some laundry!"
dirtyLaundry = ['socks','shirts','pants']
(amtTide, cleanLaundry) = doLaundry(amtTide, dirtyLaundry)
print "Amount of Tide left:", amtTide
print cleanLaundry
 
#What happens if you only give this function one argument, or more than two arguments?
#What happens when you output to just one variable, rather than a tuple of two variables?
 
 
The output is:

Starting amount of Tide: 5
Let's do some laundry!
Amount of Tide left: 2
['socks', 'shirts', 'pants']

Above, in doLaundry, I returned a tuple of the two variables enclosed in parentheses. You could also return a list, which works much the same way. You could return other objects as well, like dictionaries. Below is an example where we return a list.

#!/usr/bin/env python
 
def returnStuff():
    a = '>Gene1'
    b = 'ATGGTGGG'
    return [a,b] # returns the output as a list
 
print type(returnStuff())
# We can index the output the same as any list
print returnStuff()[0]
print returnStuff()[1]
 
(name, seq) = returnStuff()
# stores output to the variables name & seq, so you can access name and seq directly
print name
print seq
 
both = returnStuff()
# stores the output to the variable both which will be a list
print both
 
dictOfStuff = {}
dictOfStuff[returnStuff()[0][1:]] = returnStuff()[1]
print dictOfStuff
 
 
The output is:

<type 'list'>
>Gene1
ATGGTGGG
>Gene1
ATGGTGGG
['>Gene1', 'ATGGTGGG']
{'Gene1': 'ATGGTGGG'}
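
Returning a dictionary works the same way as returning a list or tuple. The sketch below uses a hypothetical function return_record with made-up keys 'name' and 'seq'. The advantage is that the caller looks values up by label rather than by position.

```python
def return_record():
    # return_record and its keys are hypothetical names for this sketch
    record = {'name': 'Gene1', 'seq': 'ATGGTGGG'}
    return record   # returns the output as a dictionary

gene = return_record()
print(gene['name'])        # look up a value by its key
print(len(gene['seq']))    # the returned values behave like any others
```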


Let's take a short break!


So how do functions make our lives easier? We can exploit functions to break difficult tasks into a number of easier tasks, and then these easier tasks into ones easier still, and so on. With the details tucked away in function calls, even a large program's main logic is only tens of lines long, and many functions are only a handful of lines. This allows us to program in large, structural sweeps, rather than getting lost in the details. This makes programs both easier to write and easier to read:

##Don't copy this into a script!
 
def publishAPaper(authors,topic,journal):
 data = doWork(topic)
 figures = analyze(data)
 paper = writePaper(data,figures)
 submit(authors,paper,journal)
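
For a version of the same idea that you can actually run, here is a small sketch that breaks one task (summarizing a sequence) into smaller functions. The function names count_gc, gc_fraction, and describe are invented for this illustration and are not part of the course code.

```python
def count_gc(seq):
    """Return the number of G and C characters in seq."""
    return seq.count('G') + seq.count('C')

def gc_fraction(seq):
    """Return the fraction of seq that is G or C (0.0 for an empty sequence)."""
    if len(seq) == 0:
        return 0.0
    return float(count_gc(seq)) / len(seq)

def describe(name, seq):
    """Build a one-line report out of the smaller functions above."""
    return "%s: %d bp, GC fraction %.3f" % (name, len(seq), gc_fraction(seq))

print(describe('Gene1', 'ATGGTGGG'))
```

Each function is only a few lines long, and describe() reads almost like a sentence because the details live elsewhere.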
 

And, a big part of that ease comes with the use of:

Modules


In all of the examples above, we defined our functions right above the code that we hoped to execute. If you have many functions, you can see how this would get messy in a hurry. Furthermore, part of the benefit of functions is that you can call them multiple times within a program to execute the same operations without tiresomely writing them all out again. But wouldn't it be nice to share functions across programs, too? For example, working with genomic data means lots of time getting sequence out of FASTA files, and shuttling that sequence from program to program. Many of the programs we work with overlap to a significant degree, as they need to parse FASTA files, calculate evolutionary rates, and interface with our lab servers, for example -- all of which means that many of them share functions. And if the same function exists in two or more different programs, we hit the same problems that we hit before: complex debugging, decreased readability, and, of course, too much typing.

Modules solve these problems. In short, they're collections of functions and variables (and often objects, which we'll get to towards the end of the course) that are kept together in a single file that can be read and imported by any number of programs.

Using a module: the basics


To illustrate the basics, we'll go through the use of two modules, sys and math, one of which we use almost all the time. In fact, it's a very rare program indeed that doesn't use the sys module. sys contains a lot of really esoteric functions, but it also contains a simple, everyday thing -- what you typed on the command line.

Copy the following into testmodules.py
#!/usr/bin/env python
 
import sys # gaining access to the module
 
# you can access variables stored in the module by using a dot
# to get at the variable 'argv' which is stored in 'sys', type:
 
commandLine = sys.argv
 
print commandLine
 
 
In the terminal,
$ ./testmodules.py hi world
['./testmodules.py', 'hi', 'world']

The sys module contains a variable argv, which is a list of strings made up of what was typed on the command line, split wherever there was whitespace. We can access this list from our program by importing the module sys and referring to sys.argv.
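
Because argv is an ordinary list, you can slice it like any other. The sketch below uses a hypothetical helper describe_args and a hand-made list (with the made-up script name myscript.py) standing in for a real command line, so that it gives the same result however it is run.

```python
import sys

def describe_args(argv):
    """Split an argv-style list into the script name and the remaining arguments."""
    script = argv[0]       # the first element is the command itself
    arguments = argv[1:]   # everything typed after it
    return (script, arguments)

# A hand-made list standing in for what the command line would supply
(script, arguments) = describe_args(['myscript.py', 'hi', 'world'])
print(script)
print(arguments)

# The real thing: how many items are on this script's own command line?
print(len(sys.argv))
```

Checking len(sys.argv) before indexing into the list is a simple way to notice that the user forgot an argument.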

Above, we accessed a variable. We can also access functions stored inside modules. To demonstrate this, I'll use the module math.

#!/usr/bin/env python
 
import sys
import math
 
# sys.argv contains only strings, even if you type integers.
# And, remember, the first element is the command itself-- usually
# not very useful.
 
x = float(sys.argv[1]) # argv stores the command line arguments as
                       # strings, but python isn't especially clever,
                       # so we can't do math with strings
logX = math.log(x)
 
print logX

And to run it:

$ ./testmodules.py 3
1.09861228867

There's actually a really great module that lets you call your program really easily from the command line, without having to manually parse out what each of the arguments does. I'll show you how to use that next week.

Great! Not so hard.

Modules have more than just functions: The collections module


Modules can hold more than functions. We've already seen one example: sys.argv is a list stored in a module. Another thing that modules often contain is data types. Just as Python has some built-in data types (like int, list, str, and dict), it's also possible (although outside the scope of this course) to create full-fledged data types of your own.

One of the more useful modules of this kind is collections. It has a bunch of new data types that are, as you might guess from the name, collections of other things. There are two of them that I use with some regularity: Counter and defaultdict. Let's start with Counter, which counts things.

#!/usr/bin/env python
 
import collections
 
my_genera = ['Helicobacter', 'Escherichia', 'Lactobacillus', 'Lactobacillus', 'Oryza',
 'Wolbachia', 'Oryza', 'Rattus', 'Lactobacillus', 'Drosophila']
 
c = collections.Counter(my_genera)
print c
##Note that placing the list into Counter() immediately gets you the count.
 
d = collections.Counter()
for genus in my_genera:
    d[genus] += 1
 
print d
##Here, a Counter is initialized, but its keys do not need to be initialized before we add to their values.
 
 
In terminal,
$ ./testmodules.py
Counter({'Lactobacillus': 3, 'Oryza': 2, 'Drosophila': 1, 'Escherichia': 1, 'Rattus': 1, 'Wolbachia': 1, 'Helicobacter': 1})
Counter({'Lactobacillus': 3, 'Oryza': 2, 'Drosophila': 1, 'Escherichia': 1, 'Rattus': 1, 'Wolbachia': 1, 'Helicobacter': 1})

The collections module gives us a new data type, Counter, that counts things. It is essentially a dictionary where the key is some element we are recording and the value is the count of how often it appears. Remember that list of amino acids we got the count for in the exercises in Section 2.1? There, we created a dictionary where every key was initialized with a value of zero, and then proceeded to add one for each occurrence. Here, we can just use the Counter data type to get the count of each unique element in the list.

##This is how we did a count in a dictionary. Many more lines of code!
e = {}
 
for genus in my_genera:
    if genus not in e:
        e[genus] = 0
    e[genus] += 1
 
print "The dictionary", e
The output is:

The dictionary {'Lactobacillus': 3, 'Oryza': 2, 'Drosophila': 1, 'Escherichia': 1, 'Rattus': 1, 'Wolbachia': 1, 'Helicobacter': 1}

But using a Counter is faster to write and makes it more obvious that we are counting, as opposed to a dictionary, which could be used for almost anything. Another big advantage of the Counter type is that it makes it really easy to sort by frequency:

c = collections.Counter(my_genera)
 
print c
print c.most_common()
 
 
The output of the second print statement is:
[('Lactobacillus', 3), ('Oryza', 2), ('Drosophila', 1), ('Escherichia', 1), ('Rattus', 1), ('Wolbachia', 1), ('Helicobacter', 1)]

most_common() outputs a list of tuples, sorted in order by highest count to lowest count.
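
most_common() also accepts an optional number, in which case it returns only that many of the most frequent items. A brief sketch, reusing the my_genera list from above (the genus 'Homo' is just an example of a key that was never counted):

```python
import collections

my_genera = ['Helicobacter', 'Escherichia', 'Lactobacillus', 'Lactobacillus', 'Oryza',
             'Wolbachia', 'Oryza', 'Rattus', 'Lactobacillus', 'Drosophila']

c = collections.Counter(my_genera)

top_two = c.most_common(2)   # only the two most frequent genera
print(top_two)

print(c['Lactobacillus'])    # look up a single count, just like a dictionary
print(c['Homo'])             # a key that was never counted gives 0, not an error
```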

The other collections type I really like is the defaultdict, which is also like a dictionary, but supplies a default value for any key that we haven't seen before (with a normal dictionary, if you try to read a key that isn't in the dict, you get an error). Let's think about how we'd make a dictionary where each key is a genus, and the value is a list of species in that genus:

import collections
 
my_species_list = [('Helicobacter','pylori'), ('Escherichia','coli'),
              ('Lactobacillus', 'helveticus'), ('Lactobacillus', 'acidophilus'),
              ('Oryza', 'sativa'), ('Wolbachia', 'pipientis'), ('Oryza', 'glabberima'),
              ('Rattus', 'norvegicus'), ('Lactobacillus','casei'),
              ('Drosophila','melanogaster')]
 
##Below, we put the list into a normal dictionary, with genera as keys and species as values
d1 = {}
for genus, species in my_species_list:
    if genus not in d1:
        d1[genus] = []
    d1[genus].append(species)
 
print "normal dictionary -- ", d1
 
 
The output is:

normal dictionary --  {'Lactobacillus': ['helveticus', 'acidophilus', 'casei'], 'Oryza': ['sativa', 'glabberima'], 'Drosophila': ['melanogaster'], 'Escherichia': ['coli'], 'Rattus': ['norvegicus'], 'Wolbachia': ['pipientis'], 'Helicobacter': ['pylori']}

With a defaultdict, we can once again save the line in the for loop where we check for a non-existent key:
d2 = collections.defaultdict(list)
 
for genus, species in my_species_list:
    d2[genus].append(species)
 
print
print "default dict -- ", d2
 
The output is:

default dict --  defaultdict(<type 'list'>, {'Lactobacillus': ['helveticus', 'acidophilus', 'casei'], 'Oryza': ['sativa', 'glabberima'], 'Drosophila': ['melanogaster'], 'Escherichia': ['coli'], 'Rattus': ['norvegicus'], 'Wolbachia': ['pipientis'], 'Helicobacter': ['pylori']})

One thing to look at is the line where we actually declare the defaultdict: here we've given it another type, and if we use a key that's not in the dictionary already, it will initialize it to be an empty variable of that type. Most often, this will be a list, but you could imagine uses for other types, like a string, an integer (here "empty" actually would mean 0), or even another dict. It's possible to even have a defaultdict of defaultdicts!
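
The other default types mentioned above can be sketched as follows. The variable names and strain names are made up for illustration, and the nested case relies on lambda (a way of writing a tiny unnamed function) which this course has not covered, so treat that part as a preview rather than something to memorize.

```python
import collections

# defaultdict(int): a missing key starts at 0, so we can count with it
base_counts = collections.defaultdict(int)
for base in 'ATGGTGGG':
    base_counts[base] += 1
print(base_counts['G'])

# A defaultdict of defaultdicts: genus -> species -> list of strain names.
# Each missing genus gets a fresh defaultdict(list) of its own.
strains = collections.defaultdict(lambda: collections.defaultdict(list))
strains['Escherichia']['coli'].append('strain_1')
strains['Escherichia']['coli'].append('strain_2')
strains['Helicobacter']['pylori'].append('strain_3')
print(strains['Escherichia']['coli'])
```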

It turns out that it's easy to write our own modules too:

Making a module


Any file of python code with a .py extension can be imported as a module from your script. When you invoke an import operation from a program, all the statements in the imported module are executed immediately. The program also gains access to names assigned in the file (names can be functions, variables, classes, etc.), which can be invoked in the program using the syntax module.name. Go ahead and make your first module by pasting the following code into your text editor and saving as greeting_module.py:

#!/usr/bin/env python
 
print 'The top of the greeting_module has been read.'
 
def hello(name):
 greeting = "Hello %s!" % name
 return greeting
 
def ahoy(name):
 greeting = "Ahoy-hoy %s!" % name
 return greeting
 
x = 5
 
print 'The bottom of the greeting_module has been read.'
 

Now make a new program called test.py with the following code and include your first name as an argument in the Terminal command line when you execute it:
#!/usr/bin/env python
 
import greeting_module
 
hi = greeting_module.hello('person')
print hi
print greeting_module.x
 
# What happens if you try 'print x' here?
 
# Remember how to access argv?
 
import sys
 
print greeting_module.hello(sys.argv[1])
# This will take your Terminal argument as input for the greeting
# module's hello function
 
$./test.py Mel
The top of the greeting_module has been read.
The bottom of the greeting_module has been read.
Hello person!
5
Hello Mel!

Notice that Python runs through all of greeting_module.py first, so anything printed in greeting_module.py appears before the rest of test.py runs.
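
If you would rather a module stay quiet when it is imported, one common approach (not used elsewhere in this section, so consider it an optional sketch) is to put such statements under a check of the special variable __name__, which Python sets for every file:

```python
def hello(name):
    greeting = "Hello %s!" % name
    return greeting

# Python sets __name__ to '__main__' only when this file is the script
# being run directly. When the file is imported as a module, __name__
# holds the module's own name, so the block below is skipped.
if __name__ == '__main__':
    print(hello('person'))
```

The function definition above the check is still executed on import, so hello() remains available to the importing program.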

And that's it! See-- no more messy function declarations at the beginning of your script. Now if you need any other program to say hi to you, all you need to do is import the greeting module.

Using modules: slightly more than just 'import'


Although creating a basic module is easy, sometimes you want more than just the basics. And although using a module in the most basic manner is easy, it's best to get a more thorough picture of how modules behave.

First, what if you only want one function from a given module? Let's say, as an Alexander Graham Bell loyalist, you really only dealt in 'ahoys' rather than 'hellos.' We need to use a modified syntax for retrieving only the ahoy function from the module, without cluttering things up by loading the newfangled hello function preferred by T.A. Edison's entourage.

Comment out the code in test.py and copy in the following code:

from greeting_module import ahoy
 
hi = ahoy('everybody')
# if you grab a function from a module with a 'from' statement,
# you don't need to use the <module>.<function> syntax
print hi
 
The output is:

The top of the greeting_module has been read.
The bottom of the greeting_module has been read.
Ahoy-hoy everybody!

We see that we can now write ahoy('everybody') directly, instead of having to write greeting_module.ahoy('everybody'). And if we wanted to access both functions this way, we could import them both in one statement by changing the import line in test.py to the following:

#!/usr/bin/env python
from greeting_module import ahoy, hello

Or, what if there were a lot of functions from the greeting_module we wanted to use, but didn't want to write out the full name? Rather than writing out all of the function names to import individually (there could be a lot of them), we can use the asterisk wildcard (*) symbol to refer to them.
#!/usr/bin/env python
from greeting_module import *
 
hi = ahoy('everybody')
hi2 = hello('everybody')
 
print hi
print hi2
 
The output is:

The top of the greeting_module has been read.
The bottom of the greeting_module has been read.
Ahoy-hoy everybody!
Hello everybody!

While this may be useful if we are familiar with the contents of the module, including all of the names inside, there are a few reasons to be careful about using the from modulename import * syntax. First, if the module contains a lot of names that we don't need, we clutter our program's namespace with them, which makes it harder to tell where any given name came from. Second, and perhaps more importantly, if the module being imported contains variables with the same names as those inside your program, you will lose access to the original values of those variables.

For example, you might have a problem if yourprogram.py and yourmodule.py each define a distinct function called hello(). If instead you use the syntax import yourmodule, then you can call the function in yourprogram.py using hello() and the function in yourmodule.py using yourmodule.hello(). If you want to import a whole module, but don't want to type out its full name every time, you can use the syntax: import a_long_module_name as mname.

Finally, you can also import variables from modules and assign them new names in your program using the syntax from modulename import variablename as newvariablename.
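
The renaming forms, and the name-collision problem described above, can be sketched with the standard math module. The short names m and natural_log are arbitrary choices for this example.

```python
import math as m                       # the whole module under a shorter name
from math import log as natural_log    # one function under a new name

print(m.sqrt(16.0))
print(natural_log(1.0))

# A name collision: this 'pi' is our own (deliberately rough) variable...
pi = 3
from math import pi                    # ...and this import silently replaces it
print(pi == m.pi)                      # our value of 3 is gone
```

Python gives no warning when an import overwrites an existing name, which is why the module.name syntax is the safer default.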

Where to Store Your Modules: using PYTHONPATH


Over time, you'll end up accumulating lots of these modules, and they'll tend to fall together in meaningful collections. For example, you might have a module for all your functions related to reading and parsing files, called files_tools.py. You might have another for common sequence-related tasks, called sequence_tools.py. Python keeps its modules installed in a system directory that you may or may not have access to on a remote server. Therefore, it's useful and simpler to just create your own Python modules directory and then let your operating system environment know about it. Here, I accomplish this by placing my modules in ~/PythonCourse/pylib (~ is a shortcut for the full path to your home directory, which you can see by typing pwd in your home folder) and then adding two lines to the .bash_profile file in my home directory with the following terminal commands (replace your_name with your own user name):

$ echo 'PYTHONPATH=$PYTHONPATH:/Users/your_name/PythonCourse/pylib' >> ~/.bash_profile
$ echo 'export PYTHONPATH' >> ~/.bash_profile
$ source ~/.bash_profile

NOTE: .bash_profile vs. .bashrc: On Linux, .bash_profile is run at login, while .bashrc is run each time a new terminal is opened. Thus, if you are using Linux and the commands above don't work, try the following commands, which write to .bashrc instead. This link gives a pretty good summary of the difference between the two hidden files.

$ echo 'PYTHONPATH=$PYTHONPATH:$HOME/PythonCourse/pylib' >> ~/.bashrc
$ echo 'export PYTHONPATH' >> ~/.bashrc
$ source ~/.bashrc

And with that, any .py file that ends up in this directory can be imported as a module by Python. And though this is a good final resting place for your polished modules, you can also prototype them by simply saving them in your current working directory, and moving them over when you're happy with them.
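
To check whether Python has picked up the directory, you can look at sys.path, the list of places Python searches when it meets an import statement (directories named in PYTHONPATH are added to it at startup). The helper on_search_path below is a hypothetical name for this sketch, and the directory shown is the one assumed in this section; adjust it to match your own setup.

```python
import os
import sys

def on_search_path(directory):
    """Return True if directory is one of the places Python looks for modules."""
    target = os.path.abspath(os.path.expanduser(directory))
    for entry in sys.path:
        if os.path.abspath(entry) == target:
            return True
    return False

# '~/PythonCourse/pylib' is the directory assumed in this section
print(on_search_path('~/PythonCourse/pylib'))
```

If this prints False in a new terminal, the PYTHONPATH lines most likely did not end up in the startup file your shell actually reads.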




Exercises:


1: Practice with functions

Make a function that:

A) Takes an integer x as input and prints x * 2.

B) Takes integers x and y as input and prints x * y.

C) Takes a list xs as input and prints xs[0] * xs[1].

D) Modify the above programs so that the function returns the result instead of printing it, then the output is printed from the program that called the function.


2. What happens in functions doesn't always stay in functions


As promised, most things that happen in functions stay in the functions, but there are important exceptions. Make the following functions, which should illustrate this property:

A) The function takes an integer as input and increments the integer by one using the '+=' operator. Print the value of the integer before and after the function is called.

B) The function takes a list as input and changes the first element of the list to the string 'x'. Print the value of the list before and after the function is called.

C) The function takes a dictionary as input and adds the key 'x' with value 'y' to this dictionary. Print the dictionary before and after the function is called.


3. Reverse Complement


A) Write a function that takes a DNA sequence as an argument, ensures that the sequence is in capital letters, and then returns the reverse complement of the sequence.

B) Modify the function to ensure that only the characters A, T, G, C and N (for unknown nucleotide) are in the input sequence.


4. Making a module


If you haven't done so already during the lecture, create a directory in your PythonCourse directory called pylib, then add it to your PYTHONPATH. Create a module in this directory called exercises.py. Put your functions from Exercise 1, part D, into this module. Put the reverse complement function from Exercise 3 into this module. Add a print statement saying "This is the exercises module". Now write two programs (as described in part A and B) that import and call all of the functions in the module:

A) A program that uses the line import exercises.

B) A program that uses the line from exercises import *. What happens when you have print statements in exercises.py? Are they printed when you use the from statement?

5. Make a FASTA parser


Copy the seq.FASTA fasta file and the read_fasta.py script from Section 3.2 into your Section 4.1 folder. Modify the script and make the function fastaparser() that takes a filename as input, reads through the file using open(), distinguishes between ID-containing lines and sequence-containing lines, and returns a dictionary with gene IDs as keys and sequences as values. Put this function along with your reverse complement function into a sequence_tools.py module and place it in your modules folder.

Using the sequence_tools.py module, write a program that prints the reverse complement of the sequence for gene3 in seq.FASTA.

6. (Bonus) Create an ORF finder


For our purposes, we will define an open reading frame (ORF) as a start codon followed at some distance by a stop codon in the same frame. This program should take a dictionary from a parsed FASTA file (see exercise 5) as input and then output a dictionary of gene name:ORF(s) as the key:value pairs.

HINT: Remember that an ORF is made of codons, so the number of nucleotides is divisible by three. Use ATG as the start codon and TAG, TAA, and TGA as potential stop codons.

7. Collections


Go back to exercise 4 of Section 2.1. Remember where you counted the number of each type of amino acid in this program? Look over the code you copied in for the exercise and see if you now understand it. Rewrite the section where you counted amino acids using the collections module.

8. For This and Giggles.


Try out the following code:

#!/usr/bin/env python
 
import this

#!/usr/bin/env python
 
import antigravity

Solutions:


Some alternate solutions in an iPython Notebook are here.

1: Practice with functions

print "EXERCISE 1"
##A##
def f1(x):
        print x*2
##B##
def f2(x,y):
        print x*y
##C##
def f3(xs):
        print xs[0]*xs[1]
 
x,y,z = f1(1),f2(2,3),f3([4,5])
print 'f1', x
print 'f2', y
print 'f3', z
print
##Note that the operations in the function are printed, but the values returned to the variables x,y,z are None
 
##D##
def f1(x):
        return x*2
def f2(x,y):
        return x*y
def f3(xs):
        return xs[0]*xs[1]
 
x,y,z = f1(1),f2(2,3),f3([4,5])
print 'f1', x
print 'f2', y
print 'f3', z
##Using return, the operations done are sent to the assigned variables.
The output is:
EXERCISE 1
2
6
20
f1 None
f2 None
f3 None

f1 2
f2 6
f3 20

2. What happens in functions doesn't always stay in functions

print "EXERCISE 2"
 
##A##
def f1(x):
        x+=1
 
y = 1
print "The original number is:", y
f1(y)
print "The number is still: ", y
print
 
##B##
def f2(lst):
        lst[0] = 'x'
y = [3,4]
print "The original list is:", y
f2(y)
print "The list is now:", y
print
 
##C##
def f3(mydict):
        mydict['x'] = 'y'
 
y = {3:4}
print "The original dictionary is:", y
f3(y)
print "The dictionary is now:", y
print
 
The output is:
EXERCISE 2
The original number is: 1
The number is still: 1

The original list is: [3, 4]
The list is now: ['x', 4]

The original dictionary is: {3: 4}
The dictionary is now: {'x': 'y', 3: 4}

Note that the list and dictionary have changed, but the integer has not.
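
If you want a function to work on a list without changing the caller's copy, one option is to copy the list inside the function first. A minimal sketch, with invented function names change_first and change_first_copy:

```python
def change_first(lst):
    lst[0] = 'x'          # changes the caller's list in place

def change_first_copy(lst):
    new_list = lst[:]     # a full slice makes a new list with the same elements
    new_list[0] = 'x'
    return new_list

a = [3, 4]
b = change_first_copy(a)
print(a)    # the original list is left alone
print(b)    # the changed copy

c = [3, 4]
change_first(c)
print(c)    # this list was modified by the function
```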

3. Reverse Complement

print "EXERCISE 3"
 
##A##
def revComp(seq):
        ##First way##
        seq = seq.upper()[::-1] ##Reverse the string and make all upper case
        seq = seq.replace('T','a') ##Replace each ACGT with complement in lower case
        seq = seq.replace('A','t') ##We do lower case to not accidentally replace bps
        seq = seq.replace('C','g') ##we already replaced.
        seq = seq.replace('G','c')
        seq = seq.upper() ##make all upper case again
        '''
        ##Alternate way##
        comp = {'A':'T','T':'A','C':'G','G':'C'} ##Make dictionary of complements
        seq = seq.upper()[::-1] ##Reverse string and make all upper case
        newseq = '' ##Specify new string
        for i in seq: ##Loop through old seq, adding complement in new seq, if it has one
                if i in comp: newseq += comp[i]
                else: newseq += i
        seq = newseq
        '''
 
        ##B - Check if in AGCTN##
        myset = set(seq) ##Get set of unique bases
        mycheck = [1 for i in myset if i not in "ACGTN"] ##Add 1 to list if not ACGTN
        if sum(mycheck) != 0: ##If !=0, then there is an error--return 0
                print "The base pairs are not all A, G, C, T, or N"
                return 0
        else: return seq ##Everything works, return the seq
 
myseq = 0
while myseq == 0: ##Loop until returns a seq
        myseq = raw_input("Please enter a sequence to take the reverse complement of: ")
        myseq = revComp(myseq) ##Apply my revComp function to the sequence
 
print myseq
 
 
The output is (when I plug in 'AGTCN'):
EXERCISE 3
Please enter a sequence to take the reverse complement of: AGTCN
NGACT

The output is (when I plug in 'jadsh' and then 'aaatcn'):
EXERCISE 3
Please enter a sequence to take the reverse complement of: jadsh
The base pairs are not all A, G, C, T, or N
Please enter a sequence to take the reverse complement of: aaatcn
NGATTT
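
Returning 0 to signal bad input works here, but the caller has to remember to test for it. One alternative, sketched below, is to raise a ValueError and let the caller decide how to react. The names COMPLEMENT and rev_comp_strict are hypothetical and are not part of the course solution; the sketch assumes N should be kept unchanged, as in the outputs above.

```python
# Complement lookup; N (unknown base) maps to itself.
COMPLEMENT = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C', 'N': 'N'}

def rev_comp_strict(seq):
    """Return the reverse complement of seq, or raise ValueError."""
    seq = seq.upper()
    bad = sorted(set(seq) - set(COMPLEMENT))   # characters we cannot complement
    if bad:
        raise ValueError("unexpected characters: %s" % ", ".join(bad))
    return "".join(COMPLEMENT[base] for base in reversed(seq))

print(rev_comp_strict("AGTCN"))   # NGACT
try:
    rev_comp_strict("jadsh")
except ValueError as err:
    print("Rejected: %s" % err)
```

With this design the input loop would wrap the call in try/except instead of comparing the result to 0.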

4. Making a module

Your exercises.py file should be in your ~/PythonCourses/pylib directory, which you added to your PYTHONPATH using the instructions right before the exercises.
##exercises.py in pylib module##
 
## EXERCISE 1 Part D ##
def f1(x):
        return x*2
 
def f2(x,y):
        return x*y
 
def f3(xs):
        return xs[0]*xs[1]
 
## EXERCISE 3 ##
def revComp(seq):
        ##First way##
        seq = seq.upper()[::-1]
        seq = seq.replace('T','a')
        seq = seq.replace('A','t')
        seq = seq.replace('C','g')
        seq = seq.replace('G','c')
        seq = seq.upper()
        '''
        ##Alternate way##
        comp = {'A':'T','T':'A','C':'G','G':'C'}
        seq = seq.upper()[::-1]
        newseq = ''
        for i in seq:
                if i in comp: newseq += comp[i]
                else: newseq += i
        seq = newseq
        '''
 
        ##B - Check if in AGCTN##
        myset = set(seq)
        mycheck = [1 for i in myset if i not in "ACGTN"]
        if sum(mycheck) > 0:
                print "The base pairs are not all A, G, C, T, or N"
                return 0
        else: return seq
 
print "This is the exercises module"
It includes the functions you made in Exercises 1 and 3 of this section.

Now, for part A,
import exercises
 
print exercises.f1(1)
print exercises.f2(2,3)
print exercises.f3([4,5])
 
print exercises.revComp("AGCTN")
 
while for part B,
from exercises import f1, f2, f3, revComp
 
print f1(1)
print f2(2,3)
print f3([4,5])
 
print revComp("AGCTN")
 
The output for both scripts is:
This is the exercises module
2
6
20
NAGCT

Note: The print statement at the bottom of exercises.py runs in both cases, because Python executes the whole module the first time it is imported, whichever import form you use.
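
If you would rather not see that message on every import, a common Python idiom is to place such statements under an if __name__ == "__main__": check, which is true only when the file is run directly as a script. A minimal sketch, assuming a trimmed-down exercises.py with just one function (this is not a layout the exercise requires):

```python
# exercises.py (trimmed sketch)
def f1(x):
    return x * 2

if __name__ == "__main__":
    # Runs for "python exercises.py", but not for "import exercises"
    print("This is the exercises module")
```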

5. Make a FASTA parser

My sequence_tools.py module is:
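def fastaparser(filename):
        ##Read a FASTA file into a dictionary of {gene name: sequence}
        myfile = open(filename,'r')
        mydict = {}
        for line in myfile:
                line = line.strip() ##Remove the newline (if any) without losing a base on the last line
                if line == '': continue ##Skip blank lines
                if line[0] == '>': mygene = line[1:]
                else:
                        if mygene not in mydict: mydict[mygene] = line
                        else: mydict[mygene] += line
        myfile.close()
        return mydict
 
def revComp(seq):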
        ##First way##
        seq = seq.upper()[::-1]
        seq = seq.replace('T','a')
        seq = seq.replace('A','t')
        seq = seq.replace('C','g')
        seq = seq.replace('G','c')
        seq = seq.upper()
        '''
        ##Alternate way##
        comp = {'A':'T','T':'A','C':'G','G':'C'}
        seq = seq.upper()[::-1]
        newseq = ''
        for i in seq:
                if i in comp: newseq += comp[i]
                else: newseq += i
        seq = newseq
        '''
 
        ##B - Check if in AGCTN##
        myset = set(seq)
        mycheck = [1 for i in myset if i not in "ACGTN"]
        if sum(mycheck) > 0:
                print "The base pairs are not all A, G, C, T, or N"
                return 0
        else: return seq
 

My script for exercise 5 is:

import sequence_tools as sq
 
mygenes = sq.fastaparser("seq.FASTA")
print "The sequence is:", mygenes['gene3']
print "The reverse complement is:", sq.revComp(mygenes['gene3'])
 
The output is:
The sequence is: TTATGGCACCCACTAGAGCCAGATTATTTTAAA
The reverse complement is: TTTAAAATAATCTGGCTCTAGTGGGTGCCATAA
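
One way to check a FASTA parser is to run it on a tiny file whose contents you already know. The sketch below writes a two-record file to a temporary directory and parses it. The names parse_fasta, toy.fasta, geneA, and geneB are made up for illustration, and the parser uses strip() so that a file without a final newline does not lose its last base.

```python
import os
import tempfile

def parse_fasta(filename):
    """Return {header: sequence} for a FASTA file (sketch)."""
    records = {}
    name = None
    with open(filename) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue                      # skip blank lines
            if line.startswith('>'):
                name = line[1:]               # header without the '>'
                records[name] = ''
            elif name is not None:
                records[name] += line         # join wrapped sequence lines
    return records

# Build a tiny two-record file; note the missing final newline.
path = os.path.join(tempfile.mkdtemp(), "toy.fasta")
with open(path, "w") as out:
    out.write(">geneA\nATG\nCCC\n>geneB\nTTTAAA")

toy = parse_fasta(path)
print(toy["geneA"])   # ATGCCC
print(toy["geneB"])   # TTTAAA
```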

6. (Bonus) Create an ORF finder

import sequence_tools as st
 
##Function that inputs a sequence and outputs a list of ORFs
def orffinder(gene):
        myorfs = []
 
        ##Find positions of all start and stop codons
        startind = findallindex(gene,"ATG")
        stopind = []
        for stop in ["TAG","TAA","TGA"]: stopind.extend(findallindex(gene,stop))
        stopind.sort()
 
        ##Loop over all start codons, finding the closest downstream stop codon that is in the same reading frame
        for start in startind:
                for stop in stopind:
                        if stop-start < 3: continue ##Stop codon must lie after the start codon
                        myorf = gene[start:(stop+3)]
                        if len(myorf) % 3 != 0: continue ##Skip stop codons that are out of frame
                        myorfs.append(myorf)
                        break ##Makes sure that once an ORF for that start codon is found, we move on to the next start codon
        return myorfs
 
##Find all positions in a string where the given character/string is found
def findallindex(mystr,myvar):
        indices = []
        numbases = 0
        while myvar in mystr:
                pos = mystr.index(myvar) ##index() finds the first occurrence
                indices.append(pos+numbases) ##Add the position in original string (hence, add numbases) where the given char/str was found
                mystr = mystr[(pos+len(myvar)):] ##Update the string to remove everything including the first occurrence of the given char/str
                numbases += pos+len(myvar) ##Keep track of how many previous positions were removed from the string
        return indices
 
mygenes = st.fastaparser("seq.FASTA")
ORFs = {}
for i in mygenes:
        ORFs[i] = orffinder(mygenes[i])
print ORFs
 
The output is:
{'gene1': ['ATGAGACGTAGTGCCAGTAGCGCGATGTAG', 'ATGTAG'], 'gene2': ['ATGTTCGACGCATACGACGCGCAGTACCAGCAATGA', 'ATGACGCACCGGGATACACGACGCGGATTTTTACGCACCGAGATAGCATAA'], 'gene3': ['ATGGCACCCACTAGAGCCAGATTATTTTAA']}
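
The findallindex helper above trims the string after each match and keeps a running offset. The same positions can be collected with str.find and its optional start argument, which avoids copying the string. The sketch below (the name find_all is hypothetical) resumes the search one character after each hit, so it also reports overlapping matches. That makes no difference for ATG, TAG, TAA, or TGA, none of which can overlap itself, but it matters for patterns such as "AA".

```python
def find_all(text, pattern):
    """Return every start index of pattern in text, overlaps included."""
    indices = []
    pos = text.find(pattern)
    while pos != -1:
        indices.append(pos)
        pos = text.find(pattern, pos + 1)   # resume one character later
    return indices

print(find_all("ATGAATGTAG", "ATG"))   # [0, 4]
print(find_all("AAAA", "AA"))          # [0, 1, 2]
```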

7. Collections

#!/usr/bin/env python
###NEW COMMENTS HAVE THREE HASHTAGS!!!
 
#initialize list to store sequence
protSeq = []
#open pdb file
f1 = open('2Q6H.pdb', 'r')  ###Download the pdb file for 2Q6H
#loop over lines in file
for next in f1:
    #identify lines that contain sequences
    if next[:6] == 'SEQRES':
        #strip away white space and
        #convert line into list
        line = next.strip().split()
        #delete descriptor information
        #at beginning of each line
        del line[:4]
        #loop over amino acids in line
        for aa in line:
            #add to sequence list
            protSeq.append(aa)
#close file
f1.close()
 
print "The total number of amino acids in the protein is:", len(protSeq)
 
###My added lines below
import collections
mycount = collections.Counter(protSeq)
for i in sorted(mycount.keys()):
      print i, mycount[i]
 
 
The output is the same as before:
The total number of amino acids in the protein is: 519
ALA 54
ARG 21
ASN 14
ASP 12
GLN 6
GLU 24
GLY 45
HIS 6
ILE 54
LEU 61
LYS 19
MET 13
PHE 50
PRO 25
SER 18
THR 27
TRP 16
TYR 17
VAL 37
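
Counter objects also offer most_common(), which returns (item, count) pairs sorted from most to least frequent, and they return 0 for items that were never counted. A short sketch on a made-up list of residues (toy data, not the contents of 2Q6H.pdb):

```python
import collections

residues = ['ALA', 'LEU', 'ALA', 'GLY', 'LEU', 'ALA']   # toy data
counts = collections.Counter(residues)

# The two most frequent residues, highest count first
for name, n in counts.most_common(2):
    print("%s %d" % (name, n))

print(counts['TRP'])   # 0, missing keys count as zero
```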

8. For This and Giggles.

Running import antigravity prints nothing; instead it opens your web browser to the comic at http://xkcd.com/353/

The output for import this is:
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!