Session 5.1

Topics:

  • Common Errors
  • Error messages from Python
  • Exception Handling (try, except)
  • Print statement debugging

Introduction


"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." – Brian W. Kernighan

Let's be honest, most of us aren't perfect. And in our zen self-awareness, we are well-served to be sensitive to our predispositions to err in particular ways. We are probably even better served to listen to the computer when it tells us we've made a big mistake. Python can be quite good at communicating our flaws to us, if only we're receptive to the constructive criticism in its spartanly high-contrast black and white print.

This morning we're going to look at common sources of error, see what to look for as feedback from Python, and learn a couple of tricks to both obviate, or if necessary, track down bugs, should they occur.

A few commonly encountered sources of errors include:

  • mis-typed variable names
  • mis-typed variables
  • non-existent variables (or indices in a variable)
  • unexpected input

If you're very lucky, an error will cause Python to give up right away, while others will cause insidious bugs that sneak through unnoticed until you present your work in lab meeting and someone calls you on your exciting, but seemingly impossible (and ultimately bogus) result.

Strategic Initiative 1: Test Early, Test Often


There's an approach to software engineering called "Test Driven Development". In that way of doing things, before you write a function that does something, you write code that tests it, often in a separate file. It's really convenient to use modules to do this, where you could have a test.py file that imports the relevant functions from your other modules and tests them on as many cases as it can. This also has the advantage that you notice fairly quickly when your new code breaks something that used to work. Especially in scientific programming, this isn't always practical to do, but the general ideas hold: testing is good.

Even if you don't do TDD, you can save yourself lots of time by testing frequently. Debugging 100 lines of code is often more than 10 times as hard as debugging 10 lines. Writing a ton of code that generates output without checking if each component works individually does NOT make you a coding rock star; it makes you sloppy.

Strategic Initiative 2: Be verbose


"Errors should never pass silently." -- The Zen of Python, by Tim Peters

One of the easiest ways to debug code is to print out the value of variables at the point things start going wrong. Of course, if you knew where things were going wrong, you would probably know what was going wrong in the first place, so to find this, a divide-and-conquer approach is often fastest. Start out by putting a bunch of print statements in throughout your code, then narrow things down gradually until you're right at the broken line of code.

It's not a bad idea to include these print statements in moderation before you need them. If you think there will be a subset of your data that will fail a logical test, setup your if and else statements to report the incidence of failure.

# lab_members = {name : [status]}
# pretend this dictionary is already filled with data
 
for name in lab_members:
     if lab_members[name] == 'Post-doc':
          get_a_job(name)
          print "Time for this post-doc to get a job..."
     elif lab_members[name] == 'Grad Student':
          do_research()
          print "Be patient, this one's just a grad student."
     else:
          print "Warning:", name, "is not a post-doc nor a grad student!"

Strategic Initiative 3: Be Boring, Be Obvious


"Programs must be written for people to read, and only incidentally for machines to execute." --Structure and Interpretation of Computer Programs Hal Abelson and Gerald Jay Sussman

As you get better at coding, you will start to take shortcuts and combine lines. As soon as something doesn't behave as you expect, you should decompose your compound statements, as this is a common source of error.
import sys
##Call in first gene and print out sequence
myfile = open(sys.argv[1],mode='r')
myfile = myfile.read()
 
##COMPOUND
xx = {int(myfile[5:7].strip()):myfile[7:40].strip()+myfile[40:72].strip()+myfile[72:].strip()}[int(myfile[5:7].strip())][3:8]
print 'Compound:',xx
 
 
##EXPANDED
myl1 = myfile[5:7]
myl1 = myl1.strip()
myl2 = myfile[7:40]
myl2 = myl2.strip()
myl3 = myfile[40:72]
myl3 = myl3.strip()
myl4 = myfile[72:]
myl4 = myl4.strip()
mykey = int(myl1)
myseq1 = myl2+myl3+myl4
mydict = {mykey:myseq1}
xx = mydict[mykey][3:8]
print 'Expanded:', xx
$./errors.py seq.FASTA
Compound: AGACG
Expanded: AGACG

These two code snippets do the same thing, and Python doesn't much care which you use. On the other hand, it's much easier to look at each line of code in the expanded case and see that it does what it should be doing, whereas in the compound form, you could very easily have parentheses that are matched in unexpected ways that throws the whole meaning of things off.

On a related note, it's good practice to comment your code. The Python Style Guide has some good recommendations about comments. Anything you did that required some thinking on your part should ideally be commented, since you might come back to the code weeks or months later and have forgotten why you did it that way. Also try to keep your comments up to date with the code.

Error Messages from Python: Syntactical Mistakes and Others


Much like in spoken languages, Python has rules for what is and is not grammatically allowed. Before running your script, Python will check it to make sure that it does follow all these rules, and will let you know if it doesn't. Keep in mind, though, that it can't catch errors in meaning, so long as your code follows the rules of the language (as a natural language equivalent of things Python can't catch, consider Groucho Marx: "One morning I shot an elephant in my pajamas. How he got into my pajamas I'll never know.").

#!/usr/bin/python
 
while True
    print 'Bioinformatics is AWESOME!'

$ ./errors.py

Traceback (most recent call last):
  File "./errors.py", line 3
    while True
             ^
SyntaxError: invalid syntax

This error message, for its brevity, is quite helpful. We quickly learn the following:

  1. The problem is on line 3 of the file errors.py. This may seem trivial in this case, but when you have nested functions conjured from various modules in various directories, knowing which file to look in is a big, big help.
  2. There is a problem after True. Note that, for syntax errors, the caret will often point somewhere after the place where the actual error is, and sometimes can point quite a bit after (for instance, when the closed parentheses is missing for a function), though in this case it's pretty close.
  3. The problem is that there is invalid syntax: we haven't followed the rules of the language. As you've likely seen already, and will no doubt see much more of in the future, there are all kinds of different errors that can show up, usually with a terse message explaining what went wrong.

What is the problem and how is it fixable?

There are lots of other kinds of errors that will almost inevitably crop up in your code. You've even already seen some in our lecture on data structures.

instructors = ['Aisha', 'Peter', 'Diana']
print instructors[10]

$./errors.py
Traceback (most recent call last):
  File "errors.py", line 4, in <module>
    print instructors[10]
IndexError: list index out of range

Again, let's see what we can work out from this error message:
  1. The error is now on line 4 of errors.py.
  2. There's some problem with the expression print instructors[10] (presumably in this case you could have looked that up yourself, but Python does try to be helpful).
  3. That problem is of the type IndexError, and in particular there is a list index out of range.

By now, it should be clear what the problem is (assuming it wasn't clear to you before I even ran it): instructors isn't long enough to have an item at index 10, so Python throws up its hands and gives up. Sometimes, though, we don't want Python to surrender at the first sign of trouble. If only there were some way to help it deal with problems that we can anticipate...

Easier to Ask Forgiveness than Permission: Try and Except


"Errors should never pass silently. Unless explicitly silenced." -- The Zen of Python, by Tim Peters

The try and except statements let us attempt to do something that may or may not work, and then respond appropriately if or when something goes wrong. For example, sometimes the file we want to work with isn't available for some reason (and if you're really lucky, it's only because you forgot to plug in an external hard drive).
fh = open('AwesomeDataFile.txt', 'r')

$ ./errors.py
Traceback (most recent call last):
  File "errors.py", line 4, in <module>
    fh = open('AwesomeDataFile.txt', 'r')
IOError: [Errno 2] No such file or directory: 'AwesomeDataFile.txt'
An IOError is given. Now, let's use try and except.
try:
    f1 = open('AwesomeDataFile.txt', 'r')
except IOError:
    print "The file 'AwesomeDataFile.txt' doesn't exist!"
 
This means that when there is an IOError, rather than making python fail, the program should do something in particular (print a line saying as much, in this case).

Sometimes you might have a section of code where multiple things could go wrong. In that case, you can respond to each of them appropriately using multiple except statements. It's even possible to have an except statement without listing the type of error, which will deal with anything not handled by one of the previous except statements.
import sys
 
def dividetwo(num):
    return 2/num
 
try:
    print dividetwo(float(sys.argv[1]))
except ValueError:
    print 'not a float!'
except IndexError:
    print 'did you put a number in the command line?'
except:
    print 'Some other error!'
What happens for each of these cases?

$ python errors.py 1

$ python errors.py a

$ python errors.py

$ python errors.py 0

It's considered slightly bad form, however, to have a catch-all except block. If you didn't anticipate the error, it will probably be helpful to you to get the default error message, even if it is a little arcane, rather than a message like the above, which is pretty darn vague.

For a listing of the default kinds of errors (also called "exceptions") that can happen, see the list of Built-in Exceptions in the Python Documentation.

Else and Finally


Similar to the use of else in for and while loops, anything in an else block will run *only if* no errors happened. On the other hand, anything in a finally block runs after the try/except blocks, and will run regardless of whether an exception was thrown; if there was an exception, it doesn't stop it from being thrown, it just waits a bit before it does.

def divide(x,y):
    try:
        result = x/y
    except ZeroDivisionError:
        print "Warning: divide by zero!"
        return float('NaN') #Not a number
    else:
        return result
    finally:
        print "executing final clause"
 
print divide(2,1)
print
print divide(2,0)
print
print divide('A','B')

$./errors.py
executing final clause
2

Warning: divide by zero!
executing final clause
nan

executing final clause
Traceback (most recent call last):
File "errors.py", line 32, in <module>
print divide('A','B')
File "errors.py", line 19, in divide
result = x/y
TypeError: unsupported operand type(s) for /: 'str' and 'str'

The first case made it through the division just fine.

The second case raised an exception (a division by zero error), but we anticipated that, so we replace it with a special Not a Number value, and move on.

The third case we found an exception that we didn't anticipate. The exception came up, so we still ran the finally clause, but then raised the exception. This might be useful if you want the program to still finish up despite the error (output whatever has already been done by the script, perhaps).

Any questions?



Exercises


1) Exception handling we have known.


If you don't know them already, there are two functions to get rid of elements in a list: remove() and discard().
#!/usr/bin/env python
 
list_of_letters = ['a', 'a', 'b', 'c','c','c','d','e']
print 'ORIGINAL'
set_of_letters = set(list_of_letters)
print set_of_letters
 
print 'DISCARD'
set_of_letters.discard('q')
print set_of_letters
 
print 'REMOVE'
set_of_letters.remove('q')
print set_of_letters

a) Now that we've learned more about exception handling, explain what is happening here.

b) Create a script that contains a list of 5 of your favorite beers or wines or soft drinks (depending on your preference) for a tasting party stored as a set. Each time someone drinks a beverage, it is removed from your fridge and cannot be drunk again. Create this scenario: Fernando drinks a beverage. Adjust the set as appropriate. James sees Fernando drinking his beverage and wants the same one. Tell him you're out of that beverage.

2) Modify the FASTA parser in the sequence_tools.py module to handle:


1) non-existent filenames. Return a print statement indicating the error.
2) files that do not conform to the FASTA format (i.e. >gene for IDs, and strings of A,T,G, or C for sequence). Return a print statement indicating the error.
3) sequences that are in both cases. Make sure the program can take in upper and lower cases without producing an error.

3) Modify the FASTA parser to identify file compression format and read compressed files.


We need to add functionality to our FASTA parser for use next week. We will be using very large FASTA files (> 3GB) when they are uncompressed. In order to save disk space, we'll be leaving them compressed (~1 GB) while we use them. However, this will require us to use the gzip module to read a compressed file directly. We will also need the mimetypes module, which can identify the type of file we are using.

First, write a new function called open_file_by_mimetype() that determines whether a file is gzipped or not (using the mimetypes modules) and returns an open file handle. Then make a function called read_fasta() using your FASTA parser script, but modify it to call this new function such that your parser can read both types of FASTA files automatically.

To test out the code, use the gzip command in the terminal to compress your seq.FASTA file to seq.FASTA.gz. Use gunzip to uncompress your zipped file.

When you're done, store your new file-opening function in a module called file_tools.py. Store your improved read_fasta() parser in the module you already made called sequence_tools.py.

4) All code is bug-free until your first user.


You have another coworker who made an AMAZING secondary structure analysis script. She used it on the protein 3GV2, inputting the pdb file for 3GV2. She asks if you will analyze her protein, interleukin-19, as well.(HINT: use PDB code 1N1F). Crud! This protein breaks your code. Why? Rewrite your code to work on both interleukin-19 and on the original 3GV2 HIV capsid protein.

The script she shared is below. I recommend also downloading the pdb file for 3GV2, to see what a working final output should look like.

#!/usr/bin/python
##SCRIPT TO PARSE OUT SECONDARY STRUCTURE INFORMATION
 
import sys, os
 
full_seq = []
helix_aa = []
sheet_aa = []
atoms = []
f1 = open('1N1F.pdb' ,'r')
for next in f1:
    tmp = next.strip().split()
    if tmp[0] == 'SEQRES':
        if tmp[2] == 'A':
            full_seq.extend(tmp[4:])
    elif tmp[0] == 'HELIX':
        try:
            int(tmp[5])
        except:
            tmp[5] = tmp[5][:-1]
        helix_aa.append(tmp[:9])
    elif tmp[0] == 'SHEET':
        sheet_aa.append(tmp[:10])
    elif tmp[0] == 'ATOM':
        if len(tmp) < 12:
            begin = tmp[0:2]
            end = tmp[3:]
            middle = [tmp[2][:3], tmp[2][4:]]
            tmp = begin + middle + end
        try:
            int(tmp[5])
        except:
            continue
        atoms.append(tmp)
 
######################
 
num_helix_res = 0.0
print "There are %s residues in the sequence" % len(full_seq)
 
# Set up a listing of features by residue, then fill it in as we go along
feature = ['Other']*(10000)
 
for aa in helix_aa:
    # We add 1 because there are b-a+1 residues between a and b, inclusive
    num_helix_res += float(aa[8]) - float(aa[5]) + 1
    for i in range(int(aa[5]), int(aa[8])+1):
        feature[i] = 'Helix'
 
num_sheet_res = 0.0
for sheet in sheet_aa:
    num_sheet_res += float(sheet[9]) - float(sheet[6]) + 1
    for i in range(int(sheet[6]), int(sheet[9])+1):
        feature[i] = 'Sheet'
 
 
 # atom[4] == chain id
 # atom[5] == residue #
 # atom[10] == b-factor
 
 
helix_bfactors = {}
sheet_bfactors = {}
other_bfactors = {}
 
for atom in atoms:
    Chain = atom[4]
    BFactor = float(atom[10])
    ResidueNum = int(atom[5])
 
    if feature[ResidueNum] == 'Helix':
        if Chain not in helix_bfactors:
            helix_bfactors[Chain] = []
        helix_bfactors[Chain].append(BFactor)
    elif feature[ResidueNum] == 'Sheet':
        if Chain not in sheet_bfactors:
            sheet_bfactors[Chain] = []
        sheet_bfactors[Chain].append(BFactor)
    else:
        if Chain not in sheet_bfactors:
            other_bfactors[Chain] = []
        other_bfactors[Chain].append(BFactor)
 
for chain in helix_bfactors:
    # I could have used any of the different bfactor listings
    avg_helix = sum(helix_bfactors[chain])/len(helix_bfactors[chain])
    avg_sheet = sum(sheet_bfactors[chain])/len(sheet_bfactors[chain])
    avg_other = sum(other_bfactors[chain])/len(other_bfactors[chain])
    print '%s\t %5f\t %5f\t %5f' % (chain, avg_helix, avg_sheet, avg_other)
 
 

Solutions


1) Exception handling we have known.

print "Exercise 1, part A"
list_of_letters = ['a', 'a', 'b', 'c','c','c','d','e']
print 'ORIGINAL'
set_of_letters = set(list_of_letters)
print set_of_letters
 
print 'DISCARD, no error happens if the value is not in the set.'
set_of_letters.discard('q')
print set_of_letters
 
print "REMOVE, an error does occur when the value isn't in the set"
set_of_letters.remove('q')
print set_of_letters
 
The output is:
Exercise 1, part A
ORIGINAL
set(['a', 'c', 'b', 'e', 'd'])
DISCARD, no error happens if the value is not in the set.
set(['a', 'c', 'b', 'e', 'd'])
REMOVE, an error does occur when the value isn't in the set
Traceback (most recent call last):
File "ex1.py", line 14, in <module>
set_of_letters.remove('q')
KeyError: 'q'

And for part B,
print "Exercise 1, part B"
beers = set(['ipa','lager','paleale','saison','sour'])
def beerhour(beer):
        try: beers.remove(beer)  ##See if the command works
        except KeyError: print "Hey! Where's the " + beer +"?!" ##print statement when t
here is a key error rather than giving Traceback error
        else: print "Sweet, there's a " + beer + "!" ##If everything is fine, keep doing
 everything here
        finally: ##Regardless of everything working fine or the presence of a key error,
 do the following commands
                print "I see that " + ', '.join(beers) + " are all that's left."
                return beers
 
print "Fernando grabs a sour."
beerhour("sour")
print "James wants a sour."
beerhour("sour")
 
The output is:
Exercise 1, part B
Fernando grabs a sour.
Sweet, there's a sour!
I see that lager, ipa, saison, paleale are all that's left.
James wants a sour.
Hey! Where's the sour?!
I see that lager, ipa, saison, paleale are all that's left.

2) Modify the FASTA parser in the sequence_tools.py module to handle:

In sequence_tools.py, I now have the function fastaparser_dealwerrors():
def fastaparser_dealwerrors(filename):
        try: myfile = open(filename,'r')  ##try/except IOError answers part 1
        except IOError:
                return "Filename no good! Please verify filename and try again!"
        mydict = {}
        for ind,line in enumerate(myfile):
                if ind == 0:
                        if line[0] != '>': ##Answers part 2!
                                return "Not a FASTA file! Try again!"
                if line[0] == '>':
                        mygene = line[1:-1]
                else:
                        myseq = line[:-1].upper() ##We had this already--it answers part 3
                        mycheck = sum([1 for i in myseq if i not in "AGTCN"])
                        if mycheck != 0: ##We had this already, it answers part 2!
                                return "Weird characters in sequence! Try again!"
                        if mygene not in mydict: mydict[mygene] = myseq
                        else: mydict[mygene] += myseq
        myfile.close()
        return mydict
 
Then, using the script ex2.py below,
import sequence_tools as st
import sys
 
print st.fastaparser_dealwerrors(sys.argv[1])
 
I try out several broken fasta files, with the first four lines printed below of each (I only tried to break the first FASTA sequence in the seq.FASTA file).
seq_error_badbase.FASTA
>gene1
JKeAGACGTAGTGCCAGTAGCGCGATGTAGCG
ATGACGCATGACGCGCGACGCGCGAGTGAGCC
ATACGCACGCATTGGCA
 
seq_error_lowercase.FASTA
>gene1
ataAGACGTAGTGCCAGTAGCGCGATGTAGCG
ATGACGCATGACGCGCGACGCGCGAGTGAGCC
ATACGCACGCATTGGCA
 
and seq_error_nocarat.py
gene1
ATGAGACGTAGTGCCAGTAGCGCGATGTAGCG
ATGACGCATGACGCGCGACGCGCGAGTGAGCC
ATACGCACGCATTGGCA
 
I get:
$ python ex2.py seq_error_badbase.FASTA
Weird characters in sequence! Try again!

$ python ex2.py seq_error_lowercase.FASTA
{'gene1': 'ATAAGACGTAGTGCCAGTAGCGCGATGTAGCGATGACGCATGACGCGCGACGCGCGAGTGAGCCATACGCACGCATTGGCA', 'gene2': 'ATGTTCGACGCATACGACGCGCAGTACCAGCAATGACGCACCGGGATACACGACGCGGATTTTTACGCACCGAGATAGCATAAAAGACCATTAG', 'gene3': 'TTATGGCACCCACTAGAGCCAGATTATTTTAAA'}

$ python ex2.py seq_error_nocarat.FASTA
Not a FASTA file! Try again!

$ python ex2.py madeupfile.FASTA
Filename no good! Please verify filename and try again!

3) Modify the FASTA parser to identify file compression format and read compressed files.

The open_file_by_mimetype() function in file_tools.py is:
def open_file_by_mimetype(filename):
        import mimetypes, gzip
        ##condition for whether it's a zipped file
        if mimetypes.guess_type(filename)[1] == 'gzip':
                try:
                        ##This is one way of opening a gzipped file
                        fh = gzip.open(filename,'r')
                        return fh
                except IOError: ##account for file not existing
                        return '%s does not exist!' % (filename)
        else:
                try:
                        fh = open(filename,'r')
                        return fh
                except IOError: ##account for file not existing
                        return '%s does not exist!' % (filename)
 
The new read_fasta() function in sequence_tools.py is:
def read_fasta(filename):
        from file_tools import open_file_by_mimetype as ofbm
        myfile = ofbm(filename)
        ##Condition on whether what is returned from open_file_by_mimetype() function is
 a string (some error occurred) or an open file
        if type(myfile) != str:
                mydict = {}
                for ind,line in enumerate(myfile):
                        if ind == 0:
                                if line[0] != '>': return "Not a FASTA file! Try again!"
                        if line[0] == '>':
                                mygene = line[1:-1]
                        else:
                                myseq = line[:-1].upper()
                                mycheck = sum([1 for i in myseq if i not in "AGTCN"])
                                if mycheck != 0: return "Weird characters in sequence! Try again!"
                                if mygene not in mydict: mydict[mygene] = myseq
                                else: mydict[mygene] += myseq
                myfile.close()
        else: mydict = "%s is a bad filename!" % (filename)
        return mydict
 
Now, to try it all out,
import sequence_tools as st
 
x = "seq.FASTA"
print "For", x, " we get", st.read_fasta(x)
print
 
x = "zippedseq.FASTA.gz"
print "For", x, " we get", st.read_fasta(x)
print
 
x = "afile.gz"
print "For", x, " we get", st.read_fasta(x)
 
The output is:
For seq.FASTA we get {'gene1': 'ATGAGACGTAGTGCCAGTAGCGCGATGTAGCGATGACGCATGACGCGCGACGCGCGAGTGAGCCATACGCACGCATTGGCA', 'gene2': 'ATGTTCGACGCATACGACGCGCAGTACCAGCAATGACGCACCGGGATACACGACGCGGATTTTTACGCACCGAGATAGCATAAAAGACCATTAG', 'gene3': 'TTATGGCACCCACTAGAGCCAGATTATTTTAAA'}

For zippedseq.FASTA.gz we get {'gene1': 'ATGAGACGTAGTGCCAGTAGCGCGATGTAGCGATGACGCATGACGCGCGACGCGCGAGTGAGCCATACGCACGCATTGGCA', 'gene2': 'ATGTTCGACGCATACGACGCGCAGTACCAGCAATGACGCACCGGGATACACGACGCGGATTTTTACGCACCGAGATAGCATAAAAGACCATTAG', 'gene3': 'TTATGGCACCCACTAGAGCCAGATTATTTTAAA'}

For afile.gz we get afile.gz is a bad filename!

4) All code is bug-free until your first user.

Below, we have the same code, but now with three methods on the bottom. Method 1 is the original method that breaks when we put in 1N1F.pdb. Method 2 uses try and exceptions to deal with the key error that causes the problem. However, while it fixes the problem, there is still a key problem, where if the sheet_bfactors or other_bfactors have chains that are not in helix_bfactors, they are not printed out! Thus, Method 3 is a catch-all, that prints out a list of tuples lof all the chains in all the conditions. 1N1F.pdb only has chain A, so the same information is still produced.

Note: I also change the way we input filenames to call from the terminal, for easier testing.
#!/usr/bin/python
##SCRIPT TO PARSE OUT SECONDARY STRUCTURE INFORMATION
###These indicate our changes!
import sys, os
 
full_seq = []
helix_aa = []
sheet_aa = []
 
import sys, os
 
full_seq = []
helix_aa = []
sheet_aa = []
atoms = []
f1 = open(sys.argv[1] ,'r')  ###Call from command line
for next in f1:
    tmp = next.strip().split()
    if tmp[0] == 'SEQRES':
        if tmp[2] == 'A':
            full_seq.extend(tmp[4:])
    elif tmp[0] == 'HELIX':
        try:
            int(tmp[5])
        except:
            tmp[5] = tmp[5][:-1]
        helix_aa.append(tmp[:9])
    elif tmp[0] == 'SHEET':
        sheet_aa.append(tmp[:10])
    elif tmp[0] == 'ATOM':
        if len(tmp) < 12:
            begin = tmp[0:2]
            end = tmp[3:]
            middle = [tmp[2][:3], tmp[2][4:]]
            tmp = begin + middle + end
        try:
            int(tmp[5])
        except:
            continue
        atoms.append(tmp)
 
######################
 
num_helix_res = 0.0
print "There are %s residues in the sequence" % len(full_seq)
 
# Set up a listing of features by residue, then fill it in as we go along
feature = ['Other']*(10000)
 
for aa in helix_aa:
    # We add 1 because there are b-a+1 residues between a and b, inclusive
    num_helix_res += float(aa[8]) - float(aa[5]) + 1
    for i in range(int(aa[5]), int(aa[8])+1):
        feature[i] = 'Helix'
 
num_sheet_res = 0.0
for sheet in sheet_aa:
    num_sheet_res += float(sheet[9]) - float(sheet[6]) + 1
    for i in range(int(sheet[6]), int(sheet[9])+1):
        feature[i] = 'Sheet'
 
 # atom[4] == chain id
 # atom[5] == residue #
 # atom[10] == b-factor
 
 
helix_bfactors = {}
sheet_bfactors = {}
other_bfactors = {}
 
for atom in atoms:
    Chain = atom[4]
    BFactor = float(atom[10])
    ResidueNum = int(atom[5])
 
    if feature[ResidueNum] == 'Helix':
        if Chain not in helix_bfactors:
            helix_bfactors[Chain] = []
        helix_bfactors[Chain].append(BFactor)
    elif feature[ResidueNum] == 'Sheet':
        if Chain not in sheet_bfactors:
            sheet_bfactors[Chain] = []
        sheet_bfactors[Chain].append(BFactor)
    else:
        if Chain not in sheet_bfactors:
            other_bfactors[Chain] = []
        other_bfactors[Chain].append(BFactor)
 
###################
#Method 1
#This is the original code that breaks with 1N1F
 
for chain in helix_bfactors:
    # I could have used any of the different bfactor listings
    #print helix_bfactors.keys()
    #print sheet_bfactors.keys()
    #print other_bfactors.keys()
    avg_helix = sum(helix_bfactors[chain])/len(helix_bfactors[chain])
    avg_sheet = sum(sheet_bfactors[chain])/len(sheet_bfactors[chain])
    avg_other = sum(other_bfactors[chain])/len(other_bfactors[chain])
    print '%s\t %5f\t %5f\t %5f' % (chain, avg_helix, avg_sheet, avg_other)
 
###################
'''
######################
#Method 2
# This is my code to modify the program in order to make it robust to different .pdb's
######################
for chain in helix_bfactors:
    try:
        avg_helix = sum(helix_bfactors[chain])/len(helix_bfactors[chain])
    except KeyError:
        avg_helix = 0
    try:
        avg_sheet = sum(sheet_bfactors[chain])/len(sheet_bfactors[chain])
    except KeyError:
        avg_sheet = 0
    try:
        avg_other = sum(other_bfactors[chain])/len(other_bfactors[chain])
    except KeyError:
        avg_other = 0
    print '%s\t %5f\t %5f\t %5f' % (chain, avg_helix, avg_sheet, avg_other)
 
######################
'''
'''
######################
#Method 3
# This is another way to modify the program in order to make it robust to different .pdb's
######################
 
bfactors = {'Helix':helix_bfactors, 'Beta sheet':sheet_bfactors, 'Other':other_bfactors}
mytuples = []
for y in bfactors:
        for x in bfactors[y]:
                myavg = sum(bfactors[y][x])/len(bfactors[y][x])
                mytuples.append((x,y,myavg))
 
print sorted(mytuples)
'''
 
 
 
With Method 1:
$ python ex4.py 1N1F.pdb
Traceback (most recent call last):
File "ex4.py", line 101, in <module>
avg_sheet = sum(sheet_bfactors[chain])/len(sheet_bfactors[chain])
KeyError: 'A'

With Method 2:
$ python ex4.py 1N1F.pdb
A 36.237181 0.000000 57.820000

With Method 3:
[('A', 'Helix', 36.23718079673137), ('A', 'Other', 57.82)]