Fancy Data Structures


Topics:

  • Tuples
  • Sets
  • Lists of lists
  • Dictionaries of dictionaries
  • Lists of dictionaries, dictionaries of lists
  • Nested structures

Introduction


Let's say that you have a dictionary. In fact, let's say that you are a collector of English dictionaries, and you have all of them, from Cawdrey's 1604 A Table Alphabeticall all the way to a hard copy of the ever-contemporary Wiktionary project of Wikipedia. Finally, let us also suppose that you have some interest in the evolution of words over time. After all, why did you go out and buy all those dictionaries in the first place?

Given the number of dictionaries and the number of words, it seems only natural that you would need some knowledge of programming to tackle this project. However, it is here that we run into a problem: how should we best store this immense and complex amount of data? We have many dictionaries, each with many words; could we use what we've learned of lists and strings to store it? We certainly could have a list of strings, but then it would be difficult to associate them with their definitions. This is where the Pythonean dictionary can come in handy!

# define dictionaries
wiktionary = {}
 
 
aTableAlphabeticall = {}
 
# insert words, starting with, naturally, 'ventricle'
wiktionary['ventricle'] = '''1. (anatomy) One of two lower chambers of the heart.\
2. (anatomy) One of four cavities in the brain.'''
 
 
aTableAlphabeticall['ventricle'] = 'the stomacke which receiues the meate'

The trouble is, you've got thousands of dictionaries, and you're trying to make sense of them in relation to each other. This way, you'd have thousands upon thousands of lines of code just devoted to defining dictionaries, and afterwards, you've still got to refer to each of them individually in order to do any kind of analysis. This doesn't sound right. Wouldn't it be great to put all of those dictionaries into some kind of big list, perhaps chronologically arranged, so that you could tackle the data entry and analysis systematically? That's right, it would be great.

Similarly, we very often run into non-trivial data in biology. For instance, you might want a list of binding site positions for each of ten different transcription factors, or we might want to describe the single nucleotide polymorphisms in each of a thousand people. We often want to store several kinds of information about each position in a protein sequence. In order to accomplish these tasks and many others, we need fancier data structures: lists of lists, dictionaries of dictionaries, lists of dictionaries, even lists of dictionaries of dictionaries of lists of dictionaries of lists. Although we try to stay away from the latter.

This morning will be devoted to teaching you how to create fancy-pants data structures and how to combine them with tools that we're already learned, such as loops.

A Bit more on tuples


As we saw yesterday, a tuple is like a frozen list: once you've made a tuple, it is immutable, and you can't change it (although you can make a new one with similar elements. Since the keys of a dictionary need to be immutable, this is useful if you want to key a dictionary based on two things at once. The syntax for a tuple is much like a list, except instead of [square brackets], it's conventional to use (parentheses).

mytuple = (1,2)
# Actually, the parentheses are optional (but helpful for clarity)
mytuple = 1,2
 
mylist = [1,2]
 
mydict = {}
mydict[mytuple] = 'hello'
print mydict
mydict[mylist] = 'world'
print mydict
$ ./dict_tuple.py
{(1, 2): 'hello'}
Traceback (most recent call last):
File "./dict_tuple.py", line 12, in <module>
mydict[mylist] = 'world'
TypeError: unhashable type: 'list'

Also, if your tuple only has one item, you need to use a comma to make it clear that the tuple is a tuple and not just a value in parentheses:

tuple_A = ("Is this a tuple?")
tuple_B = ("What about this?", )
 
print tuple_A, type(tuple_A)
print tuple_B, type(tuple_B)
$./tuple_type.py
Is this a tuple? <type 'str'>
('What about this?',) <type 'tuple'>

Tuples actually underlie the syntax for the in-place swap:

a = 1
b = 2
a,b = b,a
print a, b
 
#Is equivalent to saying:
a = 1
b = 2
mytuple = (b,a)
a,b = mytuple

$ ./tuple_switch.py
1 2
2 1


We won't be using tuples extremely heavily, but be aware that if you ever have a function with multiple simultaneous return values, under the hood, it's actually combining them into a tuple.

Sets


A set is another nifty data structure. Like a set in mathematics, it has a bunch of elements with no repeats. To build a set, you pass in a list, and it will automatically remove duplicates

myset = set([1,1,2,3,5,8])
print myset
 
# You can also use the new syntax:
myset = {1,1,2,3,5,8}
print myset
 
# or you can build it up as you go along:
myset = set()
myset.add(1)
myset.add(1)
myset.add(2)
myset.add(3)
myset.add(5)
myset.add(8)
print myset
$./simple_set.py
set([8, 1, 2, 3, 5])
set([8, 1, 2, 3, 5])
set([8, 1, 2, 3, 5])


Sets make doing Venn diagrams of your data really easy:

UCs = set(["UC Berkeley", "UCLA", "UCSF", "UCSB", "UC Riverside", "UC Davis",
           "UCSD", "UC Santa Cruz", "UC Irvine", "UC Merced"])
bay_area_schools = set(["UC Berkeley", "Stanford", "UCSF", "San Jose State",
           "Santa Clara University", "University of San Francisco" ])
 
bay_area_UCs = UCs.intersection(bay_area_schools)
print "Bay Area UCs", bay_area_UCs
non_bay_area_UCs = UCs.difference(bay_area_schools)
print "UCs in less awesome places", non_bay_area_UCs
bay_area_non_UCs = bay_area_schools.difference(UCs)
print "Bay area schools that aren't UCs", bay_area_non_UCs
 
california_schools = UCs.union(bay_area_schools)
$./school_sets.py
Bay Area UCs set(['UC Berkeley', 'UCSF'])
UCs in less awesome places set(['UC Santa Cruz', 'UC Irvine', 'UCLA', 'UC Davis', 'UC Riverside', 'UCSD', 'UCSB', 'UC Merced'])
Bay area schools that aren't UCs set(['University of San Francisco', 'San Jose State', 'Santa Clara University', 'Stanford'])


There are also ways to remove things from a set. Let's say that due to rampant grade inflation and falling academic standards, that school south of here loses it's accreditation:

bay_area_schools.remove("Stanford")
# or
bay_area_schools.discard("Stanford")

The difference between remove and discard is that discard doesn't care whether or not the item is already in the set or not, whereas if the item isn't in the set, remove will give an error.

There are other set functions, but this covers most of what I use them for, and should put you in good stead as well.

Lists of lists


Let's create a few lists. Being self-centered, let's make the list of our lives:

#!/usr/bin/env python
 
# make a couple lists
coffee = ['yalis','peoples','brewedawakening']
lab = ['stanley','barker','LSA','Koshland','VLSB']

As it turns out, you can pretty much put anything into a list that you like, including other lists. Accessing and changing individual elements has a fairly straightforward syntax:

# make a list of lists!
importantPlaces = [coffee,lab]
 
# another way to write it
importantPlaces = [['yalis','peoples','brewedawakening'],
                   ['stanley','barker','LSA', 'Koshland','VLSB']]
# Note that you're allowed to split long lines if it's inside of a parenthesis,
# bracket, or curly brace
 
aList = importantPlaces[0]
print aList
aPlace = aList[0]
print aPlace
# or
aPlace = importantPlaces[0][0]  # aPlace is 'yalis'
importantPlaces[0][0] = 'peets'
 
anotherPlace = importantPlaces[0][0]

$./life_list.py
['yalis', 'peoples', 'brewedawakening']
yalis


All of the list operations from yesterday still apply:

# make a list of lists!
home = ['rockridge','the mission','lake merritt','gourmet ghetto']
importantPlaces.append(home)  # important places is now [coffee,lab,home]
importantPlaces[2].append('bushrod park')  # home now includes 'bushrod park'
 
print importantPlaces[2]
print home

$./life_list.py
['rockridge', 'the mission', 'lake merritt', 'gourmet ghetto', 'bushrod park']
['rockridge', 'the mission', 'lake merritt', 'gourmet ghetto', 'bushrod park']

Take note of that very last step: although you made a change while naming importantPlaces, you also changed home. That's because they're one and the same: importantPlaces[2] is actually a location in the warehouse, and Python knows when it sees a location to pretend like that thing is inside the list. How would you make importantPlaces carry a copy of home instead of home itself?

# make a list of lists!
importantPlaces.append([])  # important places is now [coffee,lab,[]]
importantPlaces[2].append(home[0])
importantPlaces[2].append(home[1])
...
 
# or, using a shorthand:
importantPlaces.append(home[:])
importantPlaces[2][3] = 'alcatel'
# replace 'gourmet ghetto' with name of landmark liquor store
print importantPlaces
print home
$./life_list.py
[['peets', 'peoples', 'brewedawakening'], ['stanley', 'barker', 'LSA', 'Koshland', 'VLSB'], ['rockridge', 'the mission', 'lake merritt', 'alcatel']]
['rockridge', 'the mission', 'lake merritt', 'gourmet ghetto']





This may seem like a subtlety now, and it some ways it is, but keep it in mind when you start coding and start seeing errors. Often a copy is what you want.

Dictionaries of dictionaries


Dictionaries of dictionaries work the same way. Let's return to the example used in the introduction:

#!/usr/bin/env python
 
# we keep our definitions in a dictionary
definitions = {}
 
# and, as above, we have several dictionaries in this dictionary
definitions['wiktionary'] = {}
definitions['aTableAlphabeticall'] = {}
 
print definitions
 
# and each dictionary is full of definitions
definitions['aTableAlphabeticall']['statue'] = 'an image of wood, or any other matter'
definitions['aTableAlphabeticall']['vapor'] = 'moisture, ayre, hote breath, or reaking'
 
print definitions
$ ./dict_dict.py
{'wiktionary': {}, 'aTableAlphabeticall': {}}
{'wiktionary': {}, 'aTableAlphabeticall': {'vapor': 'moisture, ayre, hote breath, or reaking', 'statue': 'an image of wood, or any other matter'}}

Lists of dictionaries, dictionaries of lists


Can we mix data structures? Of course! The main thing to keep in mind (besides what your data structure actually looks like) is that if you're building something up from scratch, you need to match the square brackets and curly braces:

a = [{},{}, ]  # a is a list of empty dictionaries
b = {'l1': [], 'l2': []} # b is a dictionary of empty lists
c = [{]}  # c is a Syntax Error!
d = {[}] # so is D!

Finally, everything that we've learned so far can be elaborated to even more complex data structures. After, all, there are often multiple definitions for a word...

# let's make the ventricle entry a little more sensible
listOfDictionaries = [{},{}]
listOfDictionaries[1]['ventricle'] = []
listOfDictionaries[1]['ventricle'].append('One of two lower chambers of the heart')
listOfDictionaries[1]['ventricle'].append('One of four cavities in the brain')
 
print listOfDictionaries
$./list_dict.py
[{}, {'ventricle': ['One of two lower chambers of the heart', 'One of four cavities in the brain']}]


Also, you can put different types of things into the same level of a list or dictionary, but I don't recommend it:

list_of_stuff = [[], 1, "Hello", {'name': 'Batman'}]
#This is totally legal, but just try writing a for loop that handles each one correctly!
 
for thing in list_of_stuff: print thing
# This is about the only thing I can think of that would work on
# all of the things in that list...

Fancy loops for fancy data structures


You are also allowed to loop over a dictionary. Be aware that it loops over the keys of the dictionary (i.e. the ones you put in brackets to get the values), and that you shouldn't count on any particular ordering to those keys:

knights = {'gallahad': 'the pure', 'robin': 'the brave'}
for x in knights:
    print x
    print knights[x]

$./loops.py
gallahad
the pure
robin
the brave

It turns out that you can put just about any valid Python code you want inside of a for loop. That happens to include other for loops! This is really useful for our fancy nested structures:

liLi = [[1,2,3],[7,8,9]]
for x in liLi:
    print x
    for y in x:
        print y
    print
$ ./list_loops.py
[1, 2, 3]
1
2
3

[7, 8, 9]
7
8
9




Exercises


1. A dictionary of dictionaries.


a) You want to store sequence information from human, mouse, and rat. Each species has a set of gene names corresponding to a set of sequences-- make a dictionary that has keys 'human', 'mouse', and 'rat', with each key having as a value a dictionary with gene names as keys and sequences as values.

The data are:
Human genes:
'TallnessGene' has sequence 'AATAGCAG'
'SmartnessGene' has sequence 'TGACGGA'

Mouse genes:
'FuzzynessGene' has sequence 'CCCCCCA'
'BeadyLittleEyesGene' has sequence 'ATAGCGC'

Rat genes:
'FuzzynessGene' has sequence 'CCCTCCA'
'BiggerThanMouseGene' has sequence 'GGACAATT'

b) Using the 'keys' method, print the names of the human genes.
c) Print out the sequence of FuzzynessGene in rat and mouse.
d) Print the third nucleotide of the human tallness gene.

2. A list of lists.


You want to store the results of three different time series experiments, each with four data points. You should do this by creating a list with three elements. Each of these elements is a four element list. There are a few ways to do this, which are outlined in parts (a), (b), and (c) below.

run1: 2,3,5,5
run2: 2,2,4,5
run3: 3,3,4,6

a) Create an empty list. Use the append method three times to make this a list of three empty lists. For each of these empty lists, use the append method four times to populate the list of lists with the data above.

b) Create a list of three empty lists without using either 'append' or '*' (you will run into a complication of using '*' in this context in the last exercise today). For each of these empty lists, use the append method four times to populate the list of lists with the data above-- feel free to copy and paste this part from your solution to part (a).

c) Create your list of lists, complete with data, all in one line.

3. A dictionary of lists.


Here, you have some structural data for several different species-- you want to store the B-factor of each position in several orthologous proteins. I picture this as a dictionary of lists: the dictionary is keyed with the species name, which has as a value a list that relates the position to the B-factor.

Human: 5,4,6,7
Mouse: 8,12,11,14
Rat: 10,11,13,15

a) Create an empty dictionary. Create three key-value pairs with the names as keys and empty lists as values. Use the append method to populate this dictionary of lists with data.

b) Create the populated dictionary all in one line, as in problem 2c.

4. Run Lola Run


Return to the time point data from problem two. Make three lists called 'run1', 'run2', and 'run3', and from them make a list of lists:

run1 = [2,3,5,5]
run2 = [2,2,4,5]
run3 = [3,3,4,6]

listOfRuns = [run1,run2,run3]

a) Alter the first element of run1. Has anything happened to listOfRuns?
b) Specify listOfRuns similarly, but using copies of run1, run2, and run3 instead of the lists themselves. Now, alter the first element of run1 and see what has happened to listOfRuns.

5. Multiplying Lists.

Use '*' (multiplication) to create a list of three empty lists. Use the append method to add the element "i belong first!" to the first list of the list of lists. Print out the list of lists. What has happened? This is hard-- don't worry about getting a TA.

Solutions:

Check here for more solutions - credit to Avi.

1. A dictionary of dictionaries.


 
##  (A)
 
##create a top level dictionary
 
species = {}
species['human'] = {}
species['mouse'] = {}
species['rat'] = {}
 
## add human data
species['human']['TallnessGene'] = 'AATAGCAG'
species['human']['SmartnessGene'] = 'TGACGGA'
 
## add mouse data
species['mouse']['FuzzynessGene'] = 'CCCCCCA'
species['mouse']['BeadyLittleEyesGene'] = 'ATAGCGC'
 
## add rat data
species['rat']['FuzzynessGene'] = 'CCCTCCA'
species['rat']['BiggerThanMouseGene'] = 'GGACAATT'
 
##   (B)
print species['human'].keys()
 
 
##   (C)
print species['mouse']['FuzzynessGene'],species['rat']['FuzzynessGene']
 
##   (D)
print species['human']['TallnessGene'][2]
 
 


2. A list of lists.


 
##    (A)
 
ListA = []
ListA.append([])
ListA.append([])
ListA.append([])
 
print ListA
 
ListA[0].append(2)
ListA[0].append(3)
ListA[0].append(5)
ListA[0].append(5)
 
print ListA
 
ListA[1].append(2)
ListA[1].append(2)
ListA[1].append(4)
ListA[1].append(5)
 
print ListA
 
ListA[2].append(3)
ListA[2].append(3)
ListA[2].append(4)
ListA[2].append(6)
 
print ListA
 
##    (B)
 
ListB = [[], [], []]
 
ListB[0].append(2)
ListB[0].append(3)
ListB[0].append(5)
ListB[0].append(5)
 
ListB[1].append(2)
ListB[1].append(2)
ListB[1].append(4)
ListB[1].append(5)
 
ListB[2].append(3)
ListB[2].append(3)
ListB[2].append(4)
ListB[2].append(6)
 
print ListB
 
##    (C)
 
ListC = [ [2,3,5,5] , [2,2,4,5], [3,3,4,6]]
print ListC
 
 


3. A dictionary of lists.


##    (A)
 
B_factor_dict = {}
 
B_factor_dict['human'] = []
B_factor_dict['mouse'] = []
B_factor_dict['rat'] = []
 
B_factor_dict['human'].append(5)
B_factor_dict['human'].append(4)
B_factor_dict['human'].append(6)
B_factor_dict['human'].append(7)
 
B_factor_dict['mouse'].append(8)
B_factor_dict['mouse'].append(12)
B_factor_dict['mouse'].append(11)
B_factor_dict['mouse'].append(14)
 
B_factor_dict['rat'].append(10)
B_factor_dict['rat'].append(11)
B_factor_dict['rat'].append(13)
B_factor_dict['rat'].append(15)
 
print B_factor_dict
 
 
 
##    (B)
B_factor_dict = {'human' : [5,4,6,7],'mouse' : [8,12,11,14],'rat' : [10,11,13,15]}
 
print B_factor_dict
 


4. Run Lola Run


 
##    (A)
 
run1 = [2,3,5,5]
run2 = [2,2,4,5]
run3 = [3,3,4,6]
 
listOfRuns = [run1,run2,run3]
 
print listOfRuns
 
run1[0] = 9001
 
print listOfRuns
 
##changing the first element of run1 also changed listOfRuns
 
 
 
##    (B)
 
run1 = [2,3,5,5]
run2 = [2,2,4,5]
run3 = [3,3,4,6]
 
listOfRuns = [run1[:],run2[:],run3[:]]
print listOfRuns
run1[0] = 9001
 
print listOfRuns
print run1
##The list remains the same
 
 

5. Multiplying Lists.

List = 3*[[]]
print List
List[0].append('i belong first')
 
print List





java -Xmx8g -Djava.io.tmpdir=strach/ -jar /home/james/Programs/GenomeAnalysisTK.jar -l INFO -R ~/ReferenceGenomes/StickleMasked/stickleMasked.fa -T UnifiedGenotyper --genotype_likelihoods_model BOTH -I AG07A_final.bam -I AG07B_final.bam -I AG07C_final.bam -I AG07D_final.bam -I AG07E_final.bam -I AG07F_final.bam -I AG08A_final.bam -I AG08B_final.bam -I AG08C_final.bam -I AG08D_final.bam -I AG08E_final.bam -I AG08F_final.bam -I AMG01_final.bam -I AMG02_final.bam -I PCAG5_final.bam -I PCAG6_final.bam -I PCAG7_final.bam -I PCAG8_final.bam -I JCH002A_final.bam -I JCH002B_final.bam -I JCH002C_final.bam -I JCH002D_final.bam -I JCH002E_final.bam -I JCH002F_final.bam -o MOAV_1020414.vcf &