First Morning, Introductions, and Lesson Plans



Your instructors are: Aisha Ellahi, Melinda Yang, James Hart, Fernando Racimo, and Courtney French.

Your teaching assistants are: Christopher Hann-Soden, Gavin Schlissel, Avi Flamholz, David Detomaso, and Jeffrey Spence.


Topics:

  • Expectations for the course
  • Navigating the UNIX shell
  • Viewing the content of files
  • How to get help
  • Text editors


Introduction


Welcome to the QB3 Introduction to Programming for Bioinformatics bootcamp!

Overview for today:

1) Introduce the instructors and TAs.
2) Course goals: what we will cover and what we will not cover.
3) Why should biologists know how to code?
4) Class structure & organization
5) Learn how to function in a UNIX environment.

What are you going to learn?

You are going to learn Python :O)

Python is a simple and powerful programming language that is used for many applications, from simple tasks to large software development projects. It has become popular as both a first language for beginning students and an everyday one for advanced programmers. Python is used by a range of companies including Netflix, Google, Microsoft, YouTube, and Industrial Light & Magic.


Our goal is to show you how to apply programming to the problems and tasks that you face in the lab. By the end of this course, you will be able to do the following:
  • Extract data from large files.
  • Parse a .fasta file and translate a file of sequences.
  • Access biological databases using Python and grab the information you need (for example, get DNA sequences from Genbank).
  • Organize your data in Python's data structures and write it out to files.
  • Perform automated tasks quickly and efficiently.
  • Make the computer make decisions for you.
  • Understand the basic theory behind commonly used bioinformatics tools and how to use them (for example, samtools, BLAST, bowtie, and cufflinks).
  • Apply statistical tests to your data.
  • Make publication-quality figures with your data.

Although we will mostly focus on programming as it pertains to biology, our aim is for you to leave this course with a sufficiently generalized knowledge of programming (and the confidence to read the manuals) that you will be able to apply your skills to whatever you happen to be working on.

What are you not going to learn?


With only two weeks to teach this course, there are many topics that we are not going to cover (for example, classes and unit testing). This course is intended to be an introduction to the basics with a focus on the practical uses of coding in biology. From this course, you will gain the ability to at least "talk the talk" and understand where and how to seek out more information should you choose to go further in your development as a "coder." If you want to learn more advanced things, your best bet is to take a full class on coding in a more formal setting to understand the theory and delve into more applied skills (like software development). We are (primarily) experimental biologists who use programming to manage and handle large data sets, allowing us to speed up tasks, efficiently use existing software to answer biological questions, build pipelines for data analysis and management, and perform basic statistical tests and analysis. If you are interested in bioinformatics theory or algorithm development, this is NOT your class!

Why Learn How To Code?


So if we're not going to become seasoned programmers or bioinformaticians after two weeks, why learn how to code in the first place? The answer is that as a scientist, the ability to work with and understand your own data makes you a stronger scientist. We are now squarely in the age of "-omics" and big data, and unfortunately, it doesn't look like the clock is turning back anytime soon! When you can interrogate your own data, you understand its features (both strengths and limitations) and can therefore confidently know how well it addresses a biological question. Knowing how to code gives you back the power over your own data, instead of having to rely on someone else to tell you what it means. You can ask better questions, design better experiments, and therefore better answer the specific biological questions that interest you.

Things to Keep in Mind


Coding is Hard


Coding has a STEEP learning curve! IT'S HARD! Learning to program is like learning a new language. Furthermore, the world of programming has its own culture and lexicon, and for novice coders, it can be a little intimidating. Learning a new language requires adjusting the way you think about solving a problem and communicating that solution. Even more confusing, each programmer develops his/her own style (accent). In practice, this means that there is almost never a "RIGHT ANSWER," but rather that there are almost infinite ways to solve almost any problem. The good news is if you like problem solving, you'll love this course! The bad news is that it will be hard. Don't despair! Our goal is to create a safe space for you to immerse yourself in learning this important skill. Ask questions, be engaged, and keep coming to class, even if you're behind! Bring a stuffed animal if it helps.

In the course of the class, you will be exposed to several styles and see several ways to get to more complicated solutions. The ultimate goal, though, is to give you the tools to begin to develop a style of your own. Later on, you may also want to read PEP 8, which is a style guide for Python, much like the vaunted Strunk and White is a style guide for English.

You have two incredibly useful resources at your disposal during the labs: first, you have us, the TAs and instructors, who are all familiar with the language and here to help you out. Second, you have the extensive documentation about Python and programming that we will be introducing you to in the course of the class. Here are links to some of these resources:

Learning Python
Python Pocket Reference
Python Website (documentation)
Linux Pocket Guide
Python Code Visualization

Python is BIG


Thousands of lab scientists, computer scientists, and programmers have used it, contributed to it, and extended it to their own sub-fields. And they keep improving, and thus changing, it! We couldn't get to all of it even if we had much longer than two weeks. The subjects we will not cover in any depth include object-oriented programming, writing parallel programs, integrating your code with faster code written in C or C++, and a host of other powerful-but-subtle methods and topics. We will teach you enough that you should be able to go learn about them if and when you want.

Python is evolving


Like most things in the computing world, things move fast and change is constant. At the moment, there are two major versions of Python available: Python 2.7 and Python 3.4. In this course, we'll be using Python 2.7. Why aren't we using the more up-to-date Python 3? There's a bit of history: For most of Python's lifetime, each new version of the language would introduce new features, but would try very, very hard to not break any code that other people had already written; in other words, most changes were backwards compatible. In about 2006, however, Python's creator and "Benevolent Dictator For Life" Guido van Rossum decided that there were a number of things that he had gotten wrong in the original Python specification, and would like to change. Making those changes would break other people's code, so it was decided to make them all at once. Although that version was released almost 6 years ago, there are still some add-ons to the core language that haven't completely switched (though this number is decreasing constantly). The number of "gotchas" between Python 2 and 3 is relatively small, and we'll try to point out where there are differences, so if you do decide to make the leap, things should go relatively smoothly.


How are you going to learn it?


This year, we've changed how the course is run: we've doubled the class size. Historically, this has been a small class of about 25-30 students, in an intimate classroom setting, where the students got a lot of one-on-one time with instructors and TAs. While that format works great, the downside was that we could not accommodate everyone who was interested in the course. Every year, we get over a hundred responses to our course announcement, and as you may have found from experience, spaces fill up FAST. Clearly, this class is in high demand, and this year, we decided to cut into that demand and see how well a larger class would work. The good news is that we've been able to enroll more students. The bad news is that each of you gets less one-on-one time with us and there are fewer TAs to go around. You are the guinea pigs, and we ask that you be patient with us as we work through the rough patches that we'll likely encounter. Every year has rough patches of its own, and this year will be no different.

The course is broadly divided into two parts.

Week 1: Learning Python Basics

In the first week of the course we will learn the very basics of programming practice and the fundamentals of Python syntax, including:

- how to get information from files
- how to store information
- how to do interesting and complicated things with the information
- how to print information back out
- how to incorporate other people's code to do more, faster, with less effort

Week 2: Python Applied to Bioinformatics

In the second week, we will use real data from an RNA-Seq experiment for a range of biological applications. The second week will show us:

- incredibly useful modules for scientists (and biologists) using Python
- how to build and manage a data analysis pipeline
- how to call other programs from within our Python code
- how to perform scientific data analysis
- how to visualize our data

Our daily schedule will generally proceed like this:

Start at 8:30 am
1-2 hour lecture
2-2.5 hours lab for exercises

Lunch from 12:15 - 1

1-2 hour lecture
2-3 hours lab for exercises
Leave at 5 pm

You will have a number of exercises each day covering the breadth of the lectures. You will not be graded on these but it's REALLY important for you to be able to demonstrate what you've just learned. Just like learning French, one learns programming by doing. If you only finish half the problems, you've only really learned half the material. You get what you put in.

Questions?

Using the UNIX shell


You will spend nearly all of your time in one of three places: the shell, the text editor, or the interactive interpreter. The shell allows you to move and copy files, run programs, and more, while the text editor is where you will write your programs. We will focus mostly on the shell this morning and touch on the basic usage of a popular text editor, aquamacs. We will begin Python this afternoon.

A large fraction of what you can do in the shell (also called the "command line") can be done using the windowed operating system you're used to. For the simplest of tasks, the command line may seem like a step backwards, but for anything even mildly more complicated (for example, "move every file with 2012 anywhere in its name to the folder Backup"), it can save a lot of time. And then there are the programs that can only be run from the command line, which are much easier to write and more flexible in what they can do.
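As a rough illustration of the "move every file with 2012 in its name" task, here is a minimal sketch. The file names are made up for the example, it runs in a throwaway directory created with mktemp so that no real files are touched, and it relies on the * wildcard that is covered later in this lesson. The $ prompts are left out so the lines can be pasted directly.

```shell
# Work in a throwaway directory (hypothetical setup for this sketch).
workdir=$(mktemp -d)
cd "$workdir"

# Made-up example files: two with 2012 in the name, one without.
touch report_2012.txt 2012_gels.txt notes_2013.txt
mkdir Backup

# *2012* matches every name that contains 2012 anywhere in it.
mv *2012* Backup/

ls Backup   # the two 2012 files
ls          # Backup and notes_2013.txt remain here
```

Doing the same thing by dragging files in a window gets tedious once there are hundreds of them; the single mv line stays the same.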


Informative Interlude: Some notes on the formatting of the lessons for this course


Periodically in these lessons, we may stop with an informative interlude outlined with a horizontal
line above and below (like the one two lines up!). In this case, we're taking a quick break to discuss
this and other aspects of the formatting.

For this and all further examples, a $ represents your shell prompt, and boldface indicates the
commands to type at the prompt. Italics will be used for output you should see when you take the
described action.

Finally, when we use actual Python code examples, they will be contained in the shaded boxes,
such as:

This is where code will appear.

You'll notice that some of the words are in different colors. These words mean special things in Python. The wiki software understands that, and will color the code to make the structure more clear. Many editors like aquamacs also have a "syntax highlighting" feature, and this can actually be a useful hint when something inevitably goes wrong.

This concludes our first informative interlude.


Let's start by opening a new terminal window...

How do I move around?


One of the basic concepts is that your shell is always based somewhere in the directory structure.
An analogy here is if you have only a single window open (e.g. in the Finder on a Mac, or the
directory browser in Linux).

pwd
[where am I?]
(Print Working Directory) Prints the directory in which you are at the current moment. If you create any files, they will appear in this spot. When you first open the terminal shell, you will be in your "home" directory.

$ pwd
/Users/aellahi

cd
[move to a new directory]
(Change Directory) Given a path, this command moves your "current location" to the specified directory.

$ cd PythonCourse

$ pwd
/Users/aellahi/PythonCourse

To go up, use the command cd ..

$ cd ..
$ pwd
/Users/aellahi

Thus far, these have been relative paths (i.e. relative to your current directory), but you can also use an absolute path (which will start with a /):

$ pwd
/Users/aellahi
$ cd /Library/Frameworks/Python.framework/Versions/Current/bin/

A shortcut for your home directory is ~:
$ cd ~
$ pwd
/Users/aellahi

And you can use these as part of a path as well:

$ cd ~/PythonCourse

Another way to get to the home directory is to simply type "cd":

$ cd
$ pwd
/Users/aellahi

An aside on directories...
Directories in UNIX are set up the same way as your regular computer. Just as you would open up a window into your directories and click to open up folders, here you use cd to go through the directories. You are just typing the command instead of clicking.
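To practice the difference between relative and absolute paths without touching your real files, here is a small sketch. It assumes a scratch directory made with mktemp, and it uses mkdir (introduced a little further on) with the -p flag to create nested directories in one step; the directory names simply mirror the ones used in this lesson.

```shell
# Scratch directory tree for the sketch (not your real PythonCourse).
workdir=$(mktemp -d)
mkdir -p "$workdir/PythonCourse/data"

cd "$workdir/PythonCourse/data"   # absolute path: starts with /
pwd

cd ..        # relative path: up one level, into PythonCourse
pwd

cd data      # relative path: back down into data
cd ../..     # relative path: up two levels at once
here=$(pwd)
echo "$here" # back at the top of the scratch tree
```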

ls directory_path
[lists contents of a directory] (LiSt) Shows the files and directories.

$ cd ~/PythonCourse
$ ls
ProgramFiles/
install_test.py
macos_install.txt

ls has many options. Here are some of the more useful ones to know:

ls -l
[lists the long form of the directory entries: permissions, owners, sizes, and modification dates]

ls -lh
[lists the long form of the directory entries, but with the sizes in a human-readable format (i.e. MB and GB instead of the number of bytes)]

ls -lt
[shows long listing, and sorts by modification time]

ls -lr
[shows long listing in reverse order]

ls ..
[list contents of the directory above]

ls A_PATH
[list contents in the directory specified by A_PATH, which can be either relative or absolute.]

$ ls -ltr
[combine -l, -t, -r options]
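The sorting options are easier to see with files whose ages are known. The sketch below is only an illustration: it creates three made-up files in a scratch directory and uses touch -t (not covered in this lesson) to give them fixed modification times.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# touch -t sets the modification time (format: YYYYMMDDhhmm).
touch -t 202001010000 old.txt
touch -t 202101010000 mid.txt
touch -t 202201010000 new.txt

newest_first=$(ls -t)    # -t sorts by modification time, newest first
oldest_first=$(ls -tr)   # adding -r reverses the order

echo "$newest_first"
echo "$oldest_first"
ls -ltr                  # long listing, oldest first
```

The -ltr combination is handy in a busy directory because the files you touched most recently end up at the bottom, right above your prompt.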

Making your mark...


$ cd ~/PythonCourse

mkdir directory_name
[Create a given directory] (MaKe DIRectory) Exactly what it says - lets you create new directories.

$ mkdir S1.1
$ cd S1.1
$ echo 'Hello World' > python_notes.txt
$ ls
python_notes.txt
$ mkdir data
$ ls
data python_notes.txt

cp original_name copy_name
[copy file or directory] (CoPy) Create a copy of the original file.

$ cp python_notes.txt python_notes2.txt
$ ls
data/
python_notes.txt
python_notes2.txt

$ echo 'To Do' > project_notes.txt
$ ls
data/
project_notes.txt
python_notes.txt
python_notes2.txt

$ cp project_notes.txt backup.txt
$ ls
backup.txt
data/
project_notes.txt
python_notes.txt
python_notes2.txt

mv source destination
[move files or directories] (MoVe) Rename a file or directory. Renaming is the same as moving within the same directory.

$ mv backup.txt project_notes.txt

$ ls
data/
project_notes.txt
python_notes.txt
python_notes2.txt
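Two details of cp and mv are worth trying out: moving a file into a directory, and copying a whole directory. The sketch below uses a scratch directory and made-up names; the -r (recursive) flag for cp is not described above, but cp needs it before it will copy a directory.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir data
echo 'To Do' > project_notes.txt

# If the destination is an existing directory, mv moves the file into it
# and keeps the file's name.
mv project_notes.txt data/

# Copying a directory and everything inside it requires -r.
cp -r data data_backup

ls data data_backup
```

Note that both cp and mv replace an existing destination file without asking, which is what happened to the old project_notes.txt in the mv example above. Adding the -i flag makes them ask before overwriting.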


Peeking inside files


less file_name
[view contents of a file] less shows the contents of a file, and allows you to scroll and search the contents. However, less can only be used for simple text files, so you cannot reliably view contents of, say, MS Word documents with less. Fortunately, most of the files we'll be dealing with will be plain text files.

So let's see how this works. Download this Pythons of the World text file and save it to your ~/PythonCourse directory. To view the file, type:

$ less pythons_of_the_world.txt

The Pythonidae, commonly known simply as pythons, from the Greek word Python (πυθων), are a family of nonvenomous (though see the section "Toxins" below) snakes found in Africa, Asia and Australia. Among its members are some of the largest snakes in the world. Eight genera and 26 species are currently recognized.[2]

Contents [hide]
1 Geographic range
2 Conservation
3 Behavior
4 Feeding
5 Toxins
6 Reproduction
7 Captivity
8 Genera
9 Taxonomy
10 References
11 External links
Geographic range[edit]
Pythons are found in sub-Saharan Africa, Nepal, India, Sri Lanka, Burma, southern China, Southeast Asia and from the Philippines southeast through Indonesia to New Guinea and Australia.[1]

Some useful navigational tips for less:
- Use the "enter" key to progress one line at a time through the text.
- You can use the arrow keys to move up or down a line in the text.
- The spacebar will advance an entire page.
- You can search for a word by typing a slash (/) followed by the search word.
- To quit, type q.
- To see the full help screen, type h.



Optional Informative Interlude: UNIX names tend to be overly clever.


As you've seen with the basic commands thus far, the names are generally descriptive abbreviations of the program's function. For example, mkdir is for making a directory, ls is for listing the contents of a directory, etc. However, programmers, especially UNIX programmers, tend to get increasingly clever as things progress. Unaware of the fact that this practice makes things opaque, the typical programmer cries out for attention by making program names self-referentially clever. less is a good example of this. In the olden days, the most basic ways to view a text file could not divide files into individual pages, so a multipage document would scroll off the screen before the first page could be read. As a solution, a program called more was written, which paused at the bottom of each page and prompted the user to press the spacebar for "more." The program name here is reasonably descriptive, but more had some noticeable feature deficiencies: you could neither advance the text one line at a time nor navigate backward in the document without reloading the whole file. The program written to accommodate these features is less. The cleverness of the name is revealed by the paradoxical adage "less is more." Your teachers and TAs may use the more command interchangeably with less throughout the class.



head filename
[print first 10 lines of the file]

By default, head prints the top 10 lines of the input file. To print a different number of lines, say 12:
$ head -n 12 filename
$ head pythons_of_the_world.txt
The Pythonidae, commonly known simply as pythons, from the Greek word Python (πυθων), are a family of nonvenomous (though see the section "Toxins" below) snakes found in Africa, Asia and Australia. Among its members are some of the largest snakes in the world. Eight genera and 26 species are currently recognized.[2]

Contents [hide]
1 Geographic range
2 Conservation
3 Behavior
4 Feeding
5 Toxins
6 Reproduction
7 Captivity



tail filename
[print the last ten lines of the file]

$ tail pythons_of_the_world.txt
^ Jump up to: a b c d e McDiarmid RW, Campbell JA, Touré T. 1999. Snake Species of the World: A Taxonomic and Geographic Reference, vol. 1. Herpetologists' League. 511 pp. ISBN 1-893777-00-6 (series). ISBN 1-893777-01-4 (volume).
^ Jump up to: a b c d e "Pythonidae". Integrated Taxonomic Information System. Retrieved 15 September 2007.
Jump up ^ "Huge, Freed Pet Pythons Invade Florida Everglades", National Geographic News. Accessed 16 September 2007.
Jump up ^ Hardy, David L. (1994). "A re-evaluation of suffocation as the cause of death during constriction by snakes". Herpetological Review 229: 45-47.
Jump up ^ Mehrtens JM. 1987. Living Snakes of the World in Color. New York: Sterling Publishers. 480 pp. ISBN 0-8069-6460-X.
Jump up ^ Stidworthy J. 1974. Snakes of the World. Grosset & Dunlap Inc. 160 pp. ISBN 0-448-11856-4.
Jump up ^ Carr A. 1963. The Reptiles. Life Nature Library. Time-Life Books, New York. 192 pp. LCCCN 63-12781.
Jump up ^ Bryan G. Fry, Nicolas Vidal, Janette A. Norman, Freek J. Vonk, Holger Scheib, S. F. Ryan Ramjan, Sanjaya Kuruppu, Kim Fung, S. Blair Hedges, Michael K. Richardson, Wayne. C. Hodgson, Vera Ignjatovic, Robyn Summerhayes, Elazar Kochva (2006). "Early evolution of the venom system in lizards and snakes". Nature 439 (7076): 584–588. doi:10.1038/nature04328. PMID 16292255.
Jump up ^ Bryan G. Fry, Eivind A. B. Undheim, Syed A. Ali, Jordan Debono, Holger Scheib, Tim Ruder, Timothy N. W. Jackson, David Morgenstern, Luke Cadwallader, Darryl Whitehead, Rob Nabuurs, Louise van der Weerd, Nicolas Vidal, Kim Roelants, Iwan Hendrikx, Sandy Pineda Gonzalez, Alun Jones, Glenn F. King, Agostinho Antunes, Kartik Sunagar (2013). "Squeezers and leaf-cutters: differential diversification and degeneration of the venom system in toxicoferan reptiles". Molecular & Cellular Proteomics 12 (7): 1881–1899. doi:10.1074/mcp.M112.023143.
Jump up ^ "The Keeping of Large Pythons" at Anapsid. Accessed 16 September 2007.
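The -n option works the same way for head and tail. As a quick sketch on a file where the answer is obvious, the lines below build a made-up 20-line file with seq (a standard command that is not covered in this lesson) and take a few lines from each end.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical test file: the numbers 1 to 20, one per line.
seq 1 20 > numbers.txt

first_three=$(head -n 3 numbers.txt)   # lines 1, 2, 3
last_two=$(tail -n 2 numbers.txt)      # lines 19, 20

echo "$first_three"
echo "$last_two"
```

Peeking at the top of a file with head is a cheap way to check its format before running anything slower on the whole file.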



cat file1 file2 ...
[print named files to the screen]
(conCATenate) If given just one file, cat will print the contents of the file to the screen. Given multiple files, it will print one after another.

Let's start by making two files, cat1.txt and cat2.txt:

$ echo 'HEY EVERYONE!!!' > cat1.txt
$ echo 'WISH I WAS OUTSIDE PLAYING :O(' > cat2.txt

To view the contents, type:

$ cat cat1.txt
HEY EVERYONE!!!
$ cat cat2.txt
WISH I WAS OUTSIDE PLAYING :O(
$ cat cat1.txt cat2.txt
HEY EVERYONE!!!
WISH I WAS OUTSIDE PLAYING :O(

grep 'search_string' file1 [file2 ...]
(Global Regular Expression Print)
Searches for the "search string" in a text file and prints out all lines where it finds the desired text. The search string can be a simple word, or a complicated specification of matches/mismatches.

$ grep "python" pythons_of_the_world.txt
The Pythonidae, commonly known simply as pythons, from the Greek word Python (πυθων), are a family of nonvenomous (though see the section "Toxins" below) snakes found in Africa, Asia and Australia. Among its members are some of the largest snakes in the world. Eight genera and 26 species are currently recognized.[2]
In the United States, an introduced population of Burmese pythons, Python molurus bivittatus, has existed as an invasive species in the Everglades National Park since the late 1990s.[3]
Many species have been hunted aggressively, which has decimated some, such as the Indian python, Python molurus.
Black-headed python,
Larger specimens usually eat animals about the size of a house cat, but larger food items are known: some large Asian species have been known to take down adult deer, and the African rock python, Python sebae, has been known to eat antelope. Prey is swallowed whole, and may take anywhere from several days or even weeks to fully digest.
Contrary to popular belief, even the larger species, such as the reticulated python, P. reticulatus, do not crush their prey to death; in fact, prey is not even noticeably deformed before it is swallowed. The speed with which the coils are applied is impressive and the force they exert may be significant, but death is caused by suffocation, with the victim not being able to move its ribs to breathe while it is being constricted.[5][6][7]
Apodora Kluge, 1993 1 0 Papuan python Most of New Guinea, from Misool to Fergusson Island
Bothrochilus Fitzinger, 1843 1 0 Bismark ringed python The islands of the Bismark Archipelago, including Umboi, New Britain, Gasmata (off the southern coast), Duke of York and nearby Mioko, New Ireland and nearby Tatau (off the east coast), the New Hanover Islands and Nissan Island
Leiopython Hubrecht, 1879 1 0 D'Albert's water python Most of New Guinea (below 1200 m), including the islands of Salawati and Biak, Normanby, Mussau, as well as a few islands in the Torres Strait
Carpet python,
Green tree python,
Albino Burmese python,
Borneo short-tailed python,


The -c argument counts the number of lines with a match (not the number of matches).

$ grep -c python pythons_of_the_world.txt
13

The -v argument inVerts the search (i.e. prints lines that *don't* contain your search string).
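Here is a small sketch of these grep options side by side. The three-line file is made up for the example, and the -i flag (ignore upper/lower case) is an extra that is not described above. Note that grep is case-sensitive by default, which is why "Python molurus" does not match the lowercase search.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Made-up example file with three lines.
printf 'Python molurus\nIndian python\nBoa constrictor\n' > snakes.txt

grep 'python' snakes.txt      # case-sensitive: only "Indian python"
grep -i 'python' snakes.txt   # -i ignores case: both python lines
grep -v 'python' snakes.txt   # lines that do NOT contain lowercase "python"
grep -c 'python' snakes.txt   # number of matching lines, not the lines themselves
```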

cut -f column_number(s) file

Many of the data files we'll be dealing with are actually tables, usually separated by tabs. The cut command will pull out the column numbers you specify and print them out to the shell, while leaving the original file alone.
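As a minimal sketch of cut, the lines below make a tiny tab-separated table and pull columns out of it. The file name and its contents (gene, chromosome, expression value) are invented for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical tab-separated table; \t is a tab, \n ends each row.
printf 'gene\tchrom\texpr\nSIR2\tIV\t12.5\nHMR\tIII\t3.1\n' > genes.tsv

cut -f 1 genes.tsv     # first column only
cut -f 1,3 genes.tsv   # first and third columns, still tab-separated
```

cut assumes tabs between columns unless told otherwise, and genes.tsv itself is unchanged afterwards.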

Special characters


wildcard matching with the *


The star functions as a "wild-card" character that matches any number of characters.

$ ls
cat1.txt cat2.txt pythons_of_the_world.txt
$ ls *.txt
cat1.txt cat2.txt pythons_of_the_world.txt

The star can go anywhere in a list of arguments you're supplying, even in the middle of words! There are other wildcards you can use but * is the most common.
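A few wildcard patterns to try, sketched in a scratch directory with empty made-up files. The ? wildcard shown at the end is one of the "other wildcards": it matches exactly one character.

```shell
workdir=$(mktemp -d)
cd "$workdir"
touch cat1.txt cat2.txt pythons_of_the_world.txt notes.md

ls *.txt      # every name ending in .txt (notes.md is left out)
ls cat*       # every name starting with cat
ls *of*       # star on both sides: names containing "of"
ls cat?.txt   # ? matches exactly one character: cat1.txt and cat2.txt
```

The shell expands the pattern into the list of matching names before the command runs, so wildcards work with cp, mv, grep and the rest, not just ls.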


pipe |

(on most US keyboards, typed with shift plus the backslash "\" key)

Piping with | connects UNIX commands, allowing the output of one command to "flow through the pipe" to another. This lets you chain programs together, such that each one only needs to worry about one step of the process (either generating, filtering, or modifying data), without knowing or caring where it came from or where it's going to.

$ env
TERM_PROGRAM=Apple_Terminal
SHELL=/bin/bash
TERM=xterm-256color
TMPDIR=/var/folders/fk/6tsd33ts4s9fjs00wj2n98lr0000gn/T/
PERL5LIB=/Volumes/RineData/vcftools_0.1.12a/perlalias
Apple_PubSub_Socket_Render=/tmp/launch-BjZJF7/Render
TERM_PROGRAM_VERSION=326
OLDPWD=/Users/aishaellahi85/PythonCourse/ProgramFiles
TERM_SESSION_ID=60157A77-F121-4665-960C-DC6E2927470A
USER=aishaellahi85
SSH_AUTH_SOCK=/tmp/launch-Sh9rw7/Listeners
CF_USER_TEXT_ENCODING=0x1F5:0:0
VIRTUAL_ENV=/Users/aishaellahi85/Library/Enthought/Canopy_32bit/User
PATH=/Library/Frameworks/Python.framework/Versions/Current/bin:/Users/aishaellahi85/Library/Enthought/Canopy_32bit/User/bin:/Library/Frameworks/Python.framework/Versions/2.7/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin:/Users/aishaellahi85/ncbi-blast-2.2.28+/bin:/Users/aishaellahi85/samtools-0.1.19:/Users/aishaellahi85/samtools-0.1.19/misc:/usr/local/git/bin/:/Users/aishaellahi85/mybin:/Volumes/RineData/vcftools_0.1.12a/bin:/Volumes/RineData/RineLabData/tabix-0.2.6
CHECKFIX1436934=1
MKL_NUM_THREADS=1
PWD=/Users/aishaellahi85/PythonCourse
LANG=en_US.UTF-8
PS1=(Canopy 32bit) \h:\W \u\$
SHLVL=1
HOME=/Users/aishaellahi85
PYTHONPATH=/Users/aishaellahi85/PythonCourse/pylib/:/Users/aishaellahi85/myPython_modules/:Library/Frameworks/R.framework/Versions/3.0/Resources/lib/:/Library/Frameworks/R.framework/Versions/3.0/Resources/bin/
LOGNAME=aishaellahi85
aquamacs=/Application/Aquamacs.app/Contents/MacOS/Aquamacs
R_HOME=/Library/Frameworks/R.framework/Resources/
DISPLAY=/tmp/launch-X9aUrf/org.macosforge.xquartz:0
_=/usr/bin/env



$ env | head
TERM_PROGRAM=Apple_Terminal
SHELL=/bin/bash
TERM=xterm-256color
TMPDIR=/var/folders/fk/6tsd33ts4s9fjs00wj2n98lr0000gn/T/
PERL5LIB=/Volumes/RineData/vcftools_0.1.12a/perlalias
Apple_PubSub_Socket_Render=/tmp/launch-BjZJF7/Render
TERM_PROGRAM_VERSION=326
TERM_SESSION_ID=E56F9C2B-CC40-47FA-A17B-A188095EA75F
USER=aishaellahi85
SSH_AUTH_SOCK=/tmp/launch-Sh9rw7/Listeners


$ env | grep HOME
HOME=/Users/aishaellahi85
R_HOME=/Library/Frameworks/R.framework/Resources/
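Pipes can chain any of the commands from this lesson. The sketch below uses a made-up four-line file; wc -l (count lines) is a standard command that is not covered above.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'Carpet python\nGreen tree python\nBoa constrictor\nBurmese python\n' > snakes.txt

# grep filters the lines, then head keeps only the first two matches.
grep 'python' snakes.txt | head -n 2

# grep filters the lines, then wc -l counts how many came through.
grep 'python' snakes.txt | wc -l
```

Each program only does its one step: grep knows nothing about counting, and wc knows nothing about searching.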


Redirection with >


In addition to redirecting output to another command, the results can be sent into a file with >:

$ cat cat1.txt cat2.txt > wishes.txt
$ cat wishes.txt
HEY EVERYONE!!!
WISH I WAS OUTSIDE PLAYING :O(

Or you can append to the end of a file with >>:
$ echo 'Just kidding, I love Programming!' >> wishes.txt
$ cat wishes.txt
HEY EVERYONE!!!
WISH I WAS OUTSIDE PLAYING :O(
Just kidding, I love Programming!
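One thing to watch out for: > replaces whatever was already in the file, while >> adds to it. A minimal sketch with a made-up file name:

```shell
workdir=$(mktemp -d)
cd "$workdir"

echo 'first' > log.txt     # creates log.txt with one line
echo 'second' >> log.txt   # appends: log.txt now has two lines
cat log.txt

echo 'third' > log.txt     # > starts over: the earlier lines are gone
cat log.txt
```

If you redirect into a file that holds results you care about, double check whether you meant > or >>.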

Permissions


Unlike the computers you are used to, UNIX doesn't automatically know what to do with files (e.g. it won't know to use Word to open a .doc document), and it doesn't even know whether a file is data or a program (and as we'll see with the programs we write, it might be different things at different times).

The first thing that controls a file is the file's permissions. You can control who can read, write, and execute (run as a program) each of your files.

$ ls -la


The first character tells you whether it is a directory (d) or a regular file (-).

The next nine characters tell you whether a file is readable (r), writable (w), or executable (x).

The 2nd-4th characters tell you what the file owner's permissions are (for your own files, that's *you*), the 5th-7th tell you what your group's permissions are, and the last three tell you what the rest of the world's permissions are. Unix was designed to be a multi-user operating system, so even if you're the only one who uses the computer, it maintains the distinction for you, versus your group, versus everyone else.

chmod [flags] [filename]
Modify permissions.

$ echo 'script' >script.py
$ ls -l script.py
-rw-r--r-- 1 aellahi staff 7 Jul 13 20:52 script.py
$ chmod +x script.py
$ ls -l script.py
-rwxr-xr-x 1 aellahi staff 7 Jul 13 20:52 script.py*

James is going to explain how to get UNIX to run your executable scripts this afternoon. However, if you try running a program and it's not working at some point in the class, double check the permissions!!!
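chmod can also change permissions for just one set of users. The sketch below is an illustration under a couple of assumptions: it sets umask 022 (the usual default, which decides the permissions of new files) so that the listing is predictable, and it uses the letters u (user/owner), g (group) and o (others) from the chmod manual.

```shell
workdir=$(mktemp -d)
cd "$workdir"
umask 022                       # assumption: typical default for new files

printf 'print("hello")\n' > script.py
ls -l script.py                 # starts as -rw-r--r--

chmod +x script.py              # add execute for owner, group and others
chmod go-x script.py            # take execute away from group and others

perms=$(ls -l script.py | cut -c1-10)
echo "$perms"                   # first ten characters of the long listing
```

After these two steps only the owner can run script.py, while everyone can still read it.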

Help, I'm stuck!


man command_name
[what does that command do again?]

Most commands have many useful flags beyond what I've shown you. For information on a particular command, look at the MANual pages with man.
$ man chmod
CHMOD(1)                  BSD General Commands Manual                 CHMOD(1)
 
NAME
     chmod -- change file modes or Access Control Lists
 
SYNOPSIS
     chmod [-fv] [-R [-H | -L | -P]] mode file ...
     chmod [-fv] [-R [-H | -L | -P]] [-a | +a | =a] ACE file ...
     chmod [-fhv] [-R [-H | -L | -P]] [-E] file ...
     chmod [-fhv] [-R [-H | -L | -P]] [-C] file ...
     chmod [-fhv] [-R [-H | -L | -P]] [-N] file ...
 
DESCRIPTION
     The chmod utility modifies the file mode bits of the listed files as
     specified by the mode operand. It may also be used to modify the Access
     Control Lists (ACLs) associated with the listed files.
 
     The generic options are as follows:
 
     -f      Do not display a diagnostic message if chmod could not modify the
             mode for file.
 
     -H      If the -R option is specified, symbolic links on the command line
             are followed.  (Symbolic links encountered in the tree traversal
             are not followed by default.)
 
     -h      If the file is a symbolic link, change the mode of the link
             itself rather than the file that the link points to.
 
     -L      If the -R option is specified, all symbolic links are followed.
 
     -P      If the -R option is specified, no symbolic links are followed.
             This is the default.
 
     -R      Change the modes of the file hierarchies rooted in the files
             instead of just the files themselves.
 
     -v      Cause chmod to be verbose, showing filenames as the mode is modi-
             fied.  If the -v flag is specified more than once, the old and
             new modes of the file will also be printed, in both octal and
             symbolic notation.
 
     The -H, -L and -P options are ignored unless the -R option is specified.
     In addition, these options override each other and the command's actions
     are determined by the last one specified.
 
     Only the owner of a file or the super-user is permitted to change the
     mode of a file.
...

Text Editors


Lastly, now that we can see into files, it would be nice to be able to create and edit our own files. And... our first lesson in programming accents: different programmers use different text editors. I am going to introduce three common options here. Each has pluses and minuses depending on your needs. There is no 'right' one, so play around and pick your favorite. Note that each of the teachers will be using their own favorite, so don't worry if they are using something different from yours. These can all operate in the terminal window, and for some quick edits, it may make sense to do it that way, although they also have standalone programs. Often, you'll want to keep an editor window open with your code, save, jump over to the terminal, and then run your code.

Program 1: vi
open a file: vi [filename]
start writing text (insert mode): i
return to normal mode: ESC
save a file: :w (from normal mode)
close: :q (from normal mode)

Vi is somewhat unusual in that it has a couple of major "modes". The default, "normal mode", is not actually the one where you write text, so you need to press i to go into "insert mode", then ESC to get back to normal mode, where you can save, search, etc. For more info, see: Introduction to Vi

Program 2: emacs
open a file: emacs [filename]
save a file: CTRL-X CTRL-S
close: CTRL-X CTRL-C

Emacs (including Aquamacs) has many, many short-cut keys or "accelerators." A quick Googling of "Emacs Cheat Sheet" will reveal several resources, such as this one from the Princeton CS department: Emacs Cheat Sheet

And then there is my personal favorite:
Program 3: Aquamacs
open a file: aquamacs [filename] &
save a file: CTRL-X CTRL-S
write a file (save under a new name): CTRL-X CTRL-W
close a file: command + W

Here are a few more helpful cheat sheets:
Unix commands cheat sheet
Emacs cheat sheet

Final Installation Check


Ok, now that we have a handle on the terminal, let's do a final installation check of the programs we had you install. In your terminal window, type the following commands:

$ ipython
[this should start ipython from whatever python distribution you installed]

Raise your hand if this does NOT work for you. Now, try:

$ aquamacs &

This command should start aquamacs and open a new text editor window.

To test git, type:

$ git
usage: git [--version] [--help] [-C <path>] [-c name=value]
[--exec-path[=<path>]] [--html-path] [--man-path] [--info-path]
[-p|--paginate|--no-pager] [--no-replace-objects] [--bare]
[--git-dir=<path>] [--work-tree=<path>] [--namespace=<name>]
<command> [<args>]

The most commonly used git commands are:
add Add file contents to the index
bisect Find by binary search the change that introduced a bug
branch List, create, or delete branches
checkout Checkout a branch or paths to the working tree
clone Clone a repository into a new directory
commit Record changes to the repository
diff Show changes between commits, commit and working tree, etc
fetch Download objects and refs from another repository
grep Print lines matching a pattern
init Create an empty Git repository or reinitialize an existing one
log Show commit logs
merge Join two or more development histories together
mv Move or rename a file, a directory, or a symlink
pull Fetch from and integrate with another repository or a local branch
push Update remote refs along with associated objects
rebase Forward-port local commits to the updated upstream head
reset Reset current HEAD to the specified state
rm Remove files from the working tree and from the index
show Show various types of objects
status Show the working tree status
tag Create, list, delete or verify a tag object signed with GPG

'git help -a' and 'git help -g' lists available subcommands and some
concept guides. See 'git help <command>' or 'git help <concept>'
to read about a specific subcommand or concept.


Raise your hand if any of these do NOT work for you.
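
If you'd rather check all three programs in one go, here is a minimal sketch. It assumes a bash-like shell; check_tool is a made-up helper name for this example, while command -v is the standard way to ask the shell whether it can find a program.

```shell
# check_tool is a hypothetical helper, not a standard command.
# It reports whether the program named in $1 can be found on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: NOT found"
    fi
}

# The three programs used in the installation check above.
for tool in ipython aquamacs git; do
    check_tool "$tool"
done
```

Anything reported as NOT found is either not installed or not on your PATH, so that is the one to raise your hand about.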

Questions?



Exercises


1) Cerevisiae chromosomes

a) In your top-level directory, make a new directory called "fasta_files" and change into it
b) Go to http://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/
and individually download each of the files ending in .fsa. These are the chromosomes of the yeast, S. cerevisiae. You may have to right-click these files depending on your web browser (and be aware, some browsers will save your file with a .txt extension).
c) Make a single whole genome file called "cerevisiae_genome.fasta"
d) Count the chromosomes in the whole genome file using commands from the lecture. (HINT: Each of the original FASTA files contains a single chromosome).
e) Look up the command 'wc' and find out what it does. Then get the size of the total genome. (HINT: The size of the genome can be determined by counting the number of characters not on the same line as a fasta header).

2) Cerevisiae genes

a) Get the list of cerevisiae chromosome features: http://downloads.yeastgenome.org/curation/chromosomal_feature/SGD_features.tab
Columns within SGD_features.tab:
 
1.   Primary Saccharomyces Genome Database ID (SGDID) (mandatory)
2.   Feature type (mandatory)
3.   Feature qualifier (optional)
4.   Feature name (optional)
5.   Standard gene name (optional)
6.   Alias (optional, multiples separated by |)
7.   Parent feature name (optional)
8.   Secondary SGDID (optional, multiples separated by |)
9.   Chromosome (optional)
10.  Start_coordinate (optional)
11.  Stop_coordinate (optional)
12.  Strand (optional)
13.  Genetic position (optional)
14.  Coordinate version (optional)
15.  Sequence version (optional)
16.  Description (optional)
 

b) Count total genes
c) Count only verified genes. Count only uncharacterized genes.
d) What other types of genes are in this file? For this, you may want to use the sort command with the -u flag, which will sort the input alphabetically, then take only unique lines.

3) Star-struck
From the same directory as your fasta files, see if you can predict what each of these commands will do (then try it)
a) head *
b) head *.fsa
c) head chr1*.fsa
d) head chr1*
e) head chr*1.fsa
f) head chr*1
g) grep 'S288C' *
h) grep 'S288C' *.fsa
i) grep 'BK006935.2' *
j) cat * | grep 'BK006935.2' (what's the difference in the output between this one and the last one?)
k) head *.fsa | grep 'chr'
l) head *.fsa | grep 'chromosome' (what's the difference in the output between this one and the last one?)


4) Building a pipeline
a) Download the files below into your S1.1 directory. These are detected terminator sequences in the E. coli genome (using the program GeSTer, if you're curious).
b) The command grep '/G=[^ ]*' somefile will find all lines that match /G=somegenename, where somegenename is a sequence of non-blank characters. Read the output of man grep and figure out how to print only /G=somegenename, rather than the whole line.
c) Pipe the results of part b) through a cut command to get only everything after the =
d) Store the results of part c) in a file named "terminated_genes.txt"
e) BONUS: google for a Unix command that only keeps each gene once, rather than once per annotated terminator.




5) Moving beyond the lecture
a) Use google and any other references you want to find a command that tells you how much disk space you have left.
b) Use the 'man' command to see how it works.
c) How much space is left on your system? Make the command report sizes in gigabytes and megabytes, i.e. in 'human-readable' form.

Solutions



1) Cerevisiae Chromosomes
a, b) Just do it!
c)
$ cat *.fsa > cerevisiae_genome.fasta
d)
$ grep -c '>' cerevisiae_genome.fasta
e)
$ grep -v '>' cerevisiae_genome.fasta | wc
202628 202628 12359733
Then, subtract 202628 (the number of "newline" characters) from 12359733, which leaves 12157105 bases. In short, still about 12 megabases.
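
If you'd rather let the shell do the subtraction, one option is to delete the newlines before counting. Here is a minimal sketch on a made-up two-record FASTA file (the file name and sequences are invented for illustration), assuming a bash-like shell:

```shell
# Build a tiny two-record FASTA file (made-up sequences) in a scratch directory.
workdir=$(mktemp -d)
printf '>chrA\nACGT\nAC\n>chrB\nGGG\n' > "$workdir/toy.fasta"

# Number of records: one header line (starting with '>') per sequence.
n_records=$(grep -c '^>' "$workdir/toy.fasta")

# Number of bases: drop header lines, delete the newlines, count what is left.
# The final tr strips the padding that some versions of wc add.
n_bases=$(grep -v '^>' "$workdir/toy.fasta" | tr -d '\n' | wc -c | tr -d ' ')

echo "records: $n_records"   # records: 2
echo "bases: $n_bases"       # bases: 9
```

The same pipeline should work on cerevisiae_genome.fasta, with no subtraction needed afterwards.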

2) Cerevisiae Genes
a) Do it!
b)
$ grep -c ORF SGD_features.tab
6653
c)
$ grep ORF SGD_features.tab | grep Verified | wc -l
5097

$ grep ORF SGD_features.tab | grep Uncharacterized | wc -l
724

d)
$ cut -f 3 SGD_features.tab | sort -u

Dubious
Uncharacterized
Verified
Verified|silenced_gene
silenced_gene
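
One caveat about the counts in b) and c): grep matches "ORF" anywhere on a line, not just in the feature type column. Whether that changes the numbers for the real file is something to check for yourself; the sketch below only shows the difference on a made-up, four-column table (all IDs and names are invented), using awk to test specific columns.

```shell
# Build a small tab-separated table: ID, feature type, qualifier, name.
workdir=$(mktemp -d)
tab="$workdir/toy_features.tab"
printf 'S1\tORF\tVerified\tYFG1\n'          >  "$tab"
printf 'S2\tORF\tUncharacterized\tYFG2\n'   >> "$tab"
printf 'S3\tORF\tDubious\tYFG3\n'           >> "$tab"
printf 'S4\ttRNA_gene\t\ttoy_tRNA\n'        >> "$tab"
printf 'S5\tORF\tVerified\tYFG4\n'          >> "$tab"
printf 'S6\tARS\t\tnear_ORF_YFG4\n'         >> "$tab"

# grep counts every line that mentions ORF, including S6 (an ARS).
n_grep=$(grep -c ORF "$tab")

# awk can require that column 2 is exactly "ORF".
n_orf=$(awk -F '\t' '$2 == "ORF"' "$tab" | wc -l | tr -d ' ')

# Two conditions at once: an ORF whose qualifier (column 3) is "Verified".
n_verified=$(awk -F '\t' '$2 == "ORF" && $3 == "Verified"' "$tab" | wc -l | tr -d ' ')

echo "$n_grep $n_orf $n_verified"   # 5 4 2
```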


3) Star-struck
a) This prints the first 10 lines of every file in the directory
b) This prints the first 10 lines of every file in the directory that ends with .fsa (but not .fasta)
c, d) The first 10 lines of every file in the directory whose filename starts with chr1 (chr10.fsa, chr11.fsa, etc.)
e) The first 10 lines of chr01.fsa and chr11.fsa
f) Nothing! There is no file that starts with chr and ends with 1
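g, h) Both print every line containing S288C, prefixed with the name of the file it came from. Note again that cerevisiae_genome.fasta does not end with .fsa, so it is searched in g) but not in h)!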
i, j)
$ grep 'BK006935.2' *
cerevisiae_genome.fasta:>tpg|BK006935.2| [organism=Saccharomyces cerevisiae S288c] [strain=S288c] [moltype=genomic] [chromosome=I] [note=R64-1-1]
chr01.fsa:>tpg|BK006935.2| [organism=Saccharomyces cerevisiae S288c] [strain=S288c] [moltype=genomic] [chromosome=I] [note=R64-1-1]
$ cat * | grep 'BK006935.2'
>tpg|BK006935.2| [organism=Saccharomyces cerevisiae S288c] [strain=S288c] [moltype=genomic] [chromosome=I] [note=R64-1-1]
>tpg|BK006935.2| [organism=Saccharomyces cerevisiae S288c] [strain=S288c] [moltype=genomic] [chromosome=I] [note=R64-1-1]

Note that in the first one, grep puts the name of the file where it found the match at the beginning of each line, whereas in the second, once the files have been cat'ed together, the filename goes away.
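
Here is the same effect on two made-up files, as a minimal sketch assuming a bash-like shell. It also shows the -h flag, which (in both GNU and BSD grep) drops the filename prefix without needing cat.

```shell
# Two tiny made-up FASTA files in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>seq1 toy\nACGT\n' > a.fsa
printf '>seq2 toy\nGGCC\n' > b.fsa

# Several files: each matching line is prefixed with "filename:".
with_names=$(grep 'toy' *.fsa)

# Through cat: grep sees one anonymous stream, so there is no prefix.
without_names=$(cat *.fsa | grep 'toy')

# -h suppresses the prefix while grep still reads the files directly.
suppressed=$(grep -h 'toy' *.fsa)

echo "$with_names"      # a.fsa:>seq1 toy, then b.fsa:>seq2 toy
echo "$without_names"   # >seq1 toy, then >seq2 toy
```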

k, l)
==> chr01.fsa <==
>tpg|BK006935.2| [organism=Saccharomyces cerevisiae S288c] [strain=S288c] [moltype=genomic] [chromosome=I] [note=R64-1-1]
==> chr02.fsa <==
>tpg|BK006936.2| [organism=Saccharomyces cerevisiae S288c] [strain=S288c] [moltype=genomic] [chromosome=II] [note=R64-1-1]
==> chr03.fsa <==
>tpg|BK006937.2| [organism=Saccharomyces cerevisiae S288c] [strain=S288c] [moltype=genomic] [chromosome=III] [note=R64-1-1]
==> chr04.fsa <==
>tpg|BK006938.2| [organism=Saccharomyces cerevisiae S288c] [strain=S288c] [moltype=genomic] [chromosome=IV] [note=R64-1-1]
==> chr05.fsa <==
>tpg|BK006939.2| [organism=Saccharomyces cerevisiae S288c] [strain=S288c] [moltype=genomic] [chromosome=V] [note=R64-1-1]

...

Note that when you do head on multiple files, it includes the name of each file in ==> <==. When grep'ing for chr, you find the chr in the name of the file, as well as the chr inside the file, whereas chromosome only matches inside the file.
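
If a star pattern ever surprises you, a handy trick is to hand it to echo first: the shell expands the pattern before the command runs, so echo shows you exactly which filenames the command would receive. A minimal sketch with made-up files, assuming bash with its default settings:

```shell
# Five tiny made-up files in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
for name in chr01.fsa chr02.fsa chr10.fsa chr11.fsa genome.fasta; do
    printf '>toy [chromosome=I]\nACGT\n' > "$name"
done

# echo shows what the shell expands each pattern to before the command runs.
starts_with_chr1=$(echo chr1*.fsa)
ends_with_1=$(echo chr*1.fsa)
echo "$starts_with_chr1"   # chr10.fsa chr11.fsa
echo "$ends_with_1"        # chr01.fsa chr11.fsa
echo chr*1                 # nothing matches, so bash passes the pattern through unchanged

# head labels each file with a "==> name <==" line, and grep sees those lines too.
n_chr=$(head -n 1 *.fsa | grep -c 'chr')
n_chromosome=$(head -n 1 *.fsa | grep -c 'chromosome')
echo "$n_chr $n_chromosome"   # 8 4
```

With four .fsa files, 'chr' matches the four filename labels plus the four header lines, while 'chromosome' matches only the header lines.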


4) Building a Pipeline
a) Something like the following works:
$ cp Downloads/palins* PythonCourse/S1.1
b)
$ grep -o '/G=[^ ]*' palins*
c)
$ grep -o '/G=[^ ]*' palins* | cut -d '=' -f 2
d)
$ grep -o '/G=[^ ]*' palins* | cut -d '=' -f 2 > terminated_genes.txt
e)
$ grep -o '/G=[^ ]*' palins* | cut -d '=' -f 2 | sort -u > terminated_genes.txt
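
To see what each stage of this pipeline contributes, it can help to run it on a few lines you can read at a glance. The sketch below uses an invented input file; the line layout is an assumption made up for illustration and is not the real GeSTer output format.

```shell
# A made-up input file: three annotated terminators and one line without a gene.
workdir=$(mktemp -d)
cat > "$workdir/toy_terminators.txt" <<'EOF'
term1 /G=thrA /S=+ score 12
term2 /G=thrA /S=- score 9
term3 /G=lacZ /S=+ score 15
no gene annotation on this line
EOF

# Stage 1: -o keeps only the matching part of each line (/G=thrA, ...).
# Stage 2: cut splits on '=' and keeps the second field (thrA, ...).
all_genes=$(grep -o '/G=[^ ]*' "$workdir/toy_terminators.txt" | cut -d '=' -f 2)

# Stage 3: sort -u sorts the names and keeps each one once.
unique_genes=$(printf '%s\n' "$all_genes" | sort -u)

echo "$all_genes"      # thrA, thrA, lacZ (one per line)
echo "$unique_genes"   # lacZ, thrA (one per line)
```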

5) Moving beyond the lecture
Google is your friend! I searched for "unix disk space" and clicked on the first result...

$ df -h
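
By default df lists every mounted filesystem, which can be a lot to read. A small variation, sketched here assuming a GNU or BSD df, is to pass a directory so that only the filesystem holding it is reported:

```shell
# -h prints sizes with unit suffixes (K, M, G) instead of raw block counts.
# The trailing "." limits the report to the filesystem holding the current directory.
report=$(df -h .)
echo "$report"
```

The available space is usually in the column labelled "Avail" or "Available", though the exact header differs between systems.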