Introduction

Post on 22-Dec-2015

Page 1: Introduction. Why Learn Computer Programming? General data processing. You have ideas that can’t be carried out by hand, and no one else has done for

Introduction

Why Learn Computer Programming?

• General data processing. You have ideas that can’t be carried out by hand, and that no one else has done for you.
  – This is especially true with graphics: you can draw images of your data.
• Changing file formats. Every bioinformatics program expects input files to be in a certain format, and produces output in another format. And, the formats are all different. You can easily solve this problem with a bit of programming.
• Running other people’s programs automatically.
  – Many bioinformatics programs are only written for the Linux operating system and are not available on the web, for Windows, or for Macs.
  – If you need to run the same program repeatedly with small variations, writing a program to do it is far easier than endlessly pointing, clicking, copying, and pasting.

Why Perl?

• There are many computer languages, so why learn this one?
• It is very good for string handling: DNA sequences, for example. Also large data files.
• It has several convenient features: don’t have to worry about memory allocation, no separate compiling step, many useful built-in functions.
• It has a large community of users, especially in bioinformatics.
  – CPAN
  – BioPerl

Unix

• In the consumer world, the two main operating systems are Windows and Macintosh.
• Most scientific computing is done with the Unix/Linux operating system.
• Unix was created by Ken Thompson and Dennis Ritchie at Bell Laboratories in 1969-1970 (Ritchie also created the C language).
  – It made the shared use of computers easy and automatic: multi-tasking, timesharing.
• Unix proved to be both very useful and very easy to port to many types of computer. Many variants were spawned, both academic (esp. BSD, from UC Berkeley) and commercial.
• In the early 1980s, AT&T attempted to commercialize Unix. Due to various legal agreements, it proved possible to develop open source Unix-like variants that were free of commercial restrictions.
  – Much development, esp. of useful commands, was done by the GNU Project, started by Richard Stallman. This is the beginning of the free and open source software movement.

Linux

• In the early 1990s Linus Torvalds (who was just a hobbyist at the time) wrote a Unix-like kernel that, together with important GNU programs (esp. gcc, the C compiler), made up a complete system free of proprietary Unix code.
  – Linux
  – For various important and/or random reasons, it caught on. It is now very widely used in the open source community.
• An active community continues to develop Linux, and many minor variants exist.
• Attacks by various groups that want to collect patent royalties on Linux have occurred, but so far Linux has remained freely available to anyone.

Biolinx Access

• Our departmental Linux server is “biolinx”, or more officially biolinx.bios.niu.edu.
  – Its IP number is 131.156.41.4 (131.156 is NIU, .41 is BIOS). This number will work even if the Domain Name System (DNS) is not functioning.
• We are going to use the command line and not a GUI.
• To access it, you need to use an SSH (Secure Shell) client, which allows communication with the biolinx server using a different communication protocol and port than HTTP (web browser).
  – SSH is a replacement for telnet. SSH encrypts all information, including your password (unlike telnet).
• We generally use the PuTTY SSH program, which can be downloaded at: http://www.chiark.greenend.org.uk/~sgtatham/putty/
  – Just download and run the putty.exe binary. Not sure what to do about Macs… try Google.
  – The first time you try to contact biolinx, it will give you a message about encryption. As in all public key encryption schemes, your local computer and the server need to trade keys (which are built from very large prime numbers). So, accept this message, and all will be well.

Directory Structure

• Files on Unix systems are stored in a tree-like hierarchy.
• Two very useful commands:
  – pwd (“print working directory”) prints the current directory.
  – cd (“change directory”) moves you through the file structure.
• The root directory is called “/”.
  – Under / are several sub-directories: /home, /var, /etc, /lib, etc. For the purposes of this class, and most uses you will have, the most important are /home and /srv.
  – Your home directory (where you are when you log in) is a sub-directory of /home, based on your ZID: /home/z123456.
  – Another useful directory: /home/share/bios646, and its sub-directory /home/share/bios646/spr12. These are our BIOS 646 class directories.
• Try entering the pwd command (hit Enter after you type in the letters). The system returns something like /home/z123456, your current directory.
  – The command “cd”, with no arguments, always brings you to your home directory: useful if you get lost in the system.
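The pwd and cd behaviour described on this slide can be tried without touching a real account. This is a minimal sketch: the throwaway directory made by mktemp, and the projects/data sub-directories, are assumptions standing in for a real home directory such as /home/z123456.

```shell
# Assumption: a temporary directory plays the role of the home directory.
export HOME="$(mktemp -d)"
mkdir -p "$HOME/projects/data"     # hypothetical sub-directories

cd "$HOME/projects/data"           # move somewhere deep in the tree
pwd                                # prints the full path of the current directory
here=$(pwd)

cd                                 # no arguments: straight back to the home directory
pwd
back=$(pwd)
```

On biolinx itself you would skip the first two lines and simply type pwd and cd at the prompt.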

Moving Around

• The list of sub-directories leading to where you are is called the “path”.
• The path can be relative or absolute.
  – Absolute paths always start with /. They list the sub-directories starting with the root (/) and ending with where you are. Ex: /home/z123456
  – Relative paths never start with /. They list the sub-directories starting from where you are and ending where you want to be.
• The cd command can use either relative or absolute paths.
  – For example, if you were in /home and wanted to get to the current course directory, you could either use an absolute path: cd /home/share/bios646/spr12, or you could use a relative path: cd share/bios646/spr12.
• One more trick: moving up a level is done with the command cd .. (that is, “..” refers to one level above where you are).
  – E.g. you are in /home/share/bios646/spr12 and you want to get to your home directory using a relative path: cd ../../../z123456 (then use “pwd” to be sure you got there).
• A single dot “.” refers to your current directory.
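The difference between relative and absolute paths can be sketched in a scratch copy of the directory layout used above. The temporary directory stored in top is an assumption standing in for /home; the sub-directory names simply mirror the ones on the slide.

```shell
# Assumption: "$top" is a throwaway stand-in for /home.
top=$(mktemp -d)
mkdir -p "$top/share/bios646/spr12" "$top/z123456"

cd "$top"
cd share/bios646/spr12             # relative path: starts from where you are
rel=$(pwd)

cd "$top/share/bios646/spr12"      # absolute path: starts from the root, /
abs=$(pwd)

cd ../../../z123456                # up three levels, then down into z123456
pwd                                # confirm where you ended up
home_dir=$(pwd)
```

Both cd commands land in the same directory; only the starting point of the path differs.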

Listing Files

• ls means “list files”. The name of every file is displayed, by default in alphabetical order (well, ASCII order, actually).
• More useful is ls -l. This lists the file names and a lot of useful information about them: file permissions, ownership, size, when they were last modified.
  – What if the list scrolls off the top of the screen? Try ls -l | more. The “| more” part is pronounced “pipe more”. The pipe takes the output of ls -l and sends it to another command, “more”. More gives you a page of material at a time: you can advance it with the space bar.
  – What if you want to see the most recently modified files first? ls -lt is the command to use: it sorts the files by modification time, newest first. Use ls -lt | more if there are too many to fit on one screen.
  – Many more options for the ls command can be found if you type “man ls”. “man” stands for manual: you can get the same info by Googling “man ls”, if you like.
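A small sketch of the ls variations above, run in a scratch directory. The three file names and their timestamps are made up for illustration. Setting LC_ALL=C is an assumption added here to force plain ASCII ordering, since many systems sort names by locale instead.

```shell
dir=$(mktemp -d)                   # throwaway directory for the demo
cd "$dir"
touch -t 202001010000 old.txt      # timestamps set by hand so the order is predictable
touch -t 202101010000 Mid.txt
touch -t 202201010000 new.txt

ls                                 # names only
ls -l                              # permissions, owner, size, modification time
ls -lt                             # same listing, newest file first
# At a terminal, add "| more" to page through a long listing: ls -lt | more

first_ascii=$(LC_ALL=C ls | head -n 1)   # ASCII order puts capital letters first
newest=$(ls -t | head -n 1)
echo "first in ASCII order: $first_ascii, most recent: $newest"
```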

Copying, deleting, moving files

• cp source destination is the syntax for the copy command. “source” is the name of the file you want to copy, and “destination” is the name you want to give the copy.
  – If your source or destination is in another directory, you have to include the path.
  – If you want to copy a file from one directory to another without changing the name, just type in the path of the destination directory.
  – If you want to copy a file from another directory to the directory you are in, and not change the name, use “.” to indicate the current directory. Ex: “cp /home/share/bios646/BLOSUM62 .” will copy the BLOSUM62 matrix (used in BLAST) to your current directory.
• rm filename is the syntax for removing (deleting) a file. Please note that this is a permanent thing: there is no undo command.
• mv source destination moves a file from one place to another, or renames the file. The effect is the same as a cp command followed by an rm command.
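The three commands on this slide can be practised on a dummy file. In this sketch the shared and mine directories are hypothetical, and the BLOSUM62 file holds placeholder text, not the real matrix.

```shell
work=$(mktemp -d)                          # throwaway area for the demo
mkdir "$work/shared" "$work/mine"
echo "placeholder matrix" > "$work/shared/BLOSUM62"

cd "$work/mine"
cp ../shared/BLOSUM62 .                    # copy here, keeping the name ("." = current directory)
cp BLOSUM62 backup.txt                     # copy under a new name
mv backup.txt matrix_copy.txt              # rename the copy
rm matrix_copy.txt                         # delete it: there is no undo

ls                                         # only BLOSUM62 is left
remaining=$(ls)
```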

File Permissions

• The cryptic letters and dashes on the left side of an ls -l listing are file permissions.
  – They are a group of 10 characters.
  – Examples: -rw-r--r--   -rwxr-xr-x   drwxr-xr-x
• Break them up into 4 groups: the first symbol, then three groups of 3.
• The first symbol is a ‘-’ if it refers to a file, a ‘d’ if it refers to a directory below the one you are in, and an ‘l’ if it refers to a link (symbolic link) to a file in another directory.
• The three groups of 3 characters are permissions to read (r), write (w), and execute (x) the file. If the letter is present, permission is granted; if the letter is replaced by a ‘-’, permission is NOT granted. Thus a file with r-x permissions would allow reading and execution, but not writing.
• The three groups are: the file owner (listed to the right), the group (usually linxgrp here: we don’t do much with groups), and everyone else. Thus, a set of permissions like rwxr-xr-- gives the file owner permission to read, write and execute the file, while members of the group can read and execute the file, and everyone else can merely read it.
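To see the 10-character permission string in action, the sketch below creates one plain file, one directory and one symbolic link, then picks the string apart with cut. All names are made up for the demo. The chmod command (covered on the next slide) is used only to fix the file's permissions at a known value.

```shell
d=$(mktemp -d)                     # throwaway directory
cd "$d"
touch plain.txt                    # an ordinary file
mkdir subdir                       # a directory
ln -s plain.txt shortcut           # a symbolic link
chmod 644 plain.txt                # known starting permissions: rw-r--r--

ls -l                              # the full listing

# First character: file type
file_type=$(ls -ld plain.txt | cut -c1)
dir_type=$(ls -ld subdir | cut -c1)
link_type=$(ls -ld shortcut | cut -c1)

# Characters 2-10: three groups of three (owner, group, everyone else)
owner=$(ls -ld plain.txt | cut -c2-4)
group=$(ls -ld plain.txt | cut -c5-7)
other=$(ls -ld plain.txt | cut -c8-10)
echo "types: $file_type $dir_type $link_type  owner=$owner group=$group other=$other"
```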

Changing Permissions

• To modify a file you need write permission, and to run a program (which is just a file), you need execute permission.
• chmod is the change permissions command. There are several versions of the syntax; I usually use 3 octal (base 8) digits following chmod, followed by the file name. Ex: chmod 755 test.pl.
  – The three octal digits refer to the three ownership groups: user, group, everyone else.
  – Each digit is the sum of 4 for read permission, 2 for write permission, and 1 for execute permission. Thus, 6 gives read and write permission and 7 gives read, write, and execute permission.
  – chmod 755 test.pl gives the owner all three permissions, while the group and everyone else have only read and execute permissions.
• By default, a newly created file has 644 permissions: the owner can read and write it, and others can only read it.
• To make a program file executable, you will need to chmod it to 755.
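The octal arithmetic can be checked directly in the shell. This sketch builds the mode 755 from the 4/2/1 values and applies it to a temporary file, which is an assumption standing in for test.pl.

```shell
# Each digit is a sum: 4 = read, 2 = write, 1 = execute
owner=$((4 + 2 + 1))               # read + write + execute = 7
group=$((4 + 1))                   # read + execute = 5
other=$((4 + 1))                   # read + execute = 5
mode="$owner$group$other"

f=$(mktemp)                        # stand-in for test.pl
chmod "$mode" "$f"

perms=$(ls -l "$f" | cut -c1-10)   # first 10 characters of the long listing
echo "chmod $mode gives $perms"
```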

Creating and Destroying Directories

• You need to own the directory before you can create a new sub-directory or destroy it.
  – If you cd /home and run ls -l, you will see the permissions for your home directory. As owner, you have rwx permissions, so you can write to it as well as execute commands such as the following:
• To create a subdirectory, use mkdir name. You can then cd into it, and use ls -l to see that it is empty.
• To remove a sub-directory, it must first be empty: use rm to remove any files (rm * will remove them all: “*” is a wildcard, it stands for anything). Then cd up one level and do rmdir dir_name to remove the sub-directory.
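The create, fill, empty, remove cycle looks like this in practice. The directory name scratch and the two file names are made up, and everything happens inside a temporary directory so that rm * cannot touch anything real.

```shell
base=$(mktemp -d)                  # throwaway parent directory
cd "$base"

mkdir scratch                      # create the sub-directory
cd scratch
touch a.txt b.txt                  # put two files in it
cd ..

# rmdir refuses to remove a directory that still has files in it
rmdir scratch 2>/dev/null || echo "rmdir refused: scratch is not empty"

cd scratch && rm * && cd ..        # empty it ("*" matches every file), then go back up
rmdir scratch                      # now it works
```

Chaining the commands with && means rm * only runs if the cd succeeded, which is a cheap safeguard against deleting files in the wrong directory.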

Looking at Files

• If you want to read a file, the easiest way is with the “more” command or the “less” command. Ex: more BLOSUM62 shows you what is in the BLOSUM62 file, one screen at a time.
  – Use the space bar to move down a whole screen at a time, and the Enter key to move down one line at a time.
• less works better, because the Page Up and Page Down keys work (they don’t work with more, and it is hard to get more to go backwards in the file).
  – But less sometimes blocks your view of the last line.
  – To get out of a file viewed with less, type q.
• cat spews out the whole file at once. You can actually do cat filename | more or cat filename | less, but this seems a bit silly.

Editing Files

• This is, oddly enough, a real source of social status and coolness in the Unix world.
  – The really cool people edit files with emacs.
  – Less cool people use vi (or its improved version vim).
  – Peasants and other scum use pico or nano. (nano is just an open source version of pico, which got caught up in some copyright dispute.)
  – I personally nearly always edit my files on my local Windows machine using a text editor (not MS Word!) and copy the file to biolinx. This way, I always have a backup copy.
  – For this reason, I have never learned emacs or vi, and I just edit programs with pico when I do it directly on biolinx.
  – And, that’s what I will teach here. So, to avoid smug looks from computer professionals, I suggest you might want to learn the vi commands.

Pico

• The command “pico” will invoke the program, which is a very primitive WYSIWYG text editor.
  – Or “nano” if pico doesn’t work.
  – pico filename opens a particular file for editing; otherwise, you can write to a specific file when you save it.
• To start, just type some text.
• The arrow keys can be used to move the cursor to whatever location you want, even though you can’t point and click.
  – Delete and backspace work also.
• To cut a line, use control-k. To paste, use control-u. You can cut or paste multiple lines by repeating the command.
• Control-y and control-v move you up and down whole pages, but so do the Page Up and Page Down keys.
• To write to a file, use control-o. It will then ask for a file name. You then need to hit Enter to save.
  – It will tell you if you don’t have write permission. Try saving it to your home directory in that case.
• To exit and return to the command line prompt, use control-x.
• There are other commands as well: see them all with control-g.

File Transfer

• If you want to send a file to biolinx, or get a file out of biolinx (say, to print it), you need a file transfer program.
  – In the old days we used an FTP program (and still do on some machines), but like telnet, FTP transmits passwords in plain text.
  – Now we use a secure FTP program. For Windows, I use WinSCP (obtained for free at http://winscp.net/eng/index.php).
• When you run WinSCP, you get two windows. On the left is the local directory you are in, and on the right is the remote (biolinx) directory. You need to navigate both sides to get where you want to be.
• Transferring a file is just a matter of clicking on it (to highlight it), then hitting F5. If there is already a copy present, it will ask permission to overwrite it.
• Once you have the file locally, you can open it with a text editor and print it just like any other file.
• Be aware that Unix file names are case-sensitive, while Windows file names are not. juNK.pl and junk.pl are different files on biolinx!
• Another issue: line endings on Windows, Macs, and Unix are all different. Usually this is handled automatically, but if it's a problem, transfer the file as text, not binary and not default. This can cause real hair-tearing problems if you are not aware of it.
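If a file does arrive with Windows line endings, the extra carriage-return characters can also be stripped after the fact. This is not the text-mode transfer described above but a different, commonly used workaround with the standard tr command. The sample file and its two lines are made up for the sketch.

```shell
f=$(mktemp)                                    # stand-in for a file transferred in binary mode
printf 'line one\r\nline two\r\n' > "$f"       # Windows style: each line ends in CR + LF

tr -d '\r' < "$f" > "$f.unix"                  # delete every carriage return (CR)

before=$(wc -c < "$f")
after=$(wc -c < "$f.unix")
removed=$((before - after))                    # one CR per line was removed
echo "removed $removed carriage returns"
```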

Running Executable Files

• First, be sure you have execute permission.
  – If the file doesn’t have execute permission, you will get a message like “Permission denied”. Use chmod 755 filename to fix this. (A message like “command not found” means the system could not find the file at all.)
• Then, just type in the file name and hit Enter.
• On many machines, you will have to specify that you want the file present in your current directory. Otherwise, biolinx will hunt for it only in a standard list of locations (your PATH), which may not include the current directory.
  – This is done using “./”, which is the path of the current directory. Thus the program test.pl could be executed with the command ./test.pl.
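The permission and “./” steps can be tried with a tiny program. To keep this sketch self-contained it uses a two-line shell script called hello.sh (a made-up name) instead of a Perl program like test.pl; the steps are the same either way.

```shell
dir=$(mktemp -d)                   # throwaway directory
cd "$dir"

# A hypothetical two-line program; the first line says which interpreter runs it
printf '#!/bin/sh\necho "hello from a script"\n' > hello.sh

# Freshly created files are not executable, so this first attempt fails
./hello.sh 2>/dev/null || echo "cannot run yet: no execute permission"

chmod 755 hello.sh                 # grant execute permission
./hello.sh                         # "./" = look in the current directory
out=$(./hello.sh)
```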

Sorcerer's Apprentice Problem

• Normally, an executable file (a program) runs, and when it finishes the operating system returns a command line prompt. A program that is working well often produces no obvious behavior other than getting that prompt sign back.
• Sometimes the program looks like it will take forever to finish: no command line prompt appears no matter how long you wait.
  – It is easy to write a program containing an infinite loop that will never exit. And, it is theoretically impossible for the computer system to know in advance, in general, if this will happen.
• First solution: hit control-c. This ends many programs instantly, returning you to the command line prompt.
• If that doesn't work, you will need to start a second SSH session with biolinx. Once you have done this, you can see all the programs you have running with the command ps -fu z123456 (using your own zid, of course). You will see a list of programs, and one column is labeled PID (NOT PPID: that’s the ID of the parent of the process you care about). That is the process ID number. Once you have the PID for the program that is running amok, enter the command kill PID (using the actual PID). This will normally cause it to die immediately; if it survives, kill -9 PID forces it to stop.
• “q” will get you out of many file viewers (less, more, and man pages, for example).
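The ps and kill steps can be rehearsed on a harmless stand-in. In this sketch a long sleep command plays the part of the runaway program, and id -un supplies the current user name in place of a zid; both are assumptions for the demo.

```shell
sleep 300 &                        # stand-in for a program stuck in an infinite loop
pid=$!                             # the shell remembers the PID of the last background job
echo "runaway process has PID $pid"

# List your own processes; the PID column holds the number to kill
ps -fu "$(id -un)" || echo "ps is not available on this system"

kill "$pid"                        # ask the process to terminate
status=0
wait "$pid" 2>/dev/null || status=$?   # collect its exit status (non-zero: it was killed)

# kill -0 sends no signal; it only tests whether the process still exists
if kill -0 "$pid" 2>/dev/null; then still_running=yes; else still_running=no; fi
echo "still running: $still_running (exit status $status)"
```

On biolinx you would read the PID from the ps listing in the second SSH session, since the stuck program is tying up the first one.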

A couple of useful tricks

• If you have a program that is going to run for a long time, and you want to close your SSH link or turn off your computer, you can run the program in the background using the nohup (“no hangup”) command.
  – Syntax: nohup your_program &
  – Once this command has been entered, you can turn your local computer off, or do something else with the interface.
  – The output from your program will be appended to a file called nohup.out.
• Lengthy outputs from a program can be re-directed from the terminal window to a file, using the file re-direction symbol “>”.
  – Thus, your_program > output_file will cause everything your program produces to go into that file instead of to the screen.
  – However, error messages will still appear on the screen. You can get them into a file with >&. (Also try 1> and 2>).
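A short sketch of the redirection variants. Here your_program is a made-up shell function that writes one line of normal output and one error message, so the two streams can be told apart. The spelling 2>&1 is used for “errors into the same file” because it works in every Bourne-style shell, which is an assumption about which shell you are running. The output of nohup is redirected explicitly, since nohup.out is only created when the output would otherwise go to the terminal.

```shell
cd "$(mktemp -d)"                  # throwaway directory for the output files

your_program() {                   # hypothetical stand-in for a real program
    echo "result line"             # normal output (stream 1)
    echo "error line" >&2          # error message (stream 2)
}

your_program > output_file 2> error_file   # 1> (or just >) and 2> split the two streams
your_program > both_file 2>&1              # both streams into a single file

# nohup keeps a job alive after you log out; & puts it in the background
nohup sh -c 'echo "long job finished"' > long_run.out 2>&1 &
wait                               # only for this demo: wait for the background job to end

cat output_file error_file long_run.out
```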

Getting Information about a file

• cat is a command that prints out everything in the file. You can see it one page at a time with cat filename | more.
  – cat -n prints out the contents with line numbers attached.
• wc is “word count”, which gives the number of words (things delimited by spaces), as well as lines and characters in a file.
  – Try wc BLOSUM62
  – wc -l gives the number of lines, wc -w gives the number of words, and wc -m gives the number of characters.
• This bit of trickiness (courtesy of Gordon Pusch) is useful if you want to print a specific line based on its line number. To print line 352: head -n 352 filename | tail -n 1
  – The idea is that by default head prints the first 10 lines of a file, but the -n modifier changes that number: head -n 352 prints the first 352 lines.
  – Then, the output of head is sent (piped) to tail, which by default prints the last 10 lines of its input, but the number of lines printed can be modified by -n, in this case to 1 line. The last of the first 352 lines is line 352.
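The head and tail trick is easy to verify on a file whose lines announce their own numbers. The 500-line file generated here is made up for the sketch.

```shell
f=$(mktemp)                        # throwaway file
i=1
while [ "$i" -le 500 ]; do         # write "line 1" ... "line 500"
    echo "line $i"
    i=$((i + 1))
done > "$f"

lines=$(wc -l < "$f")              # number of lines in the file
words=$(wc -w < "$f")              # number of words (two per line here)

# head keeps lines 1-352; tail then keeps only the last of those
picked=$(head -n 352 "$f" | tail -n 1)
echo "file has $lines lines and $words words; line 352 reads: $picked"
```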

grep

• grep prints all lines that contain a given character or string (whatever is between quotes).
  – Syntax: grep 'desired string' filename
  – Ex: grep 'Entropy' BLOSUM62
• grep -n prints the line number before each line of output.
• grep -c counts the number of lines with that string. Very useful.
• grep -A2 'string' filename prints every line with that string, plus 2 lines (whatever number you put after the -A) after it.
• Similarly grep -B3 'string' filename prints 3 lines before each line that has your string in it.
  – You can use both -A and -B in the same command.
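The grep options above can be tried on a small made-up file. The six lines below are placeholders for illustration, not the contents of the real BLOSUM62 file.

```shell
f=$(mktemp)                        # throwaway file with six known lines
printf '%s\n' 'header' 'Entropy first mention' 'row A' 'row B' 'row C' 'Entropy second mention' > "$f"

grep 'Entropy' "$f"                # every line containing the string
grep -n 'Entropy' "$f"             # the same lines, prefixed by line number
grep -c 'Entropy' "$f"             # just the count of matching lines
grep -A2 'row A' "$f"              # the match plus the 2 lines after it
grep -B1 'row C' "$f"              # the match plus the 1 line before it

count=$(grep -c 'Entropy' "$f")
numbered=$(grep -n 'row B' "$f")
context=$(grep -A2 'row A' "$f" | wc -l)
```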