Introduction to UNIX and Perl
Todd Scheetz
Sept. 6, 2001
Computational Methods in Molecular Biology
Definitions
Operating System
• Provides a uniform interface between a computer’s hardware and user-level programs.
• Manages the low-level functionality of the hardware automatically.
Programming Language
• Provides a formal structure/syntax for implementing algorithmic procedures.
What is UNIX?
Operating system developed at Bell Labs.
• Originally written in assembly code
• The C programming language was designed to implement a more portable version of UNIX
Multi-user
Multi-tasking
What is UNIX? (part 2)
Made available with source code at no cost
• Could fix bugs, add features or just test alternative methods
• EXCELLENT for learning or teaching
Adopted by Berkeley to make BSD
• virtual memory
• paging
• networking (TCP/IP)
What is UNIX? (part 3)
By programmers, for programmers
• Extensive facilities to allow people to work together and share information in controlled ways
• Time-sharing system
Basic Guidelines
• Principle of least surprise
• Every program should do one thing and do it well
UNIX hierarchy
Adapted from Tanenbaum, p. 273
Layers from top to bottom, with the interface (i/f) between adjacent layers:
Users
    -- User i/f --
Std. Utility Programs (shell, editor, compiler)
    -- Library i/f --
Standard Libr. (open, close, fork, read, print, etc.)
    -- System call i/f --
UNIX O/S (process mgmt, memory mgmt, file system, I/O, etc.)
Hardware (CPU, memory, disks, keyboard, etc.)
UNIX Basics
User Accounts - required to log-on to the computer with username and password.
Groups - entity made up of one or more users.
Sharing...
[Figure: users Bob, Stacie, Diane, Mike and Bill assigned to group1 and group2]
UNIX Basics
File Sharing - Regulated by three sets of permissions.
Permissions: read (r), write (w), execute (x)
Subjects: owner/user (u), group (g), other (o); “all” (a) refers to all three at once
Examples:
-rwxr-xr-x foo.pl
-r-xr-xr-x bar.pl
-rw------- secret
-rw-r--r-- public
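As a rough illustration of how the three permission sets map onto the octal modes that chmod accepts, here is a minimal Perl sketch. The helper name perm_to_octal is made up for this example; it takes the nine permission characters of a listing, without the leading file-type character.

```perl
use strict;
use warnings;

# Convert a nine-character permission string (e.g. "rwxr-xr-x")
# into the octal mode that chmod accepts (e.g. "755").
# perm_to_octal is a hypothetical helper, not a built-in.
sub perm_to_octal {
    my ($perm) = @_;
    die "expected 9 characters of r, w, x or -\n"
        unless $perm =~ /^(?:[r-][w-][x-]){3}$/;
    my $octal = "";
    # Process the owner, group and other triplets in turn.
    foreach my $triplet ($perm =~ /(...)/g) {
        my $digit = 0;
        $digit += 4 if substr($triplet, 0, 1) eq "r";
        $digit += 2 if substr($triplet, 1, 1) eq "w";
        $digit += 1 if substr($triplet, 2, 1) eq "x";
        $octal .= $digit;
    }
    return $octal;
}

# The four example listings, minus the leading file-type character.
foreach my $perm ("rwxr-xr-x", "r-xr-xr-x", "rw-------", "rw-r--r--") {
    print "$perm = ", perm_to_octal($perm), "\n";
}
```

With this mapping, a command such as chmod 644 public would reproduce the last listing.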
UNIX Basics
Super-user account
• complete access to all files
Required for system administration tasks
• add accounts/groups
• change permissions/owners of any file
• change password of any account
• shut down a machine
UNIX Basics
UNIX Filesystem Hierarchy
/
    bin  etc  usr  var  tmp  dev  lib
/usr
    bin  doc  lib  local
/usr/local
    bin  etc  lib  tmp
Example paths:
/usr
/usr/bin
/usr/local
/usr/local/bin
Two shortcuts
.  - the current directory
.. - the directory one level “up”
What is UNIX?
Processes
Each program executes as a process
A process provides encapsulation for the program
Under UNIX, multiple processes can be running at the same time!
How to control processes:
^C -- break
^Z -- stop
& -- start in background
ps -- show which processes are running
kill -- kill a process
What is UNIX?
grep - show every line from a file that matches a supplied pattern
Ex. grep sub my_program.pl
(would return every line in the file that contained the string ‘sub’)
ls - list files
Ex. ls *.pl
(would list all files in the current directory that end in ‘.pl’)
head - list the first lines in a file
Ex. head -20 my_program.pl
(would show the first 20 lines from my_program.pl)
sort - sorts the lines of a file lexically
Ex. sort my_program.pl
What is UNIX?
UNIX also provides a method for chaining multiple programs together, so that the output of one program becomes the input of the next.
Pipes…
Ex. head -20 *.pl | grep File | sort
UNIX Basics
UNIX Command Summary
pwd - print working directory
cd - change directory
ls - list files
mv - move a file (relocate/rename)
rm - remove a file
cp - copy a file
mkdir - make a new directory
rmdir - remove a directory
more - display the contents of a file (one screen at a time)
chmod - change the permissions on a file
chgrp - change the group associated with a file
UNIX Shell
Shells
a.k.a. command interpreter
the primary user interface to UNIX
interpret and execute commands
1. Interactive use
2. Customization of UNIX session (environment)
3. Programmability
/bin/sh - Bourne shell
/bin/csh - C shell
/bin/bash - Bourne again shell
/bin/tcsh - modified, updated C shell
UNIX Shell
bash
prompt -- by default shows who you are, what machine the shell is running on, and what directory you are in.
PATH -- environment variable that defines where the shell should look for the programs you are running.
/bin
/usr/bin
/usr/local/bin
/usr/X11R6/bin
/usr/sbin
.
Installing Software
Pre-built vs. source
RPM vs. “raw” binaries
Process
• downloading
• extracting
• compiling
• installation
• configuration
Mini-Tour of UNIX
Go through the most common commands.
Perl
Basics of a Perl program under UNIX
Perl is an interpreted language
The first line of a Perl program (in UNIX) is...
#!/usr/bin/perl
The # character is the comment character.
All single-expression statements must end in a semi-colon.
$area = $pi * $radius * $radius;
while (CONDITION) {
    # some stuff
}
Programming Languages
Input/Output in Perl
Reading in from the keyboard...
$line = <STDIN>;
Filehandles...
File:
open(FH, "filename");     # open for reading
open(FH, ">filename");    # open for writing (replaces any existing contents)
...
$line = <FH>;
...
close(FH);
DO HELLO WORLD WALK-THROUGH.
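One possible shape for such a walk-through is sketched below: it writes a greeting to a file, reads it back through a second filehandle, and prints it. This is only an illustration, not the walk-through given in the lecture. It uses lexical filehandles and the three-argument form of open rather than the bareword FH style, and a scratch file from the core File::Temp module so that nothing existing is overwritten.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file so the example does not overwrite anything.
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
close($tmp_fh);

# Open the file for writing and store one line in it.
open(my $out, ">", $filename) or die "Cannot write $filename: $!";
print $out "Hello, world!\n";
close($out);

# Re-open the same file for reading and fetch the line back.
open(my $in, "<", $filename) or die "Cannot read $filename: $!";
my $line = <$in>;
close($in);

print $line;
```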
Programming Languages
Data Types
Integer - 0, 1, 2, …, 1000, 1001, …
Floating Point - 0.0, 0.001, 0.0003, 3.14159265, …
Character - a, b, c, d, …, 0, 1, 2, :, !, …
Different languages use different conventions. In Perl, a string is also a basic data type. A string is a sequence of 0 or more characters.
Programming Languages
Variables - Pieces of data stored within a program. (similar to variables in arithmetic)
Scalar variables are distinguished by the ‘$’ at their front.
Any name beginning with a letter is allowed
$a
$a1
$alphabet_soup_is_OK_to_me
Programming Languages
Arithmetic Operations
+   Addition
-   Subtraction
*   Multiplication
/   Division
%   Modulo
++  Increment
--  Decrement
||  Logical OR
&&  Logical AND
!   Logical Negation
Programming Languages
Arithmetic Operations
Numeric  String  Meaning
==       eq      Equality
!=       ne      Inequality
>        gt      Greater than
>=       ge      Greater than or equal to
<        lt      Less than
<=       le      Less than or equal to
Programming Languages
Statements
A program can be broken down into basic structures called statements. Statements are terminated by a semi-colon.
print "Hello, world!\n";
Assignment statements use a single ‘=’ rather than the ‘==’ of the equality operation.
$pi = 3.1415926;
$area = $pi * $radius * $radius;
$line = <STDIN>;
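To make the distinction concrete, the sketch below assigns two values with = and then compares them with both the numeric == and the string eq. The variable names are chosen only for this illustration.

```perl
use strict;
use warnings;

# Assignment: a single = stores a value in a variable.
my $x = "10";
my $y = "10.0";

# == compares the two values as numbers: 10 and 10.0 are equal.
my $numeric_equal = ($x == $y) ? 1 : 0;

# eq compares them character by character: "10" and "10.0" differ.
my $string_equal = ($x eq $y) ? 1 : 0;

print "numeric: $numeric_equal, string: $string_equal\n";
```

Choosing == for numbers and eq for text matters most when values are read from a file, since every line arrives as a string.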
Programming Languages
Variable Types
Scalar - a single value
Array - a list of values (indexed by sequential number)
Hash - a set of key,value pairs
Prime Numbers = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, …)
Array: 0 → 2, 1 → 3, 2 → 5, 3 → 7, ...
Hash: First → 2, Second → 3, Third → 5, Fourth → 7, ...
Programming Languages
Arrays are good when the data is dense, and the algorithm uses a linear access pattern.
Prime Numbers = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, …)
Indexed by position (dense):
index: 0  1  2  3  4   5   6   7   8   9   10
value: 2  3  5  7  11  13  17  19  23  29  31
Indexed by the number itself, 1 = prime (sparse):
index: 0  1  2  3  4  5  6  7  8  9  10  11
value: 0  0  1  1  0  1  0  1  0  0  0   1
Programming Languages
Array (index → prime):
index: 0  1  2  3  4   5   6   7   8   9   10
value: 2  3  5  7  11  13  17  19  23  29  31
Hash (prime → 1):
key:   2  3  5  7  11  13  17  19  23  29  31
value: 1  1  1  1  1   1   1   1   1   1   1
Hash - “associative array”
• array indices can be any unique set of “keys”
• excellent for accessing in random patterns (in sparse data)
(Ex. “is 19 a prime number?”)
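The question can be answered with a single hash lookup, as in this minimal sketch. It covers only the primes up to 31, and the helper name is_listed_prime is invented for the example.

```perl
use strict;
use warnings;

my @primes = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31);

# Build the hash: each prime becomes a key with the value 1.
my %prime;
foreach my $p (@primes) {
    $prime{$p} = 1;
}

# Answer the question with one lookup instead of scanning the array.
# Only knows about the primes stored above.
sub is_listed_prime {
    my ($n) = @_;
    return exists $prime{$n} ? 1 : 0;
}

print "19 is prime\n"     if is_listed_prime(19);
print "20 is not prime\n" unless is_listed_prime(20);
```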
Programming Languages
Scalar -- $foo, $a1, $a2000
Array -- @array, @ii
to access the element at index $i
$array[$i]
the last index of an array is $#array
the number of elements in an array is
$num_elements = $#array + 1;
OR
$num_elements = @array;
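A short sketch of these array operations on a four-element list (the contents are arbitrary):

```perl
use strict;
use warnings;

my @array = ("A", "C", "G", "T");

my $first      = $array[0];          # indices start at 0
my $last_index = $#array;            # 3 for a four-element array
my $last       = $array[$#array];    # the final element
my $count      = @array;             # an array in scalar context gives its length

print "first=$first last=$last last_index=$last_index count=$count\n";
```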
Programming Languages
Hash -- %hash, %env
to access the element with index of $i
$hash{$i}
to get a list of keys used in a hash
@key_list = keys(%hash);
to determine how many keys are in a hash
$num_elements = @key_list;
OR
$num_elements = keys(%hash);
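The same operations on a small hash, as a sketch. The names and hair colours are made-up sample data.

```perl
use strict;
use warnings;

# Hypothetical data: hair colour keyed by name.
my %hair_color = (
    Bob    => "brown",
    Stacie => "blond",
    Diane  => "red",
);

my $color    = $hair_color{"Stacie"};     # look up one value by its key
my @key_list = sort keys(%hair_color);    # keys come back unordered, so sort them
my $num_keys = keys(%hair_color);         # keys() in scalar context gives the count

print "Stacie: $color\n";
print "$num_keys keys: @key_list\n";
```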
Programming Languages
Control of Program Execution
if -- executes a block of code, if the condition evaluates to TRUE
if($light eq "green") {
    continue_driving();
}
if( ($light eq "green") && ($no_traffic) ) {
    continue_driving();
}
Programming Languages
In many cases, a simple if statement is not sufficient, as multiple alternative outcomes need to be evaluated.
if($light eq "green") {
    continue_driving();
} else {
    stop_car();
}
if($light eq "green") {
    continue_driving();
} elsif($light eq "red") {
    stop_car();
} else {
    go_fast_to_beat_the_yellow();
}
Programming Languages
Control of Program Execution
Sometimes you need to iterate through a statement multiple times...
Looping constructs:
for (…) { … }
foreach $var (@list) { … }
while (COND) { … }
Programming Languages
Foreach Loop…
foreach $var (@list) {
    do_stuff($var);
}
foreach $name (@name_list) {
    print "Name = $name\n";
}
foreach $name (@name_list) {
    if($hair_color{$name} eq "blond") {
        print "$name has blond hair.\n";
    }
}
Programming Languages
for (INIT; COND; POST) {
    do_stuff();
}
for ($i=0; $i < 50; $i++) {
    print "i = $i\n";
}
for ($i=0; $i < 50; $i++) {
    if($prime{$i} == 1) {
        print "$i is prime!\n";
    } else {
        print "$i is not prime.\n";
    }
}
Programming Languages
while (COND) {
    do_stuff();
}
while($line = <FILE_HANDLE>) {
    print "$line";
}
while($flag == 0) {
    if($prime{$position} == 1) {
        $flag = 1;
    } else {
        $position++;
    }
}
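The three looping constructs can be tried together in one runnable sketch. It assumes a %prime hash holding only the primes up to 31, and it adds an upper bound to the while loop so the search cannot run forever when no prime remains. The subroutine name next_prime_at_or_after is made up for this example.

```perl
use strict;
use warnings;

# Assumed lookup table: the primes up to 31, each mapped to 1.
my %prime = map { $_ => 1 } (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31);

# while: advance $position until it lands on a listed prime.
# Stops at 31 because the table holds nothing larger.
sub next_prime_at_or_after {
    my ($position) = @_;
    while ($position <= 31) {
        return $position if $prime{$position};
        $position++;
    }
    return;    # no listed prime at or after the starting position
}

# for: count how many numbers below 20 are prime.
my $count = 0;
for (my $i = 0; $i < 20; $i++) {
    $count++ if $prime{$i};
}

# foreach: build a label for each element of a list.
my @labels;
foreach my $n (8, 14, 24) {
    push @labels, "$n -> " . next_prime_at_or_after($n);
}

print "$count primes below 20\n";
print "$_\n" foreach @labels;
```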
Intermission
Review of Perl Concepts
Data Types
• scalar
• array
• hash
Input/Output
open(FILEHANDLE, "filename");
$line = <FILEHANDLE>;
print "$line";
Arithmetic Operations
+, -, *, /, %
&&, ||, !
Review of Perl Concepts
Control Structures
• if
• if/else
• if/elsif/else
• foreach
• for
• while
Regular Expressions
General approach to the problem of pattern matching
RE’s are a compact method for representing a set of possible strings without explicitly specifying each alternative.
For this portion of the discussion, I will be using {} to represent the scope of a set.
{A}
{A, AA}
Ø = the empty string, so {Ø} is the set containing only the empty string
Regular Expressions
In addition, the [] will be used to denote possible alternatives.
[AB] = {A,B}
With just these semantics available, we can begin building simple Regular Expressions.
[AB][AB] = {AA, AB, BA, BB}
AA[AB]BB = {AAABB, AABBB}
Regular Expressions
Additional Regular Expression components
* = 0 or more of the specified symbol
+ = 1 or more of the specified symbol
A+ = {A, AA, AAA, … }
A* = {Ø, A, AA, AAA, … }
AB* = {A, AB, ABB, ABBB, … }
[AB]* = {Ø, A, B, AA, AB, BA, BB, AAA, … }
Regular Expressions
What if we want a specific number of iterations?
A{2,4} = {AA, AAA, AAAA}
[AB]{1,2} = {A, B, AA, AB, BA, BB}
What if we want any character except one?
[^A] = {B}   (for an alphabet containing only A and B)
What if we want to allow any symbol?
. = {A, B}   (for the same two-letter alphabet)
.* = {Ø, A, B, AA, AB, BA, BB, … }
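These sets can be checked directly in Perl. The sketch below tests a handful of candidate strings against a pattern and keeps the ones that match in full. The helper matching_set and the candidate list are invented for this illustration.

```perl
use strict;
use warnings;

# Return the candidate strings that a pattern matches in full.
# ^ and $ anchor the pattern so that it must cover the whole string.
# matching_set is a hypothetical helper, not a built-in.
sub matching_set {
    my ($pattern, @candidates) = @_;
    return grep { /^(?:$pattern)$/ } @candidates;
}

my @candidates = ("", "A", "B", "AA", "AB", "AAA", "AAAA", "AAAAA", "ABB");

my @a_2_4 = matching_set('A{2,4}', @candidates);
my @ab_st = matching_set('AB*',    @candidates);
my @not_a = matching_set('[^A]',   @candidates);

print "A{2,4}: @a_2_4\n";
print "AB*:    @ab_st\n";
print "[^A]:   @not_a\n";
```

Because the candidates are built only from A and B, [^A] selects just the single-character string B here.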
Regular Expressions
All of these operations are available in Perl
Several “shortcuts”
\d = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
\w+\s\w+ = {…, Hello World, … }
Name            Definition               Code
Whitespace      [space, tab, new-line]   \s
Word character  [a-zA-Z_0-9]             \w
Digit           [0-9]                    \d
Pattern Matching
Perl supports built-in operations for pattern matching, substitution, and character replacement
Pattern Matching
if($line =~ m/Rn\.\d+/) {    # \. matches a literal dot
    ...
}
In Perl, an RE can match any part of the string rather than the whole string. Anchors tie the match to a position:
^ - beginning of string
$ - end of string
Pattern Matching
Back references…
if($line =~ m/(Rn\.\d+)/) {    # \. matches a literal dot
    $UniGene_label = $1;
}
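Wrapped in a subroutine, the back reference can be tried on a sample line. This is a sketch: the subroutine name unigene_label and the input line are made up for illustration.

```perl
use strict;
use warnings;

# Return the first UniGene label of the form "Rn.<digits>" found in a line,
# or undef when the line contains none.
sub unigene_label {
    my ($line) = @_;
    if ($line =~ m/(Rn\.\d+)/) {
        return $1;    # $1 holds whatever the first ( ) group matched
    }
    return undef;
}

# Hypothetical input line, made up for illustration.
my $label = unigene_label("ID          Rn.1234");
print "label = $label\n";
```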
Regular Expressions
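$file = "my_fasta_file";
open(IN, $file) or die "Cannot open $file: $!";
$line_count = 0;
while($line = <IN>) {
    # each FASTA sequence starts with a header line beginning with ">"
    if($line =~ m/^>/) {
        $line_count++;
    }
}
close(IN);
print "There are $line_count FASTA sequences in $file.\n";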
Pattern Matching
UniGene data file
ID Bt.1
TITLE Cow casein kinase II alpha …
EXPRESS ;placenta
PROTSIM ORG=Caenorhabditis elegans; …
PROTSIM ORG=Mus musculus; PROTGI=…
SCOUNT 2
SEQUENCE ACC=M93665; NID=g162776; …
SEQUENCE ACC=BF043619; NID=…
//
ID Bt.2
TITLE Bos taurus cyclin-dependent …
...
Pattern Matching
Let’s write a small Perl program to determine how many clusters there are in the Bos taurus UniGene file.
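One way such a program might look is sketched below. It assumes that every cluster record begins with exactly one line starting with ID followed by a Bt. label, as in the excerpt. The subroutine name count_clusters is invented, and a short in-memory sample stands in for the real data file.

```perl
use strict;
use warnings;

# Count UniGene clusters by counting the records' ID lines.
# Takes an open filehandle so it works on any input source.
sub count_clusters {
    my ($fh) = @_;
    my $count = 0;
    while (my $line = <$fh>) {
        $count++ if $line =~ m/^ID\s+Bt\.\d+/;
    }
    return $count;
}

# A cut-down, two-record sample standing in for the real data file.
my $sample = <<'END';
ID          Bt.1
SCOUNT      2
//
ID          Bt.2
//
END

# Read the sample through an in-memory filehandle.
open(my $in, "<", \$sample) or die "Cannot read sample: $!";
my $clusters = count_clusters($in);
close($in);

print "There are $clusters clusters.\n";
```

To run it on the real file, open that file instead of the sample string and pass the resulting filehandle to count_clusters.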
Pattern Matching
Now we’ll build a Perl program that can write an HTML file containing some basic links based on the Bos taurus UniGene clustering.
Important:
http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=GID_HERE&dopt=GenBank
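A sketch of the link-building step follows. It assumes that the number to place at GID_HERE is the NID value of a SEQUENCE line without its leading "g"; that reading of the data format is an assumption, as is the subroutine name sequence_link.

```perl
use strict;
use warnings;

# URL template from the slide; GID_HERE is replaced by the sequence's GI number.
my $url_template = "http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi"
                 . "?cmd=Retrieve&db=Nucleotide&list_uids=GID_HERE&dopt=GenBank";

# Turn one SEQUENCE line into an HTML link, or return undef if the
# line has no ACC/NID pair. Assumes the GI number is the NID value
# without its leading "g".
sub sequence_link {
    my ($line) = @_;
    return undef unless $line =~ m/^SEQUENCE\s+ACC=(\w+);\s*NID=g(\d+)/;
    my ($acc, $gid) = ($1, $2);
    my $url = $url_template;
    $url =~ s/GID_HERE/$gid/;
    $url =~ s/&/&amp;/g;    # escape & so the attribute is valid HTML
    return "<a href=\"$url\">$acc</a>";
}

my $link = sequence_link("SEQUENCE    ACC=M93665; NID=g162776;");
print "$link\n";
```

A full program would call this for every SEQUENCE line of a cluster and print the links between the opening and closing tags of an HTML page.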
Substitution
Pattern matching is useful for counting or indexing items, but to modify the data, substitution is required.
Substitution searches a string for a PATTERN and, if found, replaces it with REPLACEMENT.
$line =~ s/PATTERN/REPLACEMENT/;
Returns a value equal to the number of times the pattern was found and replaced.
$result = $line =~ s/PATTERN/REPLACEMENT/;
Substitution
Substitution can take several different options, specified after the final slash.
The most useful are
g - global (can substitute at more than one location)
i - case insensitive matching
$string = "One fish, Two fish, Red fish, Blue fish.";
$string =~ s/fish/dog/g;
print "$string\n";
One dog, Two dog, Red dog, Blue dog.
Substitution
Example: Removing leading and trailing white-space
$line =~ s/^\s*(.*?)\s*$/$1/;
A *? performs a minimal match: it stops at the first point at which the remainder of the expression can be matched.
$line =~ s/^\s*(.*)\s*$/$1/;
This statement will not remove trailing white-space; instead the white space is retained by the .*
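The difference between the minimal and the greedy form shows up when both are run on the same padded string, as in this small sketch (the sample string is arbitrary):

```perl
use strict;
use warnings;

my $padded = "   ACGT ACGT   ";

# Minimal match: (.*?) leaves the trailing spaces for \s* to consume.
(my $minimal = $padded) =~ s/^\s*(.*?)\s*$/$1/;

# Greedy match: (.*) swallows the trailing spaces itself.
(my $greedy = $padded) =~ s/^\s*(.*)\s*$/$1/;

print "minimal: [$minimal]\n";
print "greedy:  [$greedy]\n";
```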
Character Replacement
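A similar operation to substitution is character replacement.
$line =~ tr/a-z/A-Z/;              # convert lower case to upper case
$count_CG = $line =~ tr/CG/CG/;    # count the C and G characters
$line =~ tr/ACGT/TGCA/;            # complement a DNA sequence
The complement cannot be produced by successive substitutions, because each later substitution rewrites the output of an earlier one:
$line =~ s/A/T/g;
$line =~ s/C/G/g;
$line =~ s/G/C/g;
$line =~ s/T/A/g;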
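Running both approaches on a short sequence makes the difference visible. The sample sequence is arbitrary, and the comments trace the intermediate values.

```perl
use strict;
use warnings;

my $sequence = "AACG";

# tr maps every character in a single pass.
(my $with_tr = $sequence) =~ tr/ACGT/TGCA/;

# Successive substitutions each see the previous one's output.
my $with_s = $sequence;
$with_s =~ s/A/T/g;    # TTCG
$with_s =~ s/C/G/g;    # TTGG
$with_s =~ s/G/C/g;    # TTCC
$with_s =~ s/T/A/g;    # AACC

print "tr: $with_tr\n";
print "s:  $with_s\n";
```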
Character Replacement
$count_CG = 0;
$count_AT = 0;
while($line = <IN>) {
    # add each line's counts to the running totals
    $count_CG += ($line =~ tr/CG/CG/);
    $count_AT += ($line =~ tr/AT/AT/);
}
$total = $count_CG + $count_AT;
$percent_CG = 100 * ($count_CG/$total);
print "The sequence was $percent_CG CG-rich.\n";
Subroutines
One of the most important aspects of programming is dealing with complexity. A program that is written in one large section is generally more difficult to debug. Thus a major strategy in program development is modularization.
Break the program up into smaller portions that can each be developed and tested independently.
Makes the program more readable, and easier to maintain and modify.
Subroutines
EXAMPLE: Reading in sequences from UniGene.all.seq file
Multiple FASTA sequences in a single file, each annotated with the UniGene cluster they belong to.
GOAL: Make an output file consisting only of the longest sequence from each cluster.
Subroutines
ISSUES:
1. Want to design and implement a usable program.
2. Use subroutines where useful to reduce complexity.
3. Minimize the memory requirements.
(human UniGene seqs > 2 GB)
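A possible structure for such a program is sketched below, under two assumptions that are not confirmed by the slides: each FASTA header carries its cluster as a tag of the form /ug=Bt.1, and all sequences of one cluster are adjacent in the file. Under those assumptions only the current record and the best record of the current cluster have to be held in memory. All subroutine names are invented for this example.

```perl
use strict;
use warnings;

# Extract the cluster label from a FASTA header line.
# Assumes the header carries a "/ug=Bt.123" style tag.
sub cluster_of {
    my ($header) = @_;
    return $header =~ m{/ug=(\w+\.\d+)} ? $1 : undef;
}

# Write a record (header plus sequence lines) to the output handle.
sub write_record {
    my ($out, $record) = @_;
    print {$out} $record->{header}, $record->{body} if defined $record;
}

# Copy the longest sequence of each cluster from $in to $out.
# Assumes all sequences of a cluster are adjacent in the input.
sub longest_per_cluster {
    my ($in, $out) = @_;
    my ($best, $current);

    # Compare the record just completed against the best one so far.
    my $finish = sub {
        return unless defined $current;
        if (defined $best && $best->{cluster} ne $current->{cluster}) {
            write_record($out, $best);    # cluster changed: emit and reset
            undef $best;
        }
        $best = $current
            if !defined $best || $current->{length} > $best->{length};
    };

    while (my $line = <$in>) {
        if ($line =~ m/^>/) {
            $finish->();
            my $cluster = cluster_of($line);
            die "No cluster tag in header: $line" unless defined $cluster;
            $current = { header => $line, body => "", length => 0,
                         cluster => $cluster };
        } elsif (defined $current) {
            $current->{body} .= $line;
            (my $bases = $line) =~ s/\s+//g;    # ignore white-space when measuring
            $current->{length} += length $bases;
        }
    }
    $finish->();
    write_record($out, $best);
}

# Tiny made-up input: two sequences in Bt.1, one in Bt.2.
my $input = <<'END';
>seq1 /ug=Bt.1
ACGT
>seq2 /ug=Bt.1
ACGTACGT
>seq3 /ug=Bt.2
AC
END

my $output = "";
open(my $in,  "<", \$input)  or die "Cannot read input: $!";
open(my $out, ">", \$output) or die "Cannot write output: $!";
longest_per_cluster($in, $out);
close($in);
close($out);
print $output;
```

Splitting the work into cluster_of, write_record and longest_per_cluster lets each piece be tested on a few lines of input before the program is run on the full file.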