a stroll through perl

Upload: derry

Post on 07-Jan-2016

Page 1: A Stroll through Perl

A Stroll through Perl

• (R L Schwartz & T Christiansen, O’Reilly)
• PERL = Practical Extraction and Report Language.
• A major strength of Perl is the recognition and substitution of text sequences described by patterns called regular expressions.
• This is useful for:
  • Web searching - are the query keywords in this web page?
  • Computation of frequencies in a document collection, e.g. to produce a stoplist, or mid-frequency terms for automatic indexing.
  • Making finite state transducers e.g. pluraliser, stemmer, americanizer.
  • Dialogue systems, e.g. ELIZA.

Page 2: A Stroll through Perl

“Hello World” Program

    #!/usr/bin/perl -w
    print "Hello, world!\n";

• The first line means “this is a Perl program”. -w tells Perl to generate warning messages.
• Apart from the first line, all Perl statements end with a semicolon ;
• To run a Perl program from UNIX:

    perl programname.pl

• Comments:

    # anything from the hash sign to the end of the line is a comment

Page 3: A Stroll through Perl

Scalar Variables

• Now get the “Hello, world” program to call you by your name. To do this, we need a place to hold the name, a way to ask for the name, and a way to get a response.

• One place to hold values (like a name) is as a scalar variable. Here we will use the scalar variable $name to hold your name. A scalar variable starts with $ and can hold either a single number or a string (sequence of characters).

Page 4: A Stroll through Perl

print, <STDIN>, chomp

• The program needs to ask for the name (prompt): use the print function.
• The way to get a line from the terminal is with the <STDIN> construct, which grabs one line of input. We assign this input to the $name variable. This gives us the program:

    print "What is your name?";
    $name = <STDIN>;

• The value of $name has a terminating newline \n. To get rid of that, we use the chomp function:

    chomp ($name);

• Now we can reply with:

    print "Hello, $name!\n";

• (what does this do?)

Page 5: A Stroll through Perl

Putting it all together we get:

    #!/usr/bin/perl -w
    print "What is your name?";
    $name = <STDIN>;
    chomp ($name);
    print "Hello, $name!\n";

Page 6: A Stroll through Perl

Adding Choices

• Let’s say we have a special greeting for Randal, but we want an ordinary greeting for anyone else. To do this, we need to compare the name that was entered with the string Randal, and if it’s the same, do something special. Let’s add a C-like if-then-else branch and a comparison to the program:

    #!/usr/bin/perl -w
    print "What is your name?";
    $name = <STDIN>;
    chomp ($name);
    if ($name eq "Randal") {
        print "Hello Sir Randal!\n";
    }
    else {
        print "Hello, $name!\n";
    }

Page 7: A Stroll through Perl

Guessing the Secret Password

• What does this code do?

    #!/usr/bin/perl -w
    $secretword = "llama";    # the secret word
    print "What is the secret password?";
    $guess = <STDIN>;
    chomp($guess);
    while ($guess ne $secretword) {
        print "Wrong, try again:\n";
        $guess = <STDIN>;
        chomp($guess);
    }

• First, we define the secret word by putting it into another scalar variable, $secretword. The person is asked (using print) for a guess, which goes into $guess. The guess is compared with the secret word using the ne operator, which returns true if the strings are not equal (this is the logical opposite of the eq operator). The result of the comparison controls a while loop, which executes the block as long as the ne comparison remains true.

Page 8: A Stroll through Perl

Arrays

• We can store several secret words in a sort of list, a data structure called an array. Each element of the array is a separate scalar variable that can be independently set or accessed. The entire array can also be given a value in one fell swoop. We can assign a value to the entire array named @words so that it contains three possible good passwords.

    @words = ("camel","llama","alpaca");

• or

    @words = qw(camel llama alpaca);

• Note arrays begin with @, while scalar variables begin with $.
• Once the array is assigned, we can access each element using a subscript reference. So $words[0] is camel, $words[1] is llama, and $words[2] is alpaca. The subscript can be an expression as well, so if we set $i = 2 then $words[$i] is alpaca.
• Note: array elements start with $ rather than @ because they refer to a single element of an array rather than the whole array.
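A minimal sketch of the array operations described above. The use of scalar() to count elements and $#words for the last index goes slightly beyond the slide, and the replacement word vicuna is made up for illustration.

```perl
use strict;
use warnings;

my @words = qw(camel llama alpaca);

my $count = scalar(@words);   # number of elements in the array
my $last  = $#words;          # index of the last element
my $i     = 2;                # a subscript can be any expression

print "There are $count words\n";
print "The last one is $words[$last]\n";
print "Element $i is $words[$i]\n";

# Each element is a separate scalar and can be set on its own
$words[1] = "vicuna";         # illustrative value, not from the slides
print "Now the list is: @words\n";
```

Note the $ on $words[1]: it names one scalar element, while @words names the whole array.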

Page 9: A Stroll through Perl

More than one Secret Word

    #!/usr/bin/perl -w
    @words = qw(camel llama alpaca);
    print "What is the secret password?";
    $guess = <STDIN>;
    chomp($guess);
    $i = 0;
    $correct = "maybe";
    while ($correct eq "maybe") {
        if ($words[$i] eq $guess) {
            $correct = "yes";
        }
        elsif ($i < 2) {
            $i = $i + 1;
        }
        else {
            print "Wrong, try again:";
            $guess = <STDIN>;
            chomp ($guess);
            $i = 0;
        }
    }

• This program also shows the elsif block of the if-then-else statement. Perl doesn’t have C’s switch statement, so in Perl we tend to compare a set of conditions in an if-elsif-elsif-elsif-else type chain.

Page 10: A Stroll through Perl

Hashes

• Giving each person a different secret word:
• The easiest way to store such a table in Perl is with a hash.
• Each element of the hash holds a separate scalar value (just like an array) but the elements of a hash are referenced by a key, which can be any scalar value (string or number).
• To create a hash called %words (notice the % rather than @) we can write:

    %words = qw(
        fred camel
        barney llama
        betty alpaca
        wilma alpaca
    );

• To find the secret word for Betty, we need to use betty as the key in a reference to the hash %words, via some expression such as
• $words{"betty"} will return alpaca
• or
• $person = "betty";
• $words{$person} will also return alpaca.

Page 11: A Stroll through Perl

Trying to look up a word not in the hash

• When we look up someone’s secret word, if their name is not one of the hash keys, the value of $secretword will be undefined, which compares equal to the empty string (with a warning under -w), e.g:

• { instantiate %words, get $name first, then:}

    $secretword = $words{$name};
    if ($secretword eq "") {
        print "secret word not found\n";
    }
    else {
        print "your secret word is $secretword";
    }
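One way to avoid comparing an undefined value with the empty string is to ask whether the key is present at all, using Perl's exists function. This is a sketch of an alternative to the test above, not the slide's own method; lookup_word is a hypothetical helper name.

```perl
use strict;
use warnings;

my %words = qw(
    fred   camel
    barney llama
    betty  alpaca
    wilma  alpaca
);

# lookup_word is a made-up name for this sketch
sub lookup_word {
    my ($name) = @_;
    # exists is true only if $name is one of the keys of %words
    if (exists $words{$name}) {
        return "your secret word is $words{$name}";
    }
    return "secret word not found";
}

print lookup_word("betty"), "\n";
print lookup_word("dino"), "\n";
```

Because exists checks the key rather than the value, it raises no warning under -w when the name is missing.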

Page 12: A Stroll through Perl

Handling Varying Input Formats

• How do we make our password checker accept Randal, randal, or Randal L. Schwartz?

    if ($name =~ /^Randal\b/i) {
        # yes, it matches
    }
    else {
        # no, it doesn't
    }

• Notes: eq is for exact equality, =~ for pattern matching.
• The regular expression is delimited by forward slashes.
• /^Randal/ means any string starting with Randal.
• /^Randal\b/ means there must be a word boundary after Randal (the next character, if any, is not a letter, digit or underscore), so Randall is excluded.
• /^Randal\b/i means that we ignore case, so randal is accepted.
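The notes above can be checked by trying the pattern on a few names. This is a small sketch; is_randal is a hypothetical helper name and the extra test names are made up.

```perl
use strict;
use warnings;

# is_randal is a made-up name for this sketch
sub is_randal {
    my ($name) = @_;
    # ^ anchors at the start, \b demands a word boundary, i ignores case
    return $name =~ /^Randal\b/i ? 1 : 0;
}

foreach my $name ("Randal", "randal", "Randal L. Schwartz", "Randall", "Sir Randal") {
    my $verdict = is_randal($name) ? "matches" : "does not match";
    print "$name: $verdict\n";
}
```

Randall fails because there is no word boundary between the two l characters, and Sir Randal fails because of the ^ anchor.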

Page 13: A Stroll through Perl

Two Text Converters

• We can write a case converter by using the translate operator.

    $name =~ tr/A-Z/a-z/;

• The slashes delimit the searched-for and replacement character lists. A-Z stands for all the characters from A to Z, so the two lists are the same length (26 characters).
• We can replace the word Eurasia with Eastasia using the substitution operator.

    $temp =~ s/Eastasia/XXXX/;
    $enemy =~ s/Eurasia/Eastasia/;
    $ally =~ s/XXXX/Eurasia/;
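A short sketch of both operators applied to sample strings. The sample text is made up, and the g modifier (replace every occurrence) is shown alongside the plain form for contrast.

```perl
use strict;
use warnings;

my $name = "Randal L. Schwartz";
# tr changes $name in place and returns how many characters it translated
my $changed = ($name =~ tr/A-Z/a-z/);
print "$name ($changed letters changed)\n";

my $enemy = "We are at war with Eurasia. Eurasia is the enemy.";

# Copy the string first, then substitute in the copy
(my $first_only = $enemy) =~ s/Eurasia/Eastasia/;    # first occurrence only
(my $everywhere = $enemy) =~ s/Eurasia/Eastasia/g;   # every occurrence

print "$first_only\n";
print "$everywhere\n";
```

Both operators need =~ to bind them to a variable; with a plain = the result of tr applied to $_ would be assigned instead.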

Page 14: A Stroll through Perl

Making it Modular

• Perl provides subroutines that have parameters and return values. A subroutine is defined once in a program, and can be used repeatedly by being invoked from any expression.

• Let’s create a subroutine called good_word that takes a name and a guessed word, and returns true if the word is correct and false if not:

    sub good_word {
        my($somename, $someguess) = @_;    # name the parameters
        if ($words{$somename} eq $someguess) {
            return 1;    # true
        }
        else {
            return 0;    # false
        }
    }

Page 15: A Stroll through Perl

Subroutines

• First, the definition of a subroutine consists of a reserved word sub followed by the subroutine name followed by a block of code { delimited by curly braces }. The definition can go anywhere in the program file, though most people put it at the end.

• The first line within this particular definition is an assignment that copies the values of the two parameters of this subroutine into two local variables named $somename and $someguess.

• The my() defines the two variables as private to the enclosing block - in this case the whole subroutine - and the parameters are initially in a special local array called @_.

• A return statement can be used to make the subroutine immediately return to its caller with the supplied value.

• Note that the subroutine assumes that the value of the %words hash is set by the main program.

Page 16: A Stroll through Perl

Let’s Integrate this with the Rest of the Program

    #!/usr/bin/perl
    %words = qw{
        fred camel
        barney llama
        betty alpaca
        wilma alpaca
    };
    print "What is your name? ";
    $name = <STDIN>;
    chomp($name);
    print "What is the secret word? ";
    $guess = <STDIN>;
    chomp($guess);
    while (! good_word($name, $guess)) {
        print("Wrong, try again: ");
        $guess = <STDIN>;
        chomp($guess);
    }
    # insert definition of good_word here …

Page 17: A Stroll through Perl

While, !

• The condition of the while loop calls the subroutine good_word. Here we see an invocation of the subroutine, passing it two parameters, $name and $guess. Inside the subroutine, the value of $somename is set from the first parameter, $name, and the value of $someguess is set from the second parameter $guess.

• The value returned by the subroutine (either 1 or 0) is logically inverted with the prefix ! (logical not) operator. This expression returns true if the expression following is false, and returns false if the expression following is true. The overall meaning is “while it’s not a good word …”
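The loop and the subroutine can be tried together without a terminal. In this sketch the guesses come from an array rather than <STDIN>, which is an assumption made so the example runs unattended; the exists check for unknown names is also an addition to the slide's version.

```perl
use strict;
use warnings;

my %words = qw(fred camel barney llama betty alpaca wilma alpaca);

sub good_word {
    my ($somename, $someguess) = @_;
    # an unknown name has no secret word, so no guess can be right
    return 0 unless exists $words{$somename};
    return $words{$somename} eq $someguess ? 1 : 0;
}

# Simulated input: each element stands in for one line typed by the user
my @guesses = qw(camel llama alpaca);
my $name    = "betty";
my $tries   = 0;

my $guess = shift @guesses;
$tries++;
while (! good_word($name, $guess)) {
    print "Wrong, try again: $guess is not it\n";
    $guess = shift @guesses;
    $tries++;
}
print "Correct after $tries tries\n";
```

The loop body runs only while good_word returns 0, because ! turns that 0 into a true value.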

Page 18: A Stroll through Perl

Moving the Secret Word List into a separate file

• Suppose we wanted to share the secret word list among three programs, e.g. for simultaneous updating. We can put the word list into a file and then read the file to get the word list into the program. To do this, we need to create an I/O channel called a filehandle. Your Perl program automatically gets three filehandles called STDIN, STDOUT and STDERR. Now we want another handle attached to a file of our own choice.

    sub init_words {
        open (WORDSLIST, "wordslist") || die "can't open wordslist: $!";
        while ( defined ($name = <WORDSLIST>)) {
            chomp ($name);
            $word = <WORDSLIST>;
            chomp ($word);
            $words{$name} = $word;
        }
        close (WORDSLIST) || die "couldn't close wordslist: $!";
    }

Page 19: A Stroll through Perl

The (arbitrary) form of the word list

    fred
    camel
    barney
    llama
    betty
    alpaca
    wilma
    alpaca

• The open function initialises a filehandle named WORDSLIST by associating it with a file named wordslist in the current directory.
• while ( defined ($name = <WORDSLIST>) ) { i.e. while there are still values in the data file to read
• The die function is frequently used to exit the program with an error message in case something goes wrong, e.g. the word list file is not found. $! contains the system error message explaining what went wrong.
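The reading loop can be tried without creating a file. This sketch assumes Perl 5.8 or later, where opening a reference to a string gives an in-memory filehandle; it also uses a lexical filehandle ($fh) instead of the bareword WORDSLIST, and guards against an odd number of lines, neither of which is in the slide.

```perl
use strict;
use warnings;

# Stand-in for the contents of the wordslist file: name, word, name, word ...
my $data = "fred\ncamel\nbarney\nllama\nbetty\nalpaca\nwilma\nalpaca\n";

my %words;
open(my $fh, '<', \$data) || die "can't open in-memory word list: $!";
while (defined(my $name = <$fh>)) {
    chomp($name);
    my $word = <$fh>;
    last unless defined $word;    # dangling name with no word: stop reading
    chomp($word);
    $words{$name} = $word;
}
close($fh) || die "couldn't close word list: $!";

foreach my $name (sort keys %words) {
    print "$name => $words{$name}\n";
}
```

To read a real file, replace \$data with the file name; the loop itself does not change.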

Page 20: A Stroll through Perl

Three More Loops

• 1. To print out a sequence of scalar values:
• This example prints the numbers 1 to 10, each followed by a space:

    for ($i = 1; $i <= 10; $i++) {
        print "$i ";
    }

• The above code is very similar to C++.
• 2. To print out the contents of an array:

    foreach $i (@somelist) {
        print "$i\n";
    }

• The foreach statement takes a list of values and assigns them one at a time to a scalar variable, executing a block of code with each successive value. Here $i holds the element itself, not its index.
• 3. To print out the contents of a hash:

    foreach $key (keys(%freqhash)) {
        print "$key $freqhash{$key}\n";
    }

Page 21: A Stroll through Perl

Regular Expressions

• See Chapter 7 of “Learning Perl”, by R L Schwartz & T Christiansen, O’Reilly, 1993.

• A regular expression is a pattern to be matched against a string.

• e.g. Is put found in computer? Succeeds
• Is michael found in computer? Fails
• Sometimes match success or failure is all you are concerned about. Other times you want to match and replace.
• e.g. Find put in computer and replace with pil. If the match is unsuccessful, nothing happens.

• $_ is Perl’s default variable – we don’t have to declare it.

Page 22: A Stroll through Perl

Search, Substitution

• Print out every line in the file specified on the command line which contains abc:

    while (<>) {
        if (/abc/) {
            print $_;
        }
    }

• Substitution. If abc is found in $_, replace it with def (g means every time).

    s/abc/def/g;

Page 23: A Stroll through Perl

Patterns

• A regular expression is a pattern. Some parts of the pattern match single characters, others match multiple characters.
• . stands for any single character except \n (newline).
• /a./ any two-character sequence that starts with a but is not a\n
• /[abcde]/ matches a, b, c, d, or e. (“character class”)
• /[a-zA-Z0-9_]/ matches a Perl “word” character.
• /[^0-9]/ any NON-digit (“negated character class”)
• character class abbreviations:
  • \d digit
  • \D non-digit
  • \w Perl “word” character
  • \W not a Perl “word” character
  • \s space character (\r \t \n \f or “ “)
• All of the above match one character. We now look at “grouping patterns”:
  • * zero or more of the immediately previous character or character class.
  • + one or more of the immediately previous character or character class.
  • ? zero or one of the immediately previous character or character class.
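The pattern elements listed above can be tried against sample strings. In this sketch the patterns are stored with qr// (Perl's quote-regex operator) so they can be kept in a list; the sample strings are made up, and the $ end-of-string anchor is used although the slide does not introduce it.

```perl
use strict;
use warnings;

# Each pair is a pattern and a string to try it on
my @tests = (
    [ qr/a./,      "cab"       ],   # a followed by any character except newline
    [ qr/[^0-9]/,  "2024"      ],   # needs at least one non-digit
    [ qr/^\d+$/,   "2024"      ],   # one or more digits and nothing else
    [ qr/^\w+$/,   "two words" ],   # the space is not a word character
    [ qr/colou?r/, "color"     ],   # ? makes the u optional
    [ qr/ab*c/,    "ac"        ],   # * allows zero b characters
    [ qr/ab+c/,    "ac"        ],   # + needs at least one b
);

my @results;
foreach my $test (@tests) {
    my ($pattern, $string) = @$test;
    my $outcome = $string =~ $pattern ? "matches" : "fails";
    push @results, $outcome;
    print "$pattern against '$string': $outcome\n";
}
```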

Page 24: A Stroll through Perl

Patterns are greedy by default

    $_ = "fred xxxxxx barney";
    s/x+/boom/;

• now $_ = "fred boom barney": x+ matched as many x characters as it could, so all six were replaced.
• /x{3}/ would mean match against exactly xxx.
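A small sketch contrasting the greedy x+ with two alternatives. The non-greedy form x+? is not mentioned on the slide; it is included here only to show what greedy means by comparison.

```perl
use strict;
use warnings;

my $greedy = "fred xxxxxx barney";
$greedy =~ s/x+/boom/;       # x+ takes all six x characters
print "$greedy\n";

my $lazy = "fred xxxxxx barney";
$lazy =~ s/x+?/boom/;        # x+? takes as few as it can: a single x
print "$lazy\n";

my $three = "fred xxxxxx barney";
$three =~ s/x{3}/boom/;      # exactly three, starting at the leftmost x
print "$three\n";
```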

Page 25: A Stroll through Perl

Parentheses as memory, anchoring patterns, alternation

• Parentheses as memory:
  • abc* matches ab, abc, abcc, abccc, abcccc etc.
  • (abc)* matches “”, abc, abcabc, abcabcabc etc.
• Anchoring patterns:
  • /fred\b/; matches fred and alfred but not frederick
  • /\bfred/; matches fred and frederick but not alfred
  • /\bfred\b/; matches fred but not frederick and alfred.
• Alternation:
  • (song|blue)bird matches songbird or bluebird
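The anchoring and alternation claims above can be verified with Perl's grep function, which keeps the list elements for which a test is true. This is an illustrative sketch; blackbird is an extra test word not on the slide.

```perl
use strict;
use warnings;

my @names = qw(fred alfred frederick);

my @ends_with   = grep { /fred\b/ }   @names;   # boundary after fred
my @starts_with = grep { /\bfred/ }   @names;   # boundary before fred
my @whole_word  = grep { /\bfred\b/ } @names;   # boundary on both sides

print "fred\\b matches:    @ends_with\n";
print "\\bfred matches:    @starts_with\n";
print "\\bfred\\b matches: @whole_word\n";

# Alternation: either song or blue, followed by bird
my @birds = grep { /^(song|blue)bird$/ } qw(songbird bluebird blackbird);
print "birds: @birds\n";
```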

Page 26: A Stroll through Perl

Selecting a different target (the =~ operator)

    $a = "hello world";
    if ($a =~ /he/) {
        # do something …
    }
    $a =~ s/hello/goodbye/;

• Special read-only variables

    $_ = "this is a sample string";
    /sam.le/;    # matches "sample" within the string
                 # $` is now "this is a " (the text before the match, including the space)
                 # $& is now "sample"
                 # $' is now " string" (the text after the match, including the space)

• More substitutions

    $_ = "this is a test";
    $new = "quiz";
    s/test/$new/;    # now $_ = "this is a quiz"
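A quick sketch that captures the three special variables after a successful match, printing them in brackets so that the surrounding spaces are visible.

```perl
use strict;
use warnings;

my ($before, $matched, $after);

$_ = "this is a sample string";
if (/sam.le/) {
    # copy the read-only variables before another match overwrites them
    ($before, $matched, $after) = ($`, $&, $');
    print "before:  [$before]\n";
    print "matched: [$matched]\n";
    print "after:   [$after]\n";
}
```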

Page 27: A Stroll through Perl

Basic Data Structures

• $scalar - single number or string
• @array - list e.g.

    @flintstones = qw(fred barney betty wilma);
    # $flintstones[2] is now "betty"
    foreach $member (@flintstones) {
        print "$member\n";
    }

• %hash, e.g. frequency list %freq built up by:

    $freq{"the"} = 100;
    $freq{"chandelier"} = 1;
    $freq{$string} = 5;
    foreach $key (keys (%freq)) {    # once for each key of %freq
        print "$key was found $freq{$key} times\n";    # show key and value
    }

Page 28: A Stroll through Perl

Sorting: arrays

    @x = qw(small medium large);
    @y = sort @x;

• Now @y is (large medium small).

    @x = (15, 27, 9, 49, 14);
    @y = sort @x;

• Now @y is (14, 15, 27, 49, 9), because sort compares its items as strings by default.

    @x = (15, 27, 9, 49, 14);
    @y = sort { $a <=> $b } @x;

• Now @y is (9, 14, 15, 27, 49), because <=> compares numerically.
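The three orderings can be compared side by side. The descending sort, obtained by swapping $a and $b, is an addition to what the slide shows.

```perl
use strict;
use warnings;

my @x = (15, 27, 9, 49, 14);

my @as_strings = sort @x;                 # default: compared as text
my @ascending  = sort { $a <=> $b } @x;   # numeric, smallest first
my @descending = sort { $b <=> $a } @x;   # numeric, largest first

print "as strings: @as_strings\n";
print "ascending:  @ascending\n";
print "descending: @descending\n";
```

sort returns a new list and leaves @x itself unchanged.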

Page 29: A Stroll through Perl

Sorting: hashes

• Sort by alphabetic order of keys, or numeric order of values

    @sortedkeys = sort by_names keys(%freqhash);
    sub by_names {
        return $a cmp $b;
    }
    foreach (@sortedkeys) {
        print "$_ is found $freqhash{$_} times\n";
    }

    @sortedkeys = sort by_number keys(%freqhash);
    sub by_number {
        return $freqhash{$a} <=> $freqhash{$b};
    }
    foreach (@sortedkeys) {
        print "$_ is found $freqhash{$_} times\n";
    }
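For a frequency list it is often more useful to put the most frequent words first. This sketch reverses the numeric comparison and breaks ties alphabetically; by_frequency is a hypothetical routine name and the counts are made up.

```perl
use strict;
use warnings;

my %freqhash = (the => 100, of => 60, chandelier => 1, llama => 1);

# by_frequency is a made-up comparison routine for this sketch
sub by_frequency {
    return $freqhash{$b} <=> $freqhash{$a}   # larger counts first
        || $a cmp $b;                        # equal counts: alphabetical order
}

my @sortedkeys = sort by_frequency keys(%freqhash);
foreach my $key (@sortedkeys) {
    print "$key is found $freqhash{$key} times\n";
}
```

The || works because <=> returns 0 for equal counts, so the second comparison is consulted only for ties.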

Page 30: A Stroll through Perl

Array of arrays (2D arrays)

    @AoA = (
        [ "fred", "barney" ],
        [ "george", "jane", "elroy" ],
        [ "homer", "marge", "bart" ],
    );
    print $AoA[2][1];                # prints "marge"
    for $x (0 .. 9) {
        for $y (0 .. 9) {
            $AoA[$x][$y] = $x * $y;
        }
    }
    while (<>) {                     # read in a line of text
        @tmp = split;                # split elements into a 1D array
        push @AoA, [@tmp];           # add 1D array as the next row of a 2D array
    }
    for $i (0 .. $#AoA) {            # for each row in AoA
        $row = $AoA[$i];             # put a reference to that row into a scalar -
                                     # note $ subscript even so
        for $j (0 .. $#{$row}) {     # for each element of that row
            print "element $i $j is $AoA[$i][$j]\n";
        }
    }
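The build-and-traverse steps can be tried without input files. In this sketch the lines come from an array instead of <>, an assumption made so the example is self-contained.

```perl
use strict;
use warnings;

# Stand-in for lines read from a file
my @lines = ("fred barney", "george jane elroy", "homer marge bart");

my @AoA;
foreach my $line (@lines) {
    my @tmp = split ' ', $line;   # split the line on whitespace
    push @AoA, [@tmp];            # [ ] makes a new anonymous array holding a copy
}

for my $i (0 .. $#AoA) {
    my $row = $AoA[$i];           # $row is a reference to one inner array
    for my $j (0 .. $#{$row}) {   # $#{$row} is the last index of that array
        print "element $i $j is $row->[$j]\n";
    }
}
```

Rows may have different lengths, which is why the inner loop asks each row for its own last index.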

Page 31: A Stroll through Perl

Hashes of Hashes

    %HoH = (
        flintstones => {
            husband => "fred",
            pal => "barney",
        },
        jetsons => {
            husband => "george",
            wife => "jane",
            "his boy" => "elroy",
        },
        simpsons => {
            husband => "homer",
            wife => "marge",
            kid => "bart",
        },
    );

• To add another hash to the hash of hashes, you can simply say:

    $HoH{ mash } = {
        captain => "pierce",
        major => "burns",
        corporal => "radar",
    };

Page 32: A Stroll through Perl

Populating a Hash of Hashes

• Here is one technique for populating a hash of hashes. To read from a file with the following format:

    flintstones: husband=fred pal=barney wife=wilma pet=dino

    while ( <> ) {
        next unless s/^(.*?):\s*//;    # look for characters from start of line to colon
        $who = $1;                     # $1 is first parenthesised part of reg exp
        for $field (split) {           # for each whitespace-separated field left in the input line
            ($key, $value) = split /=/, $field;    # cut each key=value pair at =
            $HoH{$who}{$key} = $value;
        }
    }
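The same technique can be tried on lines held in an array rather than read with <>. The second and third sample lines are made up for illustration.

```perl
use strict;
use warnings;

# Stand-in for the lines of the input file
my @lines = (
    "flintstones: husband=fred pal=barney wife=wilma pet=dino",
    "jetsons: husband=george wife=jane",
    "this line has no colon and is skipped",
);

my %HoH;
foreach my $line (@lines) {
    # strip "family:" from the front, remembering the family name in $1
    next unless $line =~ s/^(.*?):\s*//;
    my $who = $1;
    foreach my $field (split ' ', $line) {
        my ($key, $value) = split /=/, $field;   # cut key=value at the =
        $HoH{$who}{$key} = $value;               # inner hash is created on first use
    }
}

foreach my $family (sort keys %HoH) {
    foreach my $role (sort keys %{ $HoH{$family} }) {
        print "$family: $role=$HoH{$family}{$role}\n";
    }
}
```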

Page 33: A Stroll through Perl

To set a key/value pair, and print out a hash of hashes

• You can set a key/value pair of a hash of hashes as follows:

    $HoH{flintstones}{wife} = "wilma";

• To print out the families, loop through all the keys of the outer hash and then loop through the keys of the inner hash:

    for $family ( keys %HoH) {
        print "$family: ";
        for $role (keys %{ $HoH{$family} } ) {
            print "$role=$HoH{$family}{$role} ";
        }
        print "\n";
    }

Page 34: A Stroll through Perl

More advanced data structures

• Also possible: Arrays of hashes, hashes of arrays, hashes of functions and more elaborate records. See chapter 9 of “Programming Perl” by Larry Wall, Tom Christiansen & Jon Orwant, O’Reilly, 3rd edition.

Page 35: A Stroll through Perl

ELIZA (1)

• Substitutions may use memory
• e.g. /the (.*)er they were, the \1er they will be/
• will match the bigger they were, the bigger they will be but not the bigger they were, the faster they will be.
• Substitutions using memory are very useful in implementing a simple natural-language understanding program like ELIZA (Weizenbaum, 1966), which could carry on conversations like the following:

Page 36: A Stroll through Perl

ELIZA (2)

• User: Men are all alike.
• ELIZA: IN WHAT WAY
• User: They’re always bugging us about something or other.
• ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE
• User: Well, my boyfriend made me come here.
• ELIZA: YOUR BOYFRIEND MADE YOU COME HERE
• User: He says I’m depressed much of the time.
• ELIZA: I AM SORRY TO HEAR THAT YOU ARE DEPRESSED.
• ELIZA works by searching the user’s sentence for regular expressions and substituting them, e.g. s/my/YOUR/ and s/I’m/YOU ARE/, and then:
• s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
• s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
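The rules quoted above can be strung together into a toy responder. This is only a sketch of the idea, not Weizenbaum's program: eliza_reply is a hypothetical function name, the fallback reply and the word boundaries around my and I'm are choices made for this example, and $1 is used in the replacement because that is Perl's preferred spelling of \1 there.

```perl
use strict;
use warnings;

# eliza_reply is a made-up name; the rules are the ones quoted above
sub eliza_reply {
    my ($sentence) = @_;
    my $reply = $sentence;

    # First swap the speaker's point of view
    $reply =~ s/\bmy\b/YOUR/gi;
    $reply =~ s/\bI'm\b/YOU ARE/g;

    # Then try each response rule in turn; s/// returns true if it replaced something
    return $reply
        if $reply =~ s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE $1/;
    return $reply
        if $reply =~ s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/;

    return "IN WHAT WAY";    # fallback chosen for this sketch
}

foreach my $line ("Men are all alike.",
                  "They're always bugging us about something or other.",
                  "He says I'm depressed much of the time.") {
    print "User:  $line\n";
    print "ELIZA: ", eliza_reply($line), "\n";
}
```

The order of the rules matters: the first rule whose pattern matches decides the reply.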
