perl regex

Post on 21-Oct-2015

Overview

Introduction

• Regular expressions are tiny programs in their own special language, built into Perl.

• They allow fast, flexible, and reliable string handling.

• A regular expression, often called a pattern in Perl, is a template that either matches or doesn’t match a given string.

• That is, there are infinitely many possible text strings; a given pattern divides that infinite set into two groups: the ones that match, and the ones that don’t.

• Don’t confuse regular expressions with shell filename-matching patterns, called globs, which are a different sort of pattern with their own rules.

Simple Pattern

• To match a pattern (regular expression) against the contents of $_, simply put the pattern between a pair of forward slashes (/).

$_ = "yabba dabba doo";

if (/abba/) {

print "It matched!\n";

}

• The expression /abba/ looks for that four-letter string in $_; if it finds it, it returns a true value.
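As a small sketch of the same idea, the loop below lets grep set $_ to each string in turn and keeps the ones that /abba/ matches. The sample strings and variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data; grep sets $_ to each element in turn.
my @inputs  = ("yabba dabba doo", "scooby doo");
my @matched = grep { /abba/ } @inputs;

print "Matched: $_\n" for @matched;   # prints: Matched: yabba dabba doo
```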

Unicode Properties

• Unicode characters know something about themselves; they aren’t just sequences of bits.

• Instead of matching on a particular character, you can match a type of character.

• To match a particular property, you put the name in \p{PROPERTY}.

if (/\p{Space}/) { # 26 different possible characters

print "The string has some whitespace.\n";

}

if (/\p{Digit}/) { # 411 different possible characters

print "The string has a digit.\n";

}

• More properties are listed in the perluniprops documentation.
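A minimal sketch of property matching, assuming three made-up strings: one with ASCII digits, one with Arabic-Indic digits (written as \x{...} escapes so the source file stays ASCII), and one with no digits at all.

```perl
use strict;
use warnings;

my $ascii  = "room 42";
my $arabic = "room \x{664}\x{662}";   # ARABIC-INDIC DIGIT FOUR and TWO
my $none   = "room";

# 1 if the string contains any character with the Digit property, else 0.
my @flags = map { /\p{Digit}/ ? 1 : 0 } ($ascii, $arabic, $none);

print "@flags\n";   # prints: 1 1 0
```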

Meta-characters

• The dot (.) is a wildcard character; it matches any single character except a newline.

/bet.y/ -> matches betty, betsy, bet=y, bet.y

doesn’t match bety or betsey.

• The dot always matches exactly one character.

• If you want the dot to match just a period, simply backslash it.

/3\.141/ -> matches 3.141596456

doesn’t match 3a141545

• If you mean a real backslash, use a pair of them.

$_ = 'a real \\ backslash';

if (/\\/) {

print "It matched!\n";

}
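The dot examples above can be checked with a short sketch like the following; the word lists are the ones from the slide, and the variable names are only illustrative.

```perl
use strict;
use warnings;

# Unescaped dot: exactly one character (any except newline) between "bet" and "y".
my @words    = qw(betty betsy bet=y bet.y bety betsey);
my @wildcard = grep { /bet.y/ } @words;

# Escaped dot: only a literal period will do.
my @literal  = grep { /3\.141/ } ("3.141596456", "3a141545");

print "@wildcard\n";   # prints: betty betsy bet=y bet.y
print "@literal\n";    # prints: 3.141596456
```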

Simple Quantifiers

• * -- zero or more occurrences

/fred\t*barney/ matches fredbarney, fred\tbarney, fred\t\tbarney

/fred.*barney/ matches fredbarney, fredabcd…barney

• + -- one or more occurrences

/fred\t+barney/ matches fred\tbarney, fred\t\tbarney

doesn’t match fredbarney

• ? -- zero or one occurrence

/bam-?bam/ matches bambam, bam-bam

doesn’t match bam-----bam
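A quick sketch that runs the three quantifiers against sample strings (the strings and variable names are chosen for illustration) and counts how many each pattern accepts:

```perl
use strict;
use warnings;

# * : zero or more tabs between the names (a space is not a tab).
my @star = grep { /fred\t*barney/ } ("fredbarney", "fred\tbarney", "fred barney");

# + : at least one tab is required.
my @plus = grep { /fred\t+barney/ } ("fredbarney", "fred\t\tbarney");

# ? : at most one hyphen between the two halves.
my @opt  = grep { /bam-?bam/ } ("bambam", "bam-bam", "bam-----bam");

printf "%d %d %d\n", scalar @star, scalar @plus, scalar @opt;   # prints: 2 1 2
```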

Grouping in Patterns

• Use parentheses (“( )”) to group parts of a pattern.

• So, parentheses are also meta-characters.

/fred+/ matches fredddd, fredd

/(fred)+/ matches fred, fredfred, fredfredfred

/(fred)*/ matches hello, barney, fred, fredfred (any string at all, since zero occurrences are allowed)

• Parentheses also make Perl store the matched text in the special variables $1, $2, and so on. The number denotes the capture group.

$_ = "perl version is 5.14";

if(/perl version is (.*)/) {

print $1; #prints 5.14

}
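As a sketch of captures in practice, a match in list context returns the captured groups, which is often tidier than reading $1 and $2 afterwards. The second pattern and the variable names are assumptions added for illustration.

```perl
use strict;
use warnings;

my $line = "perl version is 5.14";

# One group: the list assignment receives the text captured by (.*).
my ($version) = $line =~ /perl version is (.*)/;

# Two groups: numbered left to right, so $1 is the word and $2 the number.
my ($word, $number) = $line =~ /perl (.*) is (.*)/;

print "$version\n";         # prints: 5.14
print "$word $number\n";    # prints: version 5.14
```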

• Use back references to refer to text that you matched in the parentheses, called a capture group.

• You denote a back reference as a backslash followed by a number, like \1, \2, and so on.

$_ = "abba";

if (/(.)\1/) { # matches 'bb'

print "It matched same character next to itself!\n";

}

$_ = "yabba dabba doo";

if (/y(....) d\1/) {

print "It matched the same after y and d!\n";

}

$_ = "yabba dabba doo";

if (/y(.)(.)\2\1/) { # matches 'abba'

print "It matched after the y!\n";

}

• “How do I know which group gets which number?” Just count the opening parentheses from left to right, ignoring nesting.

$_ = "yabba dabba doo";

if (/y((.)(.)\3\2) d\1/) {

print "It matched!\n";

}
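To see the numbering rule at work, this sketch captures the three groups of the nested pattern into a list; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $text = "yabba dabba doo";

# Group 1 is the outer pair; groups 2 and 3 are the inner pairs, in order
# of their opening parentheses.
my @groups = $text =~ /y((.)(.)\3\2) d\1/;

print join(",", @groups), "\n";   # prints: abba,a,b
```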

• Consider the problem where you want to use a back reference next to a part of the pattern that is a number.

• In this regular expression, you want to use \1 to repeat the character you matched in the parentheses and follow that with the literal string 11:

$_ = "aa11bb";

if (/(.)\111/) {

print "It matched!\n";

}

Is that \1, \11, or \111?

• Starting with Perl 5.10, you can write \g{1} to disambiguate the back reference from the literal parts of the pattern:

use 5.010;

$_ = "aa11bb";

if (/(.)\g{1}11/) {

print "It matched!\n";

}
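A small sketch of the ambiguity. As the perlre documentation describes it, with only one capture group in the pattern Perl reads \111 as an octal escape (the letter I) instead of \1 followed by 11, so the unbraced pattern does not do what was intended. The test strings are made up for illustration.

```perl
use 5.010;
use strict;
use warnings;

my $text = "aa11bb";

# \111 is taken as octal 111, i.e. the character "I".
my $unbraced = $text =~ /(.)\111/     ? 1 : 0;
my $letter_i = "xI"  =~ /(.)\111/     ? 1 : 0;

# \g{1} is unambiguous: group 1, then the literal "11".
my $braced   = $text =~ /(.)\g{1}11/  ? 1 : 0;

print "$unbraced $letter_i $braced\n";   # prints: 0 1 1
```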

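• With the \g{N} notation, you can also use negative numbers, which count backward from the back reference’s own position: \g{-1} refers to the capture group just before it.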

use 5.010;

$_ = "xaa11bb";

if (/(.)(.)\g{-1}11/) {

print "It matched!\n";

}
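The sketch below compares relative and absolute group references on the same string; note that the minus sign must be a plain ASCII hyphen. The variable names are illustrative.

```perl
use 5.010;
use strict;
use warnings;

my $text = "xaa11bb";

# \g{-1} is the group just before the reference (here group 2).
my $relative = $text =~ /(.)(.)\g{-1}11/ ? 1 : 0;

# The same thing written with an absolute number.
my $absolute = $text =~ /(.)(.)\g{2}11/  ? 1 : 0;

# \g{-2} reaches back to group 1, which does not repeat in this string.
my $further  = $text =~ /(.)(.)\g{-2}11/ ? 1 : 0;

print "$relative $absolute $further\n";   # prints: 1 1 0
```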

Alternatives

• The vertical bar (|), often called “or” in this usage, means that if the part of the pattern on the left of the bar fails, the part on the right gets a chance to match.

/fred|barney|betty/ matches fred, barney, betty.

/fred( |\t)+barney/ matches if fred and barney are separated by spaces, tabs, or a mixture of the two.

/fred( +|\t+)barney/ matches if fred and barney are separated either only by spaces or only by tabs, not by a mixture of the two.

/fred (and|or) barney/ matches fred and barney, fred or barney. Same as pattern /fred and barney|fred or barney/.
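The difference between the two separator patterns can be sketched as follows, with made-up sample strings:

```perl
use strict;
use warnings;

my @samples = ("fred \t barney", "fred  barney", "fred\t\tbarney", "fredbarney");

# ( |\t)+ : each repetition may pick a space or a tab, so mixtures are fine.
my @mixed_ok = grep { /fred( |\t)+barney/ } @samples;

# ( +|\t+) : one choice for the whole separator, all spaces or all tabs.
my @uniform  = grep { /fred( +|\t+)barney/ } @samples;

printf "%d %d\n", scalar @mixed_ok, scalar @uniform;   # prints: 3 2
```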

Character Classes

• A character class is a list of possible characters inside square brackets.

• It matches just one single character, but that one character may be any of the ones you list in the brackets.

[abcwxyz] matches a, b, c, w, x, y, z (any of those seven characters)

• You may specify a range of characters with a hyphen (-).

[a-cw-z] matches any letter from a to c or from w to z

[a-zA-Z0-9] matches any ASCII alphanumeric character

$_ = "The HAL-9000 requires authorization to continue.";

if (/HAL-[0-9]+/) {

print "The string mentions some model of HAL computer.\n";

}
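A short sketch of character classes and ranges; capturing the digits with parentheses is an addition for illustration, as are the variable names.

```perl
use strict;
use warnings;

my $text = "The HAL-9000 requires authorization to continue.";

# A range inside a class, repeated with +, captured by the parentheses.
my ($model) = $text =~ /HAL-([0-9]+)/;

# [a-cw-z] accepts a single letter from a to c or from w to z.
my @in_range = grep { /[a-cw-z]/ } qw(a b d w z m);

print "$model\n";       # prints: 9000
print "@in_range\n";    # prints: a b w z
```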

Character Class Shortcuts

• Some character classes appear so frequently that they have shortcuts.

• The character class for any digit is abbreviated as \d.

use 5.010; # enables say

$_ = 'The HAL-9000 requires authorization to continue.';

if (/HAL-[\d]+/) {

say 'The string mentions some model of HAL computer.';

}

• However, there are many more digits than the 0 to 9 that you may expect from ASCII, so that will also match HAL-٩٠٠٠

• Recognizing this problematic shift from ASCII to Unicode, Perl 5.14 added the /a modifier; placing it at the end of the match operator tells Perl to use the old ASCII interpretation.

• \s matches any whitespace, which is almost the same as the Unicode property \p{Space}

• \h matches only horizontal whitespace.

• \v matches only vertical whitespace.

• Taken together, \h and \v are the same as \p{Space}.

• The \R shortcut, introduced in Perl 5.10, matches any sort of line break, independent of operating system.

• \w matches the set of characters [a-zA-Z0-9_] under ASCII rules; with Unicode semantics it matches many more word characters.
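A sketch of how /a changes \d, and of \R matching different line endings. It assumes Perl 5.14 or later; the Arabic-Indic digits are written as \x{...} escapes, and the variable names are illustrative.

```perl
use 5.014;
use strict;
use warnings;

my $arabic = "HAL-\x{669}\x{660}\x{660}\x{660}";   # HAL- followed by Arabic-Indic 9000

my $unicode_digits = $arabic =~ /HAL-\d+/  ? 1 : 0;   # \d accepts any Unicode digit
my $ascii_digits   = $arabic =~ /HAL-\d+/a ? 1 : 0;   # /a restricts \d to 0-9

# \R matches a Unix newline as well as a Windows CR LF pair.
my @line_breaks = grep { /one\Rtwo/ } ("one\ntwo", "one\r\ntwo", "one two");

say "$unicode_digits $ascii_digits ", scalar @line_breaks;   # prints: 1 0 2
```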

Negating the Shortcuts

• To specify the characters you want to leave out, rather than the ones to include, use a caret (^).

• A caret (^) at the start of a character class (i.e., just inside the opening square bracket) negates the class.

[^def] matches any single character except one of those three.

[^n\-z] matches any character except for n, hyphen, or z.

• To negate a shortcut, use its uppercase version.

\S matches any non-space

\D matches any non-digit

[\d\D] matches any digit or any non-digit, i.e., any character at all

[^\d\D] matches anything that’s not either a digit or a non-digit. i.e., nothing!
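The negated forms can be tried out with a sketch like this one; the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# \D needs at least one non-digit character somewhere in the string.
my @has_non_digit = grep { /\D/ } ("12345", "12a45", "");

# [\d\D] accepts any single character, even a newline, but not an empty string.
my @any_char = grep { /[\d\D]/ } ("x", "\n", "");

# [^\d\D] can never match: no character is neither a digit nor a non-digit.
my @nothing = grep { /[^\d\D]/ } ("x", "7", "\n");

printf "%d %d %d\n", scalar @has_non_digit, scalar @any_char, scalar @nothing;   # prints: 1 2 0
```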

Matches with m//

• We put patterns in pairs of forward slashes, like /fred/, but this is actually a shortcut for m//, the pattern match operator.

• We may choose any pair of delimiters to quote the contents.

m(fred), m<fred>, m{fred}, m[fred], m,fred,, m!fred!, m^fred^

• The shortcut is that if you choose the forward slash as the delimiter, you may omit the initial m.

• Wisely choose a delimiter that doesn’t appear in your pattern.

m%http://% instead of /http:\/\// to match the initial "http://".
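As a sketch, the three matches below are equivalent and differ only in delimiter; the URL is a placeholder chosen for illustration.

```perl
use strict;
use warnings;

my $url = "http://www.example.com/";   # hypothetical URL

my $slashes = $url =~ /http:\/\// ? 1 : 0;   # every slash must be escaped
my $percent = $url =~ m%http://%  ? 1 : 0;   # no escaping needed
my $braces  = $url =~ m{http://}  ? 1 : 0;   # paired delimiters also work

print "$slashes $percent $braces\n";   # prints: 1 1 1
```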

Match Modifiers

• Case-Insensitive Matching with /i

$_ = "Is Freddy there?";

if (/freddy/i) {

print "Yes Freddy is here\n";

}

• Matching Any Character with /s

– The /s modifier makes the dot (.) match any character, including a newline.

– In effect, every dot in the pattern behaves like the character class [\d\D], which matches anything.

– The difference only shows when the string contains newline characters.

$_ = "I saw Barney\ndown at the bowling alley\nwith Fred\nlast night.\n";

if (/Barney.*Fred/s) {

print "That string mentions Fred after Barney!\n";

}

• Without the /s modifier, that match would fail, since the two names aren’t on the same line.

• What if you still want to match any character except a newline? You could use the character class [^\n], or the shortcut \N, added in Perl 5.12, which means the complement of \n.
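A sketch comparing the same pattern with and without /s, plus \N, which refuses a newline regardless of modifiers. It assumes Perl 5.12 or later for \N; the variable names are illustrative.

```perl
use 5.012;
use strict;
use warnings;

my $text = "I saw Barney\ndown at the bowling alley\nwith Fred\nlast night.\n";

my $without_s = $text =~ /Barney.*Fred/  ? 1 : 0;   # dot stops at the newline
my $with_s    = $text =~ /Barney.*Fred/s ? 1 : 0;   # dot crosses newlines

# \N never matches a newline, even when /s is in effect.
my $n_newline = "a\nb" =~ /a\Nb/s ? 1 : 0;
my $n_other   = "a-b"  =~ /a\Nb/s ? 1 : 0;

say "$without_s $with_s $n_newline $n_other";   # prints: 0 1 0 1
```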

• Many other modifiers are described in the perlop documentation. A few are covered below.

• Adding Whitespace with /x

– The /x modifier allows you to add arbitrary whitespace to a pattern, in order to make it easier to read.

/-?[0-9]+\.?[0-9]*/ # what is this doing?

/ -? [0-9]+ \.? [0-9]* /x # a little better

– Because /x allows whitespace inside the pattern, Perl ignores literal space or tab characters within the pattern.

– Use a backslashed space, \t, or (more commonly) \s, \s*, or \s+ when you want to match whitespace.

– Perl considers comments a type of whitespace, so you can put comments into that pattern to tell what you are trying to do:

/

-? # an optional minus sign

[0-9]+ # one or more digits before the decimal point

\.? # an optional decimal point

[0-9]* # some optional digits after the decimal point

/x # end of string

– Use the escaped character, \#, or the character class, [#], if you need to match a literal pound sign, since an unescaped # starts a comment.

/

[0-9]+ # one or more digits before the decimal point

[#] # literal pound sign

/x # end of string

– Be careful not to include the closing delimiter inside the comments, or it will prematurely terminate the pattern. This pattern ends before you think it does:

/

-? # with / without - <--- OOPS!

[0-9]+ # one or more digits before the decimal point

\.? # an optional decimal point

[0-9]* # some optional digits after the decimal point

/x # end of string
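A runnable sketch of /x, using m{ } as the delimiter so that a slash inside a comment cannot end the pattern early. The sample strings and variable names are assumptions for illustration.

```perl
use strict;
use warnings;

my @found;
for my $candidate ("-3.14", "42", "abc") {
    push @found, $candidate if $candidate =~ m{
        -?        # an optional minus sign
        [0-9]+    # one or more digits before the decimal point
        \.?       # an optional decimal point
        [0-9]*    # some optional digits after the decimal point
    }x;
}

# Under /x, a literal space in the pattern is ignored; \s matches real whitespace.
my $ignored = "a b"     =~ / a b /x      ? 1 : 0;   # pattern is effectively /ab/
my $matched = "a b"     =~ / a \s b /x   ? 1 : 0;
my $pound   = "item #7" =~ / [#] [0-9] /x ? 1 : 0;  # [#] is a literal pound sign

print "@found\n";                       # prints: -3.14 42
print "$ignored $matched $pound\n";     # prints: 0 1 1
```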

Combining Option Modifiers

• If you want to use more than one modifier on the same match, just put them all at the end (their order isn’t significant).

if (/barney.*fred/is) { # both /i and /s

print "That string mentions Fred after Barney!\n";

}

Or as a more expanded version with comments:

if (m{

barney # the little guy

.* # anything in between

fred # the loud guy

}isx) { # all three of /s and /i and /x

print "That string mentions Fred after Barney!\n";

}
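To see why both modifiers are needed, this sketch tries each combination on a made-up string where the names differ in case and sit on different lines.

```perl
use strict;
use warnings;

my $text = "BARNEY went bowling\nwith fred.\n";   # hypothetical sample

my $neither = $text =~ /barney.*fred/   ? 1 : 0;
my $only_i  = $text =~ /barney.*fred/i  ? 1 : 0;   # case fixed, newline still blocks
my $only_s  = $text =~ /barney.*fred/s  ? 1 : 0;   # newline fixed, case still blocks
my $both    = $text =~ /barney.*fred/is ? 1 : 0;
my $swapped = $text =~ /barney.*fred/si ? 1 : 0;   # order of modifiers is irrelevant

print "$neither $only_i $only_s $both $swapped\n";   # prints: 0 0 0 1 1
```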

Misc

• The trick with a good pattern is to not match more than you ever mean to match.
