REGULAR EXPRESSION IN PERL (PART 1)
Thach Nguyen
OBJECTIVE
WHAT IS A REGULAR EXPRESSION (REGEX, REGEXP)?
A big factor behind the fame of Perl
A string that describes a pattern
Examples of patterns:
  Search engines finding web pages (Google)
  Listing files in a directory (ls *.txt, dir *.*)
  Searching, extracting parts of strings, search and replace (Microsoft Word)
Efficient and flexible for manipulating text
Not as difficult to understand as its reputation suggests
Constructed from simple concepts (conditionals, loops)
Once you get used to the terse notation, you're good to go
HOW TO USE REGEX
Part 1: basics (solves about 98% of your needs)
  Simple word matching
  Using character classes
  Matching this or that
Part 2: power tools (for the rest)
  Advanced regex operators
  Latest innovations
PART 1: THE BASICS
Simple word matching
The simplest regex is a word: a string of characters
It matches any string that contains that word
Eg:
Result: It matches
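A minimal sketch of simple word matching. The "Hello World" sample string and the variable names are assumptions chosen for illustration, not taken from the slide:

```perl
use strict;
use warnings;

# A regex that is just a word matches any string containing that word.
my $text = "Hello World";

my $has_world = ($text =~ /World/) ? 1 : 0;   # "World" occurs in the string
my $has_lower = ($text =~ /world/) ? 1 : 0;   # matching is case-sensitive by default
my $has_space = ($text =~ /o W/)   ? 1 : 0;   # a space is an ordinary character

print "It matches\n" if $has_world;
print "World: $has_world, world: $has_lower, 'o W': $has_space\n";
```

The lowercase pattern fails because every character in the regex, including its case, must appear in the string exactly as written.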
PART 1: THE BASICS
Simple word matching
Operators
  =~ : returns true if the regex matches
  !~ : returns true if the regex does not match
  / ... / : delimiters enclosing the pattern to search for (a literal string or a variable holding one)
Eg: $greeting = "World";
    if ("Hello World" =~ /$greeting/) { ... }
Other arbitrary delimiters: allowed with a leading m, eg m!World! or m{World}
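The operators above can be sketched as follows. The sample strings and variable names are illustrative assumptions:

```perl
use strict;
use warnings;

my $greeting = "World";

# =~ is true when the pattern matches; $greeting is interpolated first.
my $match     = ("Hello World" =~ /$greeting/) ? 1 : 0;

# !~ is true when the pattern does NOT match.
my $no_match  = ("Hello World" !~ /Perl/) ? 1 : 0;

# With a leading m, other delimiters can replace the slashes.
# This is handy when the pattern itself contains '/'.
my $alt_brace = ("/usr/bin/perl" =~ m{/bin/perl}) ? 1 : 0;
my $alt_bang  = ("Hello World" =~ m!World!) ? 1 : 0;

print "=~: $match, !~: $no_match, m{}: $alt_brace, m!!: $alt_bang\n";
```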
PART 1: THE BASICS
Simple word matching – Additional
Can use the default variable $_ , then omit the "$_ =~" part
Eg: $_ = "Hello World";
    if (/World/) { ... }
If the regex matches in more than one place, the earliest point is matched
Eg: "Hello World" =~ /o/; # matches 'o' in 'Hello'
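A short sketch of both points, assuming the same sample string. The built-in array @- records where the last successful match started, which makes the "earliest point" rule visible:

```perl
use strict;
use warnings;

# With $_ set, a bare /.../ is shorthand for $_ =~ /.../.
my $default_hit = 0;
for ("Hello World") {            # "for" aliases $_ to the string
    $default_hit = 1 if /World/;
}

# Several possible match positions: the leftmost one wins.
my $first_o = -1;
if ("Hello World" =~ /o/) {
    $first_o = $-[0];            # start offset of the last successful match
}

print "matched via \$_: $default_hit, first 'o' at offset $first_o\n";
```

Offset 4 is the 'o' in 'Hello'; the 'o' in 'World' sits at offset 7 and is never reached.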
PART 1: THE BASICS
Simple word matching – Special characters
Metacharacters: {}[]()^$.|*+?\
  Use a backslash \ to match a metacharacter literally
Escape sequences
  ASCII characters (\n, \t, etc.), arbitrary bytes (octal, hexadecimal)
Variables: substituted before matching
Eg: $foo = 'house';
    'cathouse' =~ /cat$foo/; # matches
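The three ideas on this slide (escaped metacharacters, escape sequences, variable interpolation) can be sketched like this. All sample strings are assumptions picked for illustration:

```perl
use strict;
use warnings;

# A backslash makes a metacharacter match itself.
my $plus      = ("2+2=4" =~ /2\+2/) ? 1 : 0;              # literal '+'
my $slash     = ("/usr/bin/perl" =~ /\/usr\/bin/) ? 1 : 0; # literal '/'

# Unescaped, '+' is a quantifier: "one or more 2s, then another 2".
my $unescaped = ("2+2=4" =~ /2+2/) ? 1 : 0;

# Escape sequences: a tab, and bytes given in octal (\o{143} is 'c')
# and in hexadecimal (\x61 is 'a').
my $tab   = ("1000\t2000" =~ /0\t2/) ? 1 : 0;
my $bytes = ("cat" =~ /\o{143}\x61t/) ? 1 : 0;

# Variables are substituted into the regex before matching.
my $foo    = 'house';
my $interp = ('cathouse' =~ /cat$foo/)   ? 1 : 0;
my $braces = ('housecat' =~ /${foo}cat/) ? 1 : 0;   # braces delimit the name

print "plus=$plus slash=$slash unescaped=$unescaped ",
      "tab=$tab bytes=$bytes interp=$interp braces=$braces\n";
```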
PART 1: THE BASICS
Simple word matching – Special characters
Anchor metacharacters: ^ and $ match at the beginning and the end of the string
Overall: this only scratches the surface of regex technology
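A small sketch of the anchors, assuming "housekeeper" as the sample string:

```perl
use strict;
use warnings;

my $anywhere  = ("housekeeper" =~ /keeper/)    ? 1 : 0;  # unanchored: found inside
my $at_start  = ("housekeeper" =~ /^keeper/)   ? 1 : 0;  # string does not start with it
my $at_end    = ("housekeeper" =~ /keeper$/)   ? 1 : 0;  # string ends with it
my $before_nl = ("housekeeper\n" =~ /keeper$/) ? 1 : 0;  # $ also matches before a final newline

# Using both anchors forces the regex to cover the whole string.
my $whole     = ("keeper" =~ /^keeper$/)        ? 1 : 0;
my $not_whole = ("keeper keeper" =~ /^keeper$/) ? 1 : 0;

print "anywhere=$anywhere start=$at_start end=$at_end ",
      "before_nl=$before_nl whole=$whole not_whole=$not_whole\n";
```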
PART 1: THE BASICS
Using character classes:
A set of possible characters, any one of which may match at a particular point in the regex
Denoted by brackets [ ... ]
Eg: /item[0123456789]/; # matches 'item0' or ... or 'item9'
    "abc" =~ /[cab]/; # matches 'a'
To match 'yes' in a case-insensitive way (yes, Yes, YES):
    /[yY][eE][sS]/
    /yes/i (i : case-insensitive, a modifier of the matching operation)
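The character-class examples above can be tried out as follows. The word lists are assumptions added for illustration:

```perl
use strict;
use warnings;

# A class matches exactly one character out of the set.
my @items = grep { /item[0123456789]/ } qw(item0 item7 itemX items);

# The class [cab] is tried at the earliest position, so 'a' is what matches.
my $first = ("abc" =~ /([cab])/) ? $1 : '';

# Two ways to match "yes" regardless of case.
my @yes_class = grep { /[yY][eE][sS]/ } qw(yes Yes YES no);
my @yes_flag  = grep { /yes/i }         qw(yes Yes YES no);

print "items: @items\n";
print "first: $first\n";
print "class: @yes_class\n";
print "flag:  @yes_flag\n";
```

The /i form is shorter, and it also accepts mixed spellings such as "yEs" without having to spell out every letter pair.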
PART 1: THE BASICS
Using character classes – Special characters:
Special characters inside a class: - ] \ ^ $
Each needs a backslash to be matched literally
  ]  ends the character class
  $  introduces a scalar variable
  \  introduces an escape sequence
  -  range operator within a character class
  ^  negates the character class when it comes first
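A sketch showing each special character inside a class, both with its special meaning and escaped. The sample words are illustrative assumptions:

```perl
use strict;
use warnings;

# ] ends the class unless escaped.
my $bracket = ("]def" =~ /[\]c]def/) ? 1 : 0;        # class is ']' or 'c'

# $ interpolates a variable unless escaped.
my $x = 'bcr';
my @interp  = grep { /^[$x]at$/ }  qw(bat cat rat xat);    # class is b, c or r
my @literal = grep { /^[\$x]at$/ } ('$at', 'xat', 'bat');  # class is '$' or 'x'

# - builds a range unless escaped.
my @range  = grep { /^[a-c]$/ }  qw(a b c d -);      # a through c
my @hyphen = grep { /^[a\-c]$/ } qw(a b c -);        # 'a', '-' or 'c'

# ^ in first position negates the class.
my @negated = grep { /^[^a]at$/ } qw(aat bat cat);   # anything but 'a', then "at"

print "bracket=$bracket interp=(@interp) literal=(@literal)\n";
print "range=(@range) hyphen=(@hyphen) negated=(@negated)\n";
```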
PART 1: THE BASICS
Using character classes – Special characters:
Several abbreviations for common character classes
  \d  a digit, represents [0-9]
  \s  a whitespace character, represents [\ \t\r\n\f]
  \w  a word character (alphanumeric or _), represents [0-9a-zA-Z_]
  \D  negated \d
  \S  negated \s
  \W  negated \w
  .   any character but "\n"
  \b  matches a boundary between a word character and a non-word character (\w\W or \W\w)
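The abbreviations can be sketched as below. The sample strings are assumptions; the offsets stored in %start come from the built-in @- array and show which occurrence of "cat" each \b variant picks:

```perl
use strict;
use warnings;

my $time    = ("12:34:56" =~ /^\d\d:\d\d:\d\d$/) ? 1 : 0;  # hh:mm:ss digits
my $mixed   = ("room 7" =~ /[\d\s]/) ? 1 : 0;               # a digit or whitespace
my $word    = ("a-b" =~ /\w\W\w/)    ? 1 : 0;               # word, non-word, word
my $dot     = ("part" =~ /^..rt$/)   ? 1 : 0;               # any two characters, then "rt"
my $dot_nl  = ("a\nb" =~ /a.b/)      ? 1 : 0;               # '.' does not match "\n"
my $lit_dot = ("the end." =~ /end\./) ? 1 : 0;              # escaped '.' is a literal period

# \b is a zero-width boundary between \w and \W (or the string edge).
my $x = "Housecat catenates house and cat";
my %start;
$start{plain} = $-[0] if $x =~ /cat/;       # 'cat' in 'Housecat'
$start{left}  = $-[0] if $x =~ /\bcat/;     # 'cat' in 'catenates'
$start{right} = $-[0] if $x =~ /cat\b/;     # 'cat' in 'Housecat'
$start{both}  = $-[0] if $x =~ /\bcat\b/;   # the final word 'cat'

print "time=$time mixed=$mixed word=$word dot=$dot dot_nl=$dot_nl lit_dot=$lit_dot\n";
print "$_ starts at $start{$_}\n" for qw(plain left right both);
```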
PART 1: THE BASICS
Issues:
Why does '.' match everything but "\n"?
  We often want to ignore newline characters when counting and matching characters on a line
  If we want to keep track of newlines: anchors ^ and $, modifiers /.../s (single line) and /.../m (multiple lines)
No modifier (//)
  '.' matches any character except "\n"
  ^ matches only at the beginning of the string; $ matches only at the end, or before a newline at the end
s modifier (//s)
  Treats the string as a single long line
  '.' matches any character, even "\n"
  ^ and $ behave as with no modifier
m modifier (//m)
  Treats the string as a set of multiple lines
  '.' matches any character except "\n"
  ^ and $ match at the start or end of any line within the string
Both (//sm)
  Treats the string as a single long line, but detects multiple lines
  '.' matches any character, even "\n"
  ^ and $ match at the start and end of any line within the string
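The four modifier combinations can be compared side by side. This is a sketch assuming a two-line sample string; the hash names %caret and %dot are illustrative:

```perl
use strict;
use warnings;

my $x = "There once was a girl\nWho programmed in Perl\n";

# Does ^ match at the start of the second line?
my %caret = (
    'none' => ($x =~ /^Who/)   ? 1 : 0,   # ^ only at start of string
    's'    => ($x =~ /^Who/s)  ? 1 : 0,   # /s does not change ^
    'm'    => ($x =~ /^Who/m)  ? 1 : 0,   # ^ at start of every line
    'sm'   => ($x =~ /^Who/sm) ? 1 : 0,
);

# Does '.' match the newline between "girl" and "Who"?
my %dot = (
    'none' => ($x =~ /girl.Who/)   ? 1 : 0,   # '.' stops at "\n"
    's'    => ($x =~ /girl.Who/s)  ? 1 : 0,   # '.' matches "\n"
    'm'    => ($x =~ /girl.Who/m)  ? 1 : 0,   # /m does not change '.'
    'sm'   => ($x =~ /girl.Who/sm) ? 1 : 0,
);

for my $mod (qw(none s m sm)) {
    print "$mod: ^Who=$caret{$mod} girl.Who=$dot{$mod}\n";
}
```

The output shows that /m only affects the anchors and /s only affects '.', which is why the two can be combined freely.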
PART 1: THE BASICS
Matching this or that:
Able to match different possible words or strings
Uses the alternation metacharacter |
Eg:
  "cats and dogs" =~ /dog|cat|bird/; # matches "cat"
  "cats" =~ /cats|cat|ca|c/; # matches "cats"
QUESTION