perl 101: regular expressions - meetupfiles.meetup.com/501101/perl 101- regular...

Post on 17-Apr-2020


Perl 101: Regular Expressions
Alan Voss, Perl Hacker

A) Black magic?
B) A form of wizardry?
C) A (mostly) predictable mini language for detecting patterns in text and manipulating text in an iterative fashion
D) Platform and language independent (for the most part): Perl, JavaScript, Java, PHP, grep -e, etc.
E) The first argument to Perl's split()
F) All of the above

What are Regular Expressions?

/The/

● Matches any of:
  ○ The
  ○ Their
  ○ Thesis
  ○ anaEsThesia
  ○ ectoThermiC
  ○ AbsinThe

A Basic Example

/tru/i (i: modifier for case insensitivity)

● Matches any of:
  ○ True
  ○ truth
  ○ altruism
  ○ constRUcted
  ○ obStrUctEd
  ○ Restructure

Building on the Basic Example

We can match basic words using a "pattern" of just the word.

But what if we need to match something more interesting?

So far, not very interesting.

/the|tru/i

● Matches any of:
  ○ The
  ○ true
  ○ anaesThesia
  ○ constRUcted
  ○ obStrUctEd
  ○ absinThe

Building on the Basic Example

/chee(z|se)burger/i

● Matches any of:
  ○ cheezburger (sic)
  ○ cheeseburger
  ○ CHEESEburger

Building on the Basic Example

/t(he|ru)/i

● (Still) matches any of:
  ○ The
  ○ true
  ○ anaesThesia
  ○ constRUcted
  ○ obStrUctEd
  ○ absinThe

Building on the Basic Example
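As a rough sketch of the patterns so far, the Perl below filters a small word list with each regex. The word list and the helper name matching_words are made up for illustration and are not part of the original slides.

```perl
use strict;
use warnings;

# Hypothetical helper: return the words from a list that match a pattern.
sub matching_words {
    my ($re, @words) = @_;
    return grep { $_ =~ $re } @words;
}

# Illustrative word list; "banana" is there as a deliberate non-match.
my @words = qw(The true anaesThesia constRUcted absinThe cheezburger CHEESEburger banana);

# /the|tru/i and /t(he|ru)/i accept the same words;
# the second form just factors out the shared leading "t".
my @alt     = matching_words(qr/the|tru/i,          @words);
my @grouped = matching_words(qr/t(he|ru)/i,         @words);
my @burgers = matching_words(qr/chee(z|se)burger/i, @words);

print "alternation: @alt\n";
print "grouped:     @grouped\n";
print "burgers:     @burgers\n";
```

Building the patterns with qr// keeps each regex as a reusable value, which makes it easy to compare two spellings of the same idea side by side.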

● Regular expressions are string iterator instructions.

● Matching always starts at the beginning of the string.

● Matching continues until total success or partial failure.

● On failure, the engine backtracks until no success is possible from the current starting position; once those options are exhausted, the cursor advances one character and matching starts again.

How does it work?

The
● T matches T at start of regex
● h matches next character in regex
● e matches the one after that

Success!

Back to the /The/ example:

ectoThermic
● e does not match T at start of regex
  ○ fail, advance, start over
● c does not match T at start of regex
  ○ fail, advance, start over
● t does not match T at start of regex
  ○ fail, advance, start over
● o does not match T at start of regex
  ○ fail, advance, start over

Back to the /The/ example:

ectoThermic (continued)
● T matches T at start of regex
● h matches next character in regex
● e matches the one after that

The rest of the string is ignored.

Success!

Back to the /The/ example:

True
● T matches T at start of regex
● r does not match next character h
  ○ fail, advance, backtrack regex, start over
● r does not match T at start of regex
  ○ fail, advance, start over
● u does not match T at start of regex
  ○ fail, advance, start over
● e does not match T at start of regex
  ○ fail, can't advance, done.

Failure.

Back to the /The/ example:
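The three walkthroughs can be checked from Perl itself. This is a minimal sketch: after a successful match, the built-in array @- holds the offset where the match began, which shows how far the cursor had to advance. The helper name match_offset is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: offset at which /The/ first succeeds, or -1 on failure.
sub match_offset {
    my ($text) = @_;
    # $-[0] is the start offset of the most recent successful match.
    return $text =~ /The/ ? $-[0] : -1;
}

for my $word (qw(The ectoThermic True)) {
    my $at     = match_offset($word);
    my $report = $at >= 0 ? "matched at offset $at" : "no match";
    print "$word: $report\n";
}
```

"The" succeeds at offset 0, "ectoThermic" only after four failed starting positions (offset 4), and "True" runs out of starting positions without a match.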

Special Characters

. matches any character but "\n" (except with modifiers)

* 0 or more of the preceding character, class, or sub-expression

+ 1 or more of the preceding character, class, or sub-expression

? 0 or 1 of the preceding character, class, or sub-expression

{n,m} minimum of n, maximum of m (m optional) of the preceding character, class, or sub-expression

| or

[.....] denotes character class

(.....) denotes sub regular expression or special / extended syntaxes

\ escape any of these symbols, including itself

\Q ... \E escape all special characters afterward (to \E, if present)
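A short sketch of a few of these special characters in action. The sample strings are invented for illustration; the point is how . and \Q ... \E differ when a variable is interpolated into a pattern, and how {n,m} bounds a repeat.

```perl
use strict;
use warnings;

my $literal = 'a.b';

# Interpolated as-is, the "." is still a metacharacter and matches any character.
my $unescaped = 'axb' =~ /^$literal$/     ? 1 : 0;   # 1
# Between \Q and \E the "." is escaped, so only a real dot matches.
my $escaped   = 'axb' =~ /^\Q$literal\E$/ ? 1 : 0;   # 0
my $exact     = 'a.b' =~ /^\Q$literal\E$/ ? 1 : 0;   # 1

# {2,3} accepts two or three of the preceding character, no fewer and no more.
my @bounded = map { $_ =~ /^ab{2,3}c$/ ? 1 : 0 } qw(abc abbc abbbc abbbbc);

# ? makes the preceding character optional.
my @optional = grep { $_ =~ /^colou?r$/ } qw(color colour colouur);

print "unescaped=$unescaped escaped=$escaped exact=$exact\n";
print "bounded: @bounded\n";
print "optional: @optional\n";
```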

Special Characters (zero width)

\A beginning of string

\z absolute end of string

\Z end of string, or just before a final newline

\G start matching where the previous global match stopped

\b matches a word boundary

^ beginning of string; with the /m modifier, also the beginning of each line

$ end of string (or just before a final newline); with the /m modifier, also the end of each line

(?=) positive lookahead

(?!) negative lookahead

(?<=) positive lookbehind

(?<!) negative lookbehind

None of these advance the cursor; they just test in place.
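To make "test in place" concrete, here is a small sketch using a made-up input line. It contrasts \z with \Z and shows that a lookahead or lookbehind checks its surroundings without becoming part of the match.

```perl
use strict;
use warnings;

my $line = "price: 100 USD\n";   # illustrative input with a trailing newline

# \z demands the absolute end of the string; \Z tolerates one final newline.
my $strict_end  = $line =~ /USD\z/ ? 1 : 0;   # 0, the "\n" still follows USD
my $lenient_end = $line =~ /USD\Z/ ? 1 : 0;   # 1

# The lookahead requires " USD" to follow but does not consume it,
# so only the digits end up in the capture.
my ($amount) = $line =~ /(\d+)(?= USD)/;

# The lookbehind requires ": " just before the current position.
my ($after_colon) = $line =~ /(?<=: )(\w+)/;

# \b sits between a word character and a non-word character (or string edge).
my $standalone = 'cat concat' =~ /\bcat\b/ ? 1 : 0;   # 1, the first "cat"
my $embedded   = 'concat'     =~ /\bcat\b/ ? 1 : 0;   # 0, no boundary before "cat"

print "strict=$strict_end lenient=$lenient_end amount=$amount after_colon=$after_colon\n";
print "standalone=$standalone embedded=$embedded\n";
```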

/^(ab)?normal$/ # normal, abnormal

/^(stig|fer)ma(ta)?$/ # stigma, stigmata, fermata (and also ferma)

/^(ma){1,2}$/ # ma, mama

Examples

/^ba(na){2}$/ # banana

/^(kn[ia]ck){1,2}$/ # knick, knack, knickknack (knickknick, knackknack)

/^(angio|rhino|osteo|neo)plasty$/ # various surgeries

More Examples

/^br+$/ # indicating varying degrees of coldness

/^[0-9]{3}-[0-9]{2}-[0-9]{4}$/ # social security number (character classes)

/^[0-9]{1,3}(,[0-9]{3})*$/ # a number in the form of 10,201,231

/^(.+).?(??{reverse $1})$/ # a palindrome (awake...?)

More Examples
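Several of the example patterns can be exercised directly. The sketch below pairs patterns copied from the slides with inputs chosen for illustration, including two that should fail because of the anchors and quantifier limits.

```perl
use strict;
use warnings;

# Each entry: [ pattern from the slides, input to try ].
my @cases = (
    [ qr/^(ab)?normal$/,                'abnormal'    ],
    [ qr/^(ma){1,2}$/,                  'mamama'      ],   # three repeats: too many
    [ qr/^ba(na){2}$/,                  'banana'      ],
    [ qr/^(kn[ia]ck){1,2}$/,            'knickknack'  ],
    [ qr/^br+$/,                        'brrrrr'      ],
    [ qr/^[0-9]{3}-[0-9]{2}-[0-9]{4}$/, '123-45-6789' ],
    [ qr/^[0-9]{1,3}(,[0-9]{3})*$/,     '10,201,231'  ],
    [ qr/^[0-9]{1,3}(,[0-9]{3})*$/,     '1020,1231'   ],   # misplaced comma
);

# 1 for a match, 0 for a failure, in the same order as @cases.
my @results = map { $_->[1] =~ $_->[0] ? 1 : 0 } @cases;

for my $i (0 .. $#cases) {
    my $verdict = $results[$i] ? 'matches' : 'does not match';
    print "$cases[$i][1] $verdict\n";
}
```

Because every pattern is anchored with ^ and $, an input such as "mamama" fails outright instead of matching on a substring.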

Character Classes (use ranges)

Character class   AKA *   Denotes                                    Opposite (not), uses leading ^   Opposite AKA *
[ACGT]                    any DNA nucleotide base                    [^ACGT]
[0-9]             \d      any single digit                           [^0-9]                           \D
[A-Za-z]                  any uppercase or lowercase ASCII letter    [^A-Za-z]
[A-Za-z0-9_]      \w      any "word" character                       [^A-Za-z0-9_]                    \W
[\t\r\n\f ]       \s      any whitespace character                   [^\t\r\n\f ]                     \S

* not available in all implementations, but definitely in Perl!
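A brief sketch of these classes in use, with invented sample strings. It shows an explicit class, its negation, and the shorthand forms.

```perl
use strict;
use warnings;

my $dna = 'GATTACA';
my $bad = 'GATXACA';   # illustrative input containing one invalid letter

# An explicit class, repeated and anchored, validates the whole string.
my $dna_ok = $dna =~ /^[ACGT]+$/ ? 1 : 0;   # 1
my $bad_ok = $bad =~ /^[ACGT]+$/ ? 1 : 0;   # 0

# The negated class finds the first character that does not belong.
my ($offender) = $bad =~ /([^ACGT])/;

# Shorthand classes: \d for digits, \s for whitespace, \W for non-word characters.
my @numbers  = 'room 101, floor 7' =~ /\d+/g;
my @tokens   = split /\s+/, "tab\tand  space";
my $non_word = () = 'a-b c_d' =~ /\W/g;   # counts matches: "-" and " "

print "dna_ok=$dna_ok bad_ok=$bad_ok offender=$offender\n";
print "numbers: @numbers\n";
print "tokens: @tokens\n";
print "non-word characters: $non_word\n";
```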

Greed

* + ? {n,m} {n,} are all greedy

They always match as many as they can.

When the rest of the regex fails, a greedy quantifier gives back what it matched one character at a time, until either there is nothing more to give back or the whole regex succeeds.

Greed

The opposite of greed.

Adding a ? to the quantifier will make it match the minimum required.

A*? might match zero As, even though a greedy A* could match many, many more.

Reluctance or Parsimony
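A minimal sketch of the difference, using a made-up string. Note the third case: a reluctant quantifier still grows as far as it must for the rest of the pattern to succeed.

```perl
use strict;
use warnings;

my $text = 'AAAAB';   # illustrative input

# Greedy: take as many As as possible.
my ($greedy) = $text =~ /(A*)/;

# Reluctant: take as few as possible, which here is none at all.
my ($reluctant) = $text =~ /(A*?)/;

# Reluctant, but the following B forces it to extend across every A.
my ($forced) = $text =~ /(A*?)B/;

print "greedy='$greedy' reluctant='$reluctant' forced='$forced'\n";
```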

Say you wanted to get the information between quotes in this sentence:

$a = 'The man said, "Heck if I know!"';

You could match that with the following:

/"(.+)"/

Greed vs Reluctance/Parsimony

But what about this sentence:

$a = 'The man said, "Heck if I know!" and then she answered, "Oh, but I do!"';

Will this work anymore?

/"(.+)"/

Greed vs Reluctance/Parsimony

Nope. The .+ is greedy, and will swallow everything from the first quote to the very last one, even though you didn't mean to capture all of that.

$a = 'The man said, "Heck if I know!" and then she answered, "Oh, but I do!"';

/"(.+)"/

Greed vs Reluctance/Parsimony

You could use a reluctant quantifier, and that would match just the first one:

$a = 'The man said, "Heck if I know!" and then she answered, "Oh, but I do!"';

/"(.+?)"/

Greed vs Reluctance/Parsimony

Or use the /g global modifier to match both:

$a = 'The man said, "Heck if I know!" and then she answered, "Oh, but I do!"';

/"(.+?)"/g

Greed vs Reluctance/Parsimony

Or, you could say what you mean (in some cases this is faster, and other times it is slower):

$a = 'The man said, "Heck if I know!" and then she answered, "Oh, but I do!"';

/"([^"]+)"/g

Greed vs Reluctance/Parsimony
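The four variations on the quote example can be run side by side. This sketch uses the sentence from the slides; the variable is named $said here instead of $a only to stay clear of the $a that sort uses.

```perl
use strict;
use warnings;

my $said = 'The man said, "Heck if I know!" and then she answered, "Oh, but I do!"';

# Greedy: runs from the first quote to the very last one.
my ($greedy) = $said =~ /"(.+)"/;

# Reluctant: stops at the nearest closing quote.
my ($first) = $said =~ /"(.+?)"/;

# Reluctant with /g in list context: every quoted phrase.
my @all = $said =~ /"(.+?)"/g;

# Saying what you mean: anything that is not a quote, between quotes.
my @negated = $said =~ /"([^"]+)"/g;

print "greedy:  $greedy\n";
print "first:   $first\n";
print "all:     ", join(' | ', @all), "\n";
print "negated: ", join(' | ', @negated), "\n";
```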

There is also a shortcut for being greedy without backtracking (a possessive quantifier): append a + to the quantifier, just as a ? is appended for reluctant matching.

*+   {2,5}+   ++

'aaaa' =~ /a++a/ will never match, but /a+a/ will.

Greed vs Possessive
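A small sketch of the possessive behaviour (possessive quantifiers need Perl 5.10 or later). The second half uses an invented string to show a case where giving nothing back is harmless, because the class can never match the closing quote anyway.

```perl
use strict;
use warnings;

# Greedy a+ takes all four letters, then gives one back so the final "a" can match.
my $plain = 'aaaa' =~ /a+a/ ? 1 : 0;        # 1

# Possessive a++ takes all four and refuses to give any back, so the final "a" fails.
my $possessive = 'aaaa' =~ /a++a/ ? 1 : 0;  # 0

# [^"]*+ can safely be possessive: it stops at the quote on its own.
my ($quoted) = 'say "hi" now' =~ /("[^"]*+")/;

print "plain=$plain possessive=$possessive quoted=$quoted\n";
```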

/(\d{2,})\1/ # matches a repeating set of numbers

\1 refers to the first set of parentheses (subexpression) in the expression, and says "match that again, please"

In substitutions, which we'll talk about later, you could refer to that as $1, a global in Perl

Captures and backreferences

How about making things more clear with names, rather than numbers?

/(\d{2,})\1/

/(?<numbergroup>\d{2,})\g{numbergroup}/ # matches a repeating set of numbers

It is important to note that named captures are just aliases to the numbered backreferences, and can bite you in specific circumstances, e.g. the branch reset pattern.

Named captures and backreferences
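Both forms can be tried on the same input. This is a sketch with a made-up string; named captures and \g{name} need Perl 5.10 or later, and the named value is read back through the built-in %+ hash.

```perl
use strict;
use warnings;

my $order = 'order 4747 shipped';   # illustrative input

# Numbered: \1 must repeat whatever the first group captured.
# \d{2,} first grabs "4747", then backtracks to "47" so that \1 can match.
my ($numbered) = $order =~ /(\d{2,})\1/;

# Named: the same pattern, with the group and the backreference given a name.
my $named;
if ($order =~ /(?<numbergroup>\d{2,})\g{numbergroup}/) {
    $named = $+{numbergroup};
}

# With no repeated run of digits, neither pattern matches.
my $no_repeat = 'order 4712 shipped' =~ /(\d{2,})\1/ ? 1 : 0;

print "numbered=$numbered named=$named no_repeat=$no_repeat\n";
```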

Modifiers can be used in combination.

Modifiers

Modifier   Means         Does
/g         Global        Match as many times as possible.
/i         Insensitive   Case insensitivity, even with character classes.
/s         Single        Treats even a multi-line string as a single line, such that . will match "\n", for example.
/m         Multiple      For multi-line strings, ^ and $ will match the beginning and end of each line.
/x         eXtend        Ignores whitespace in the pattern and allows # comments; a good way to document your regex.
/e         Eval          Evaluate the replacement value as an expression, and use the result for the substitution.
/r         Return        Return the modified string without actually modifying the original (e.g. $_) during substitution.
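A sketch of several modifiers on one invented two-line string, including two used in combination. The /r modifier needs Perl 5.14 or later.

```perl
use strict;
use warnings;

my $log = "alpha one\nbeta two\n";   # illustrative two-line input

# /m and /g combined: ^ anchors at every line start, /g collects every capture.
my @firsts = $log =~ /^(\w+)/mg;

# /s: without it "." refuses to cross the newline; with it, "." matches "\n".
my $without_s = $log =~ /one.beta/  ? 1 : 0;   # 0
my $with_s    = $log =~ /one.beta/s ? 1 : 0;   # 1

# /x: whitespace in the pattern is ignored and comments are allowed.
my ($second) = $log =~ /
    ^ \w+      # first word on the first line
    \s+        # separator
    (\w+)      # second word, captured
/x;

# /r: return a modified copy and leave $log untouched.
my $capitalized = $log =~ s/(\w+)/\u$1/gr;

print "firsts: @firsts\n";
print "without_s=$without_s with_s=$with_s second=$second\n";
print $capitalized;
```

One caution with /x: the delimiter character still ends the pattern even inside a comment, so comments in a /.../x regex should avoid a literal slash.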

Use s/// with almost any punctuation character or bracket pair as delimiters.

s/Alan/someone better at Java/

s#\b(\w+)\b#ucfirst $1#ge
"alan loves regular expressions" becomes "Alan Loves Regular Expressions"

s{(\w)(\w)}{$2$1}g or s/(\w{2})/reverse $1/ge
Swap adjacent word characters.

Substitution
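The substitutions above can be tried as follows. This is a sketch: the sample strings other than the "alan loves regular expressions" sentence are invented, and the /e version spells out scalar reverse so that the string-reversing behaviour does not depend on context.

```perl
use strict;
use warnings;

# Alternative delimiters (#) plus /e: the replacement is Perl code, run per match.
my $title = 'alan loves regular expressions';
(my $capitalized = $title) =~ s#\b(\w+)\b#ucfirst $1#ge;

# Bracket delimiters: swap each adjacent pair of word characters.
my $swapped = 'abcd ef';
$swapped =~ s{(\w)(\w)}{$2$1}g;

# The same swap via /e, reversing each two-character chunk.
(my $reversed = 'abcd ef') =~ s/(\w{2})/scalar reverse $1/ge;

# In scalar context, s/// returns the number of substitutions made.
my $count = (my $copy = 'a-b-c') =~ s/-/+/g;

print "$capitalized\n";
print "swapped=$swapped reversed=$reversed count=$count copy=$copy\n";
```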

● lookaheads (and the negative counterpart)
● lookbehinds (and the negative counterpart)
● other extended regular expressions
● ?: for blocking the capture of a set of parentheses
● Regexp::Common (named aliases for commonly used regexes, including very complicated ones)
● The entire Regexp:: namespace (including Regexp::Debugger, used in this presentation)

Topics not (heavily) covered, but that are related and interesting
