Regular Expressions

MacSysAdmin, Göteborg, September 2012

“It’s called grep because it greps for things.”

-- Rob Pike, Bell Labs

Introducing My Favourite Unix File: /usr/share/dict/words

235,886 English words in alphabetical order.

A Quick Demo: Grepping for Things

A pattern. Matched against a line.

Did it match at all?

Where, specifically, did it match?

Can we change the part that matched?

A little history.

Theoretical Computer Science: Formal Language and Automata Theory

1950s

• Stephen Cole Kleene – Foundations of Recursion Theory
– Mathematical notation of “regular sets”
– Kleene star: a unary operation on a set of strings, known in mathematics as the “free monoid construction.”
– Application of the Kleene star to a set V is written as V*

QED

Ordinary character Matches itself

^ Beginning of line

$ End of line

. Any character

[string] Any character in the string

[^string] Any character not in the string

* Zero or more occurrences

a|b Either of the expressions

(expr) Grouping

UNIX and the “ed” editor

• Hey, I’ll show you.

• You should know this editor. Some day you’ll need it.

March 3, 1973

grep

$ grep pattern file1 file2 file3....

And soon, “egrep” and “fgrep”.

Early UNIX Regular Expressions (Kernighan and Pike, 1984)

c        Any non-special character matches itself
\c       Turn off any special meaning of c
^        Beginning of line
$        End of line
.        Any single character
[...]    Any one of the characters in the range
[^...]   Any one character not in the range
\(r\)    Tagged regular expression (grep only)
\x       What the x’th \(expression\) matched (grep only)
r*       Zero or more occurrences of r
r+       One or more occurrences of r (egrep only)
r?       Zero or one occurrences of r (egrep only)
r1r2     r1 followed by r2
r1|r2    r1 or r2 (egrep only)

Extended Regular Expressions - egrep

Implemented first by “egrep”: ?, + and |

Early Unix - THREE grep programs

grep
• Early. Based on “ed”

egrep
• Extended Grep. Fancier regular expressions. Runs faster, but starts slower.

fgrep
• “Fixed Grep”. No patterns, static strings only. Very efficient.
• fgrep -f fileOfWordsICannotSpell myDocument

“The distinction is hard to justify”.

Surely we don’t have three grep programs any more. It’s 2012.

Today’s UNIX. Much better.

$ ls -l /usr/bin/*grep
-rwxr-xr-x  6 root  wheel  29664 23 Jul 20:58 /usr/bin/bzegrep
-rwxr-xr-x  6 root  wheel  29664 23 Jul 20:58 /usr/bin/bzfgrep
-rwxr-xr-x  6 root  wheel  29664 23 Jul 20:58 /usr/bin/bzgrep
-rwxr-xr-x  3 root  wheel  29664 23 Jul 20:57 /usr/bin/egrep
-rwxr-xr-x  3 root  wheel  29664 23 Jul 20:57 /usr/bin/fgrep
-rwxr-xr-x  3 root  wheel  29664 23 Jul 20:57 /usr/bin/grep
-rwxr-xr-x  2 root  wheel  25632 23 Jul 20:58 /usr/bin/pgrep
-rwxr-xr-x  6 root  wheel  29664 23 Jul 20:58 /usr/bin/zegrep
-rwxr-xr-x  6 root  wheel  29664 23 Jul 20:58 /usr/bin/zfgrep
-rwxr-xr-x  6 root  wheel  29664 23 Jul 20:58 /usr/bin/zgrep
-rwxr-xr-x  1 root  wheel   1188 23 Jul 20:58 /usr/bin/zipgrep

Perl-Compatible Regular Expressions

Inspired by a billion things Perl 5 added to regular expressions
• lookahead, lookbehind, lookaround

Used by
• Apache
• PHP
• Postfix
• Safari
• ack

A C library used by many other tools

2003: The Real Triumph of Grep

2003: The Triumph of Grep

It can be simple.

/* forward declarations so the functions can call each other */
int matchhere(char *regexp, char *text);
int matchstar(int c, char *regexp, char *text);

/* match: search for regexp anywhere in text */
int match(char *regexp, char *text)
{
    if (regexp[0] == '^')
        return matchhere(regexp+1, text);
    do {    /* must look even if string is empty */
        if (matchhere(regexp, text))
            return 1;
    } while (*text++ != '\0');
    return 0;
}

/* matchhere: search for regexp at beginning of text */
int matchhere(char *regexp, char *text)
{
    if (regexp[0] == '\0')
        return 1;
    if (regexp[1] == '*')
        return matchstar(regexp[0], regexp+2, text);
    if (regexp[0] == '$' && regexp[1] == '\0')
        return *text == '\0';
    if (*text != '\0' && (regexp[0] == '.' || regexp[0] == *text))
        return matchhere(regexp+1, text+1);
    return 0;
}

/* matchstar: search for c*regexp at beginning of text */
int matchstar(int c, char *regexp, char *text)
{
    do {    /* a * matches zero or more instances */
        if (matchhere(regexp, text))
            return 1;
    } while (*text != '\0' && (*text++ == c || c == '.'));
    return 0;
}

c    matches any literal character c
.    matches any single character
^    matches the beginning of the input string
$    matches the end of the input string
*    matches zero or more occurrences of the previous character
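To poke at this matcher interactively, the three functions can be transliterated into Python. This is only a sketch of the same algorithm: the names match_here and match_star are Python-style renderings of the C names, and string slicing stands in for the pointer arithmetic.

```python
def match(regexp: str, text: str) -> bool:
    """Search for regexp anywhere in text."""
    if regexp.startswith("^"):
        return match_here(regexp[1:], text)
    # Try every starting position, including the empty string at the end.
    for start in range(len(text) + 1):
        if match_here(regexp, text[start:]):
            return True
    return False


def match_here(regexp: str, text: str) -> bool:
    """Search for regexp at the beginning of text."""
    if not regexp:
        return True
    if len(regexp) > 1 and regexp[1] == "*":
        return match_star(regexp[0], regexp[2:], text)
    if regexp == "$":
        return text == ""
    if text and regexp[0] in (".", text[0]):
        return match_here(regexp[1:], text[1:])
    return False


def match_star(c: str, regexp: str, text: str) -> bool:
    """Search for c*regexp at the beginning of text."""
    i = 0
    while True:
        # A * matches zero or more instances, so test before consuming.
        if match_here(regexp, text[i:]):
            return True
        if i < len(text) and (text[i] == c or c == "."):
            i += 1
        else:
            return False
```

For example, match("hen$", "lichen") is true while match("hen$", "chicken") is false.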

Using Today’s “grep”

About “grep”

$ grep pattern file1 file2 file3 ....

Read each file in turn
• One line at a time

If the pattern matches anywhere in the line,
• print the line.

$ grep steve /usr/share/dict/words
stevedorage
stevedore
stevedoring
stevel
steven
Transteverine
Trastevere
Trasteverine
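The loop grep runs can be sketched in a few lines of Python. This is an illustration of the idea, not grep's implementation: the function name grep and the SAMPLE_WORDS list (a stand-in for the dictionary file) are made up for the example.

```python
import re
from typing import Iterable, Iterator


def grep(pattern: str, lines: Iterable[str]) -> Iterator[str]:
    """Yield each line in which the pattern matches anywhere."""
    compiled = re.compile(pattern)
    for line in lines:
        # search() looks for a match anywhere in the line, like grep does.
        if compiled.search(line):
            yield line


# Illustrative stand-in for a few lines of /usr/share/dict/words.
SAMPLE_WORDS = ["Steve", "stevedore", "steven", "stephen", "Trastevere"]
matches = list(grep("steve", SAMPLE_WORDS))
print(matches)  # ['stevedore', 'steven', 'Trastevere']
```

“Steve” is not printed because matching is case-sensitive by default.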

Getting fancy with grep

grep -v: “Print the lines that DON’T match”

$ grep -v steve /usr/share/dict/words
A
a
aa
aal
aalii
aam
Aani
aardvark
aardwolf
Aaron

grep -c: “Just count the matches”

$ grep -c steve /usr/share/dict/words
8

Tip: Remember this one!

grep -o: “Just print the matching part”

$ grep -o steve /usr/share/dict/words
steve
steve
steve
steve
steve
steve
steve
steve

grep -q: “Be quiet.”

$ grep -q steve /usr/share/dict/words
$
$ if grep -q steve /usr/share/dict/words; then echo Not a good password; fi
Not a good password

grep -C n: “Print some context.”

$ grep -C 2 steve /usr/share/dict/words

--
transsubjective
transtemporal
Transteverine
transthalamic
transthoracic
--

--
trashy
trass
Trastevere
Trasteverine
trasy
traulism

2 lines before, 2 lines after each match.

grep --color: “Colour the matches.”

$ grep --color steve /usr/share/dict/words
stevedorage
stevedore
stevedoring
stevel
steven
Transteverine
Trastevere
Trasteverine

grep -i: “Case-insensitive matching.”

$ grep -i steve /usr/share/dict/words
Steve
stevedorage
stevedore
stevedoring
stevel
Steven
steven
Stevensonian
Stevensoniana
Transteverine
Trastevere
Trasteverine

grep -r: “Recursively go through subdirectories”
grep -l: “Just list the matching file names”

$ grep -r -l -i Tennis ~/Data
/Users/me/Data/Medallists.csv
/Users/me/Data/Shakespeare.txt
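As a rough map from the options above onto Python's re module, here is a sketch. SAMPLE_WORDS is an invented stand-in for the dictionary file, and real grep reads files rather than lists.

```python
import re

# Illustrative stand-in for a few lines of /usr/share/dict/words.
SAMPLE_WORDS = ["Steve", "stevedore", "aardvark", "Trastevere"]
pattern = re.compile("steve")

# Like -v: keep the lines that do NOT match.
inverted = [w for w in SAMPLE_WORDS if not pattern.search(w)]

# Like -c: count matching lines (not individual matches).
count = sum(1 for w in SAMPLE_WORDS if pattern.search(w))

# Like -o: keep only the part of each line that matched.
only_matching = [m.group(0) for w in SAMPLE_WORDS for m in pattern.finditer(w)]

# Like -q: no output, just a yes/no answer.
found = any(pattern.search(w) for w in SAMPLE_WORDS)

# Like -i: case-insensitive matching.
ignore_case = [w for w in SAMPLE_WORDS if re.search("steve", w, re.IGNORECASE)]
```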

A pattern.

That you match against text. (Usually against a single line.)

Pattern         Matches this       Doesn’t match this
steve           steve              stephen
                steven             Steve
                Emilio Estevez
[0123456789]    Can$200            Two hundred dollars
.               Steve

Anchors: Beginning or end of the line

Pattern    Matches this    Doesn’t match
^Ste       Steve           Hello Steve
           Stephen
           Stegosaurus
hen$       Stephen         Hello Steve
           lichen          Chicken
           strengthen
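The anchor rows can be checked with a short Python sketch. The list is just the example lines from the table, and re.search is used because, like grep, it looks anywhere in the line unless an anchor pins it down.

```python
import re

lines = ["Steve", "Stephen", "Hello Steve", "lichen", "Chicken"]

# ^ pins the match to the beginning of the line.
starts_with_ste = [s for s in lines if re.search(r"^Ste", s)]

# $ pins the match to the end of the line.
ends_with_hen = [s for s in lines if re.search(r"hen$", s)]

print(starts_with_ste)  # ['Steve', 'Stephen']
print(ends_with_hen)    # ['Stephen', 'lichen']
```

“Chicken” fails hen$ because it ends in “ken”, and “Hello Steve” fails ^Ste because the line starts with “Hello”.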

Repetition: Repeat the previous pattern

Pattern   Matches this                 Meaning
S*        Steve                        Zero or more
          “HISSSS” said the snake
          |Tycho
S?        |Steve                       Zero or one (Optional)
          |Tycho
S+        Steve                        One or more
          SSSSnake
S{3}      SSSnake                      Exactly three
S{3,4}    SSSnake                      Three or four
          SSSSnake
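A small Python sketch shows what each repetition operator actually grabs. The helper first_match is a made-up name for this example; it returns the text of the leftmost match, or None when nothing matches.

```python
import re
from typing import Optional


def first_match(pattern: str, text: str) -> Optional[str]:
    """Return the text of the leftmost match, or None if there is none."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


zero_or_more = first_match(r"S*", "Tycho")       # matches, but matches nothing
zero_or_one = first_match(r"S?", "Steve")        # takes the optional S
one_or_more = first_match(r"S+", "SSSSnake")     # takes all four
no_match = first_match(r"S+", "Tycho")           # + needs at least one S
exactly_three = first_match(r"S{3}", "SSSSnake")
three_or_four = first_match(r"S{3,4}", "SSSSnake")
```

Note that S* succeeds on “Tycho” with an empty match at the start of the line, which is why a bare S* matches every line.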

Note: * is “greedy”. It matches as much as it can.

Suppose you want to match HTML tags.

You have a string

<HTML> <HEAD> <TITLE>My Page</TITLE> </HEAD> </HTML>

You have a pattern <.*>

You can say <.*?> to make it “lazy” instead of “greedy”
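The difference is easy to see in Python, whose re module supports the same lazy ? suffix. This is a sketch using the string from the slide: the greedy pattern swallows everything from the first < to the last >, while the lazy one stops at the first > it can.

```python
import re

html = "<HTML> <HEAD> <TITLE>My Page</TITLE> </HEAD> </HTML>"

# Greedy: .* runs to the end of the line, then backs up to the last '>'.
greedy = re.findall(r"<.*>", html)

# Lazy: .*? grows one character at a time and stops at the first '>'.
lazy = re.findall(r"<.*?>", html)

print(len(greedy))  # 1
print(lazy)         # ['<HTML>', '<HEAD>', '<TITLE>', '</TITLE>', '</HEAD>', '</HTML>']
```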

Combinations

Pattern         Matches lines that...
.*              Anything at all.
^x              Start with x
y$              End with y
^x...           Start with x; at least 3 more characters
^x.*y$          Start with x, anything in the middle, end with y
..........      Have at least 10 characters
^..........$    Have exactly 10 characters
^.{10}$         Have exactly 10 characters

Tip: Remember this one!
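A few of these combinations, tried in Python against a made-up word list (the words are chosen only to exercise the patterns):

```python
import re

words = ["xylophonist", "x-ray", "oxygen", "xenophobia", "background", "box"]


def matching(pattern: str) -> list:
    """Return the words in which the pattern matches anywhere, as grep would."""
    return [w for w in words if re.search(pattern, w)]


start_x = matching(r"^x")            # start with x
x_to_y = matching(r"^x.*y$")         # start with x, end with y
at_least_10 = matching(r"..........")  # ten dots, unanchored
exactly_10 = matching(r"^.{10}$")    # anchored at both ends
```

Without the anchors, ten dots match any line of ten characters or more; the ^ and $ are what turn “at least” into “exactly”.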

Advanced Patterns

Backreferences (egrep)

References to an (expression) you already matched.

\1 is the first one, \2 is the 2nd, etc.

Example: Find words with double letters

$ egrep '(.)\1' words
aa
aal
aalii
aam
aardvark
aardwolf
abactinally
abaff
abaissed
abandonee
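Python's re module uses the same \1 notation, so the double-letter search can be sketched like this (the word list is a small made-up sample):

```python
import re

words = ["aardvark", "abandonee", "steve", "Aaron", "abaff"]

# (.) captures any one character; \1 then demands that same character again.
double_letters = [w for w in words if re.search(r"(.)\1", w)]

print(double_letters)  # ['aardvark', 'abandonee', 'abaff']
```

“Aaron” is left out because the backreference is case-sensitive: “A” and “a” are different characters.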

Substitution (sed, vi, perl, Xcode)

You match something with (grouping) and (more grouping)

You replace it with something else, using \1, \2 to refer to the matched groups

s/pattern/replacement/

perl -p -e 's/(.*):(.*)/\2,\1/;' -- Flip the first two colon-separated fields around

foo:bar => bar,foo

Tycho:Sjogren
Arek:Dreyer

perl -p -e 's/(.*):(.*)/\2,\1/;'

Sjogren,Tycho
Dreyer,Arek
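The same substitution in Python, as a sketch (flip_fields is a name invented for this example). It also shows a consequence of greedy matching: with more than two fields, the first (.*) takes everything up to the last colon.

```python
import re


def flip_fields(line: str) -> str:
    """Swap the text before and after a colon, joining with a comma."""
    return re.sub(r"(.*):(.*)", r"\2,\1", line)


print(flip_fields("Tycho:Sjogren"))  # Sjogren,Tycho
print(flip_fields("Arek:Dreyer"))    # Dreyer,Arek
print(flip_fields("a:b:c"))          # c,a:b  (greedy: split at the LAST colon)
```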

perl -i.bak -p -e 's/(.*):(.*)/\2,\1/;' names.txt

Before, names.txt:
Tycho:Sjogren
Arek:Dreyer

After, names.txt.bak (backup of the original):
Tycho:Sjogren
Arek:Dreyer

After, names.txt (edited in place):
Sjogren,Tycho
Dreyer,Arek

Try this!

$ cd /usr/share/dict
$ egrep '(.)\1' words

Challenge: Find a word with a repeating five letter pattern (like “abcdeabcde”)

Prize Challenge:

How many 14 letter words are in /usr/share/dict/words ?

Really Advanced and Ugly Looking Patterns

“Lookaround”: Lookahead and Lookbehind

Pattern Means ...

(?=pattern) Zero-width positive lookahead.

(?!pattern) Zero-width negative lookahead.

(?<=pattern) Zero-width positive lookbehind.

(?<!pattern) Zero-width negative lookbehind.

A Negative Lookahead Example: “U” always follows “Q”, right?

Negative Lookahead: A “q” not followed by a “u”

q(?!u)

Why not just use q[^u] ?

Didn’t we use that already?

A q followed by a non-u character?

Words with “q” not followed by “u”: Normal RE vs. Negative Lookahead
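The two patterns differ on words that end in “q”. The character class [^u] has to consume some character after the q, so it fails when nothing follows; the lookahead only asserts that a “u” is not next and consumes nothing. A sketch in Python, which supports lookahead, using a made-up word list:

```python
import re

words = ["queen", "Iraq", "qintar", "quick", "tranq"]

# Normal RE: q followed by one character that is not u.
normal_re = [w for w in words if re.search(r"q[^u]", w)]

# Negative lookahead: q, as long as the next thing is not u (end of line is fine).
lookahead = [w for w in words if re.search(r"q(?!u)", w)]

print(normal_re)  # ['qintar']
print(lookahead)  # ['Iraq', 'qintar', 'tranq']
```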

A Problem: Lookahead and Lookbehind don’t work everywhere.

Perl 5

grep -P
• 10.7 and earlier only
• 10.8 uses a different grep. You can use “perl -e”

ack

Nice tutorial here:
• http://www.regular-expressions.info/lookaround.html

Comments and Free-Spacing: Certain tools only (perl)

Match a valid date:

^(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$

Or, with comments and free-spacing

# Match a 20th or 21st century date in yyyy-mm-dd format
(19|20)\d\d                # year (group 1)
[- /.]                     # separator
(0[1-9]|1[012])            # month (group 2)
[- /.]                     # separator
(0[1-9]|[12][0-9]|3[01])   # day (group 3)
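Python offers the same free-spacing style through the re.VERBOSE flag. The sketch below combines the anchored one-line pattern with the commented layout; the name DATE is made up for the example. Two things worth noticing: group 1 captures only the century (“19” or “20”), and the pattern checks the shape of a date, not the calendar.

```python
import re

# Whitespace and # comments are ignored under re.VERBOSE,
# except inside a character class such as [- /.].
DATE = re.compile(r"""
    ^(19|20)\d\d                # year (group 1 holds just the century)
    [- /.]                      # separator
    (0[1-9]|1[012])             # month (group 2)
    [- /.]                      # separator
    (0[1-9]|[12][0-9]|3[01])$   # day (group 3)
""", re.VERBOSE)

ok = DATE.match("2012-09-21")
print(ok.groups())                       # ('20', '09', '21')
print(DATE.match("2012-13-01"))          # None: there is no month 13
print(bool(DATE.match("2012-02-31")))    # True: right shape, impossible date
```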

www.regular-expressions.info

Matching an Email Address

How hard can it be?

It’s easy, right?

Everybody is [email protected], right?

\S+@\S+\.\S+

Can you match this?

[email protected]

steve@[192.0.43.10]

“Steve” <[email protected]>

[email protected] (Steve (Musical Dictator))

[email protected]

Steve\ [email protected]

“steve”@example.com

steve\@[email protected]

Doing it Right Is Hard

Recommended on stackoverflow.com ...

^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4})$
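A quick Python sketch shows where this pattern falls short. The addresses are hypothetical examples at example.com (plus the bracketed IP form from the list above), and SIMPLE_EMAIL is a name made up for the test.

```python
import re

SIMPLE_EMAIL = re.compile(
    r"^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4})$"
)

candidates = [
    "steve@example.com",        # the easy case
    "steve+lists@example.com",  # '+' is not in the allowed character class
    "steve@example.museum",     # top-level domain longer than four letters
    "steve@[192.0.43.10]",      # address literal in brackets
    "Steve@example.com",        # uppercase letters are not allowed either
]

accepted = [c for c in candidates if SIMPLE_EMAIL.match(c)]
print(accepted)  # ['steve@example.com']
```

Only the first address gets through, although all five are forms that real mail systems can encounter.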

perl - Mail::RFC822::ADDRESS

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ 
\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ 
\t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ 
\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*))*)?;\s*)

RFC3696 - Checking Names

http://tools.ietf.org/html/rfc3696

So how do you really do it?

Ask what their address is.

Maybe look for an “@”

Send a test message.

If it gets through, it’s valid.

Tips

Or, “Leftover Slides”

Doing anything fancy at all?

Use egrep not grep. Or, “grep -E”: same thing.

This is an “egret”

grep handles only Basic Regular Expressions. egrep handles a lot more.

Quote the pattern!

Use “grep --color”: See exactly what’s matching.

Consider installing “ack”: betterthangrep.com, based on PCRE

Use --passthru to print all lines, even those not matching

Surprise! grep changed in Mountain Lion

GNU Grep in 10.7

BSD Grep in 10.8
– No more “-P” (“--perl-regexp”) for PCRE

Use perl itself, or go get “ack”.
• www.betterthangrep.com

Things I’m Not Proud Of

Bonus: Linguistic Tagging

Beyond Regular Expressions: For Objective-C Programmers

NSRegularExpression

NSDataDetector (iOS 4, OS X 10.7)

• Dates

• Addresses

• Links

• Phone Numbers

• Transit Information

NSLinguisticTagger

WWDC Session 215: Text and Linguistic Analysis

NSLinguisticTagger

Find ...
• Word and sentence boundaries
• Lexical classes
  – noun, verb, adjective
• Lemmas
  – Root forms of words
• Named Entities
  – Personal names, place names, organization names

NSLinguisticTagger: In Action

// Create a tagger for the schemes of interest and hand it the text.
NSLinguisticTagger *tagger = [[NSLinguisticTagger alloc]
    initWithTagSchemes:[NSArray arrayWithObjects:
        NSLinguisticTagSchemeTokenType,
        NSLinguisticTagSchemeLexicalClass,
        NSLinguisticTagSchemeNameType,
        NSLinguisticTagSchemeNameTypeOrLexicalClass,
        NSLinguisticTagSchemeLemma,
        nil]
    options:0];
[tagger setString:string];

// Walk the tokens in range, skipping whitespace.
[tagger enumerateTagsInRange:range
                      scheme:NSLinguisticTagSchemeLexicalClass
                     options:NSLinguisticTaggerOmitWhitespace
                  usingBlock:^(NSString *tag, NSRange tokenRange,
                               NSRange sentenceRange, BOOL *stop) {
    // Compare by value rather than by pointer.
    if ([tag isEqualToString:NSLinguisticTagNoun]) {
        // we have found a noun ...
    }
}];

An Example: Safe Tweets In School

Prompt the user for a line of text

Use NSLinguisticTagger to identify the nouns

Remove everything else.

Resources

regular-expressions.info

www.regexpal.com

Prize Challenge:

How many 14 letter words are in /usr/share/dict/words ?