www.ischool.drexel.edu INFO 320 Server Technology I Week 7 Regular expressions 1 INFO 320 week 7

Upload: imogen-thornton

Post on 12-Jan-2016

www.ischool.drexel.edu

INFO 320 Server Technology I

Week 7

Regular expressions

INFO 320 week 7

Overview

• One of the most powerful tools in UNIX/Linux is the ability to match text against regular expressions
  – Regular expressions overview
  – grep
  – Character classes
  – Applications

Regular expressions overview

Mostly from Regular-Expressions.info and the man pages cited

Regular expressions?

• “A regular expression (regex or regexp for short) is a special text string for describing a search pattern”
  – While developed in UNIX, regular expressions can also be used with little modification in Windows, Perl, PHP, Java, or a .NET language
  – “Little modification?” Yes, you have to be careful which set of regex rules you’re using

Regular expressions

• The down side?
  – They look like complete and utter gibberish

^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$

• The good news?
  – There are zillions of cookbook recipes for common uses of them
  – And with commands (grep, ed, sed), they can be used in scripts

Fancy wildcards?

• The basic idea is that regexes are wildcards on steroids
• We saw that, in bash scripting
  – A star ‘*’ can substitute for zero or more of any character (except a line break)
  – A question mark ‘?’ can substitute for exactly one of any character; in a regex it does NOT
  – We’ll refine our use of brackets [ ] to include or exclude any specific one character

Regex syntax

• Within UNIX, there are variations on regex syntax
  – GNU grep (our main tool) uses GNU Basic Regular Expressions syntax (BRE)
  – GNU egrep uses GNU Extended Regular Expressions syntax (ERE)
  – POSIX-compliant systems use POSIX Basic Regular Expressions for grep, or POSIX Extended Regular Expressions for egrep

BRE (grep) vs ERE (egrep)

• The only difference is that BREs use backslashes to give various characters a special meaning, while EREs use backslashes to take away the special meaning of the same characters
• egrep has the same functions as grep; only the pattern syntax differs
  – grep -E is the same as egrep
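
The backslash difference can be seen with a repetition count. This is a minimal sketch; the word list and variable names are made up for illustration, and it assumes a grep that follows the POSIX BRE/ERE rules (GNU grep does).

```shell
# Hypothetical sample input: three candidate spellings, one per line.
words=$(printf 'color\ncolour\ncolr')

# BRE (plain grep): braces need backslashes to act as a repetition count.
bre=$(printf '%s\n' "$words" | grep 'colou\{0,1\}r')

# ERE (grep -E): the same braces are special without backslashes.
ere=$(printf '%s\n' "$words" | grep -E 'colou{0,1}r')

# In BRE a bare ? is a literal question mark, so no line matches.
# grep exits non-zero when nothing matches, hence the "|| true".
bre_literal=$(printf '%s\n' "$words" | grep -c 'colou?r' || true)

printf 'BRE:\n%s\nERE:\n%s\nBRE with bare ?: %s line(s)\n' "$bre" "$ere" "$bre_literal"
```

Both forms select color and colour; only the amount of backslashing changes.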

Ed and sed

• Similar regex rules are used by grep, ed, and sed
  – ed is a text line editor
  – sed is used to perform basic transformations on an input text stream

grep

Regular expressions and grep

• Regular expressions came into wide use in UNIX in the 1970s through the ed editor and the ‘grep’ command
  – grep = global regular expression print (from the ed command g/re/p)
  – egrep = extended grep
• We’ll focus on grep
  – grep matches BREs, which were defined by IEEE Std 1003.1-2001, Section 9.3, Basic Regular Expressions (now dated 2008)

grep syntax

• The basic form is
  – grep -options pattern file
• The normal output from grep is a text list of all the lines which matched the pattern in the file
  – Notice that a word like ‘remember’ which is hyphenated across two lines (‘re-’ at the end of one, ‘member’ at the start of the next) is not found! grep tests one line at a time, so matches cannot span multiple lines
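
A small sketch of the basic form, using a made-up three-line file (the file contents and variable names are assumptions for illustration). The word split across lines 2 and 3 is not reported.

```shell
# Hypothetical sample file; "remember" is hyphenated across lines 2 and 3.
sample=$(mktemp)
printf 'I remember the day\nplease re-\nmember the milk\n' > "$sample"

# Basic form: grep pattern file. Only line 1 contains the whole word.
matches=$(grep 'remember' "$sample")
echo "$matches"

rm -f "$sample"
```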

grep options

• Like most UNIX commands, grep has many options (see handout), including
  – -c shows the count of lines matched, instead of the lines themselves
  – -i ignores case when matching (!)
  – -n gives the line number of each line matched
  – -v gives lines which don’t match the pattern(s) as output
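
The four options above can be tried on a throwaway file. This is only a sketch; the file contents and variable names are invented for the example.

```shell
# Hypothetical four-line sample file.
sample=$(mktemp)
printf 'Apple pie\nbanana split\napple tart\ncherry cake\n' > "$sample"

count=$(grep -c 'apple' "$sample")         # number of matching lines, case-sensitive
count_i=$(grep -c -i 'apple' "$sample")    # number of matching lines, ignoring case
numbered=$(grep -n 'apple' "$sample")      # each match prefixed with its line number
inverted=$(grep -v -i 'apple' "$sample")   # lines that do NOT match

printf '%s\n' "$count" "$count_i" "$numbered" "$inverted"
rm -f "$sample"
```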

grep options

• You can also include a list of patterns with the -e option

• Or use a file with patterns using the -f option

• You can match lines where the whole line matches the pattern, with the -x option
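
A sketch of these three options, assuming two made-up temporary files (one with data, one with patterns, one pattern per line):

```shell
# Hypothetical data file and pattern file.
sample=$(mktemp)
patterns=$(mktemp)
printf 'cat\ncatalog\ndog\nbird\n' > "$sample"
printf 'dog\nbird\n' > "$patterns"

either=$(grep -e 'cat' -e 'dog' "$sample")   # lines matching any listed pattern
from_file=$(grep -f "$patterns" "$sample")   # patterns read from a file
whole=$(grep -x 'cat' "$sample")             # pattern must match the entire line

printf '%s\n---\n%s\n---\n%s\n' "$either" "$from_file" "$whole"
rm -f "$sample" "$patterns"
```

Note that -x rejects catalog, which a plain search for cat would accept.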

Search patterns

• As a good habit, put the search pattern in quotes; single quotes are safest, because the shell still expands $ and backslashes inside double quotes
  – The pattern is a regular expression
• If you give an empty pattern all lines will be matched
  – So what does grep -c '' filename do?

Metacharacters

• Regex metacharacters are characters that have special meaning in this context
• We’ll look at them in groups
  – We already mentioned the shell wildcard ‘*’, which matches zero or more of any character (except newline); in a regex, ‘*’ instead repeats the preceding token, so the equivalent is ‘.*’
  – To match exactly one of any character, use a period ‘.’
    • Notice a ‘?’ did this in the context of scripting

Metacharacters

• We can identify words at the start or end of a line
• ‘^’ (the caret) marks the start of the line
  – ‘^Four’
• ‘$’ (dollar) marks the end of the line
  – ‘ago$’
  – Again, different meaning than in scripting
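
A quick sketch of both anchors. The three sample lines are invented; the third contains both words, but not at the start or end of the line.

```shell
# Hypothetical sample text, three lines.
text=$(printf 'Four score and seven years ago\nour fathers brought forth\nNot Four but five, long ago today')

starts=$(printf '%s\n' "$text" | grep '^Four')   # Four at the very start of a line
ends=$(printf '%s\n' "$text" | grep 'ago$')      # ago at the very end of a line

printf 'start: %s\nend:   %s\n' "$starts" "$ends"
```

The single quotes matter here: they stop the shell from treating $ as the start of a variable.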

Metacharacters

• We can identify the start or end of a word
• ‘\<’ marks the start of a word
  – ‘\<eat’ would match eats or eating, not feat
• ‘\>’ marks the end of a word
  – ‘ing\>’ would match loving but not sings
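
The word-boundary markers can be checked against the example words. This sketch assumes GNU grep, since \< and \> are extensions rather than part of POSIX BRE; the word list is made up.

```shell
# Hypothetical word list, one word per line.
words=$(printf 'eats\neating\nfeat\nloving\nsings')

word_start=$(printf '%s\n' "$words" | grep '\<eat')   # eat at the start of a word
word_end=$(printf '%s\n' "$words" | grep 'ing\>')     # ing at the end of a word

printf 'start of word:\n%s\nend of word:\n%s\n' "$word_start" "$word_end"
```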

Character classes

Character classes

• With a "character class" (or set) you can tell the regex engine to match only one out of several characters
  – Simply place the possible characters you want to match between square brackets
• If you want to match an a or an e, use [ae]
  – You could use this in gr[ae]y to match either gray or grey
    • Very useful if you do not know whether the document you are searching through is written in American or British English

From http://www.regular-expressions.info/charclass.html

Character classes

• The order of the characters inside a character class does not matter
  – The results are identical for [ae] or [ea]
• The characters don’t have to be sequential
  – [dptjgm583;] is fine
  – But if you want to match the special characters [ \ ^ $ . | ? * + ( ) { } literally, you need to add a backslash before them
    • So [abc\\\?] matches a, b, c, \ or ? (in grep, where a backslash inside brackets is already literal, [abc\?] is enough)

Character classes

• More generally in character classes
  – ‘[]’ matches any one character specified between the brackets
  – ‘[^abc]’ matches any one character NOT specified between the brackets
    • That example matches one character that is not a, b or c
    • Notice the ^ has a very different meaning in a character class than as its own metacharacter

Character classes

• Within character classes, ranges of possible characters can be given
  – [a-z] means any lower case letter
  – [a-zA-Z] means any upper or lower case letter
  – [^a-zA-Z0-9] could be any character that isn’t a letter or number
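
Character classes, negation and ranges can be sketched together. The input lines and variable names are invented; LC_ALL=C is an added assumption so that ranges such as [a-z] mean plain ASCII ranges regardless of the system locale.

```shell
# Hypothetical sample lines.
lines=$(printf 'gray\ngrey\ngruy\nA1\n??')

# a or e in the third position
spelling=$(printf '%s\n' "$lines" | LC_ALL=C grep 'gr[ae]y')
# any character except a or e in the third position
not_ae=$(printf '%s\n' "$lines" | LC_ALL=C grep 'gr[^ae]y')
# count of lines containing at least one lowercase letter
lower=$(printf '%s\n' "$lines" | LC_ALL=C grep -c '[a-z]')
# lines made up entirely of characters that are not letters or digits
no_alnum=$(printf '%s\n' "$lines" | LC_ALL=C grep -x '[^a-zA-Z0-9]*')

printf '%s\n' "$spelling" "$not_ae" "$lower" "$no_alnum"
```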

Metacharacters

• The pipe means logical OR in an expression, here called alternation
  – abc(def|xyz) matches abcdef or abcxyz
• Multiple alternations are allowed
  – s(i|a|o)ng matches sing, sang or song (inside square brackets the pipe would be just another literal character)
• Notice the parentheses group a string of characters to be treated as one
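
Alternation needs grouping with parentheses; putting pipes inside square brackets gives a character class in which | is a literal. A sketch with a made-up word list, using grep -E for the unescaped ( | ) syntax:

```shell
# Hypothetical word list; the last entry contains a literal pipe character.
words=$(printf 'sing\nsang\nsong\nsung\ns|ng')

# Alternation inside a group: i or a or o.
alt=$(printf '%s\n' "$words" | grep -E -x 's(i|a|o)ng')

# Character class: i, a, o, or a literal | in that position.
class=$(printf '%s\n' "$words" | grep -x 's[i|a|o]ng')

printf 'group:\n%s\nclass:\n%s\n' "$alt" "$class"
```

The class version also accepts s|ng, which is rarely what is intended.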

Bracket expressions

• POSIX has bracket expressions to provide abbreviations for common search terms; the class name goes inside the brackets of a character class
  – For example, instead of [a-z] can use [[:lower:]]
  – [a-zA-Z] becomes [[:alpha:]]
  – [a-zA-Z0-9] becomes [[:alnum:]]
  – What does [A-Fa-f0-9] = [[:xdigit:]] mean?
• So [^x-z[:digit:]] matches a single character that is not x, y, z or a digit [0-9]
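
A sketch of POSIX classes in use. The inputs are made up, and LC_ALL=C is an added assumption to keep the x-z range locale-independent. Note the doubled brackets: the outer pair is the character class, the inner [:name:] is the POSIX class.

```shell
# Hypothetical single-character lines.
chars=$(printf 'a\nx\n7\nQ\n-')

# One character that is not x, y, z or a digit.
result=$(printf '%s\n' "$chars" | LC_ALL=C grep -x '[^x-z[:digit:]]')

# Whole lines made only of hexadecimal digits.
hex=$(printf 'cafe\nC0FFEE\nxyz\n' | LC_ALL=C grep -x '[[:xdigit:]]*')

printf '%s\n---\n%s\n' "$result" "$hex"
```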

From http://www.regular-expressions.info/posixbrackets.html

Optional

• The question mark will attempt to match the preceding token zero times or once, in effect making it optional
  – colou?r matches both colour and color
  – Nov(ember)? will match Nov and November
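
Both examples can be tried with grep -E, where ? and ( ) are special without backslashes. The word list is made up, and -x is added so partial matches such as Novem are rejected.

```shell
# Hypothetical word list.
words=$(printf 'color\ncolour\nNov\nNovember\nNovem')

spelling=$(printf '%s\n' "$words" | grep -E -x 'colou?r')     # the u is optional
month=$(printf '%s\n' "$words" | grep -E -x 'Nov(ember)?')    # the whole group is optional

printf '%s\n---\n%s\n' "$spelling" "$month"
```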

Repetition

• The asterisk or star tells the engine to attempt to match the preceding token zero or more times.
  – ‘<[A-Za-z][A-Za-z0-9]*>’ matches an HTML tag without any attributes
• The plus tells the engine to attempt to match the preceding token once or more.
  – ‘<[A-Za-z0-9]+>’ will match a tag with any one or more alphanumeric characters
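
The two tag patterns behave differently on edge cases. A sketch with invented candidate tags; the star pattern works in plain grep, while the plus pattern is run with grep -E.

```shell
# Hypothetical candidate tags, one per line.
tags=$(printf '<p>\n<h1>\n<>\n<1a>\n<a href>')

# A letter, then zero or more letters or digits.
star=$(printf '%s\n' "$tags" | LC_ALL=C grep -x '<[A-Za-z][A-Za-z0-9]*>')

# One or more letters or digits, in any order.
plus=$(printf '%s\n' "$tags" | LC_ALL=C grep -E -x '<[A-Za-z0-9]+>')

printf 'star:\n%s\nplus:\n%s\n' "$star" "$plus"
```

The plus pattern also accepts <1a>, because it does not require the first character to be a letter.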

Limiting repetition

• As a further refinement, it’s possible to specify how many times a token will be repeated, by adding {min,max} instead of a star or plus
• Max is infinite if not specified, so
  – * = {0,}, + = {1,} and ? = {0,1}
  – But {0,3} would limit the previous character to appear zero to three times
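
A sketch of {min,max} with grep -E on made-up strings of a's followed by b. The last two counts illustrate that {0,} and * select the same lines.

```shell
# Hypothetical inputs: zero to four a's followed by b.
runs=$(printf 'b\nab\naab\naaab\naaaab')

up_to_three=$(printf '%s\n' "$runs" | grep -E -x 'a{0,3}b')   # zero to three a's
two_or_more=$(printf '%s\n' "$runs" | grep -E -x 'a{2,}b')    # at least two a's
open_count=$(printf '%s\n' "$runs" | grep -E -c -x 'a{0,}b')  # same lines as a*b
star_count=$(printf '%s\n' "$runs" | grep -E -c -x 'a*b')

printf '%s\n---\n%s\n---\n%s %s\n' "$up_to_three" "$two_or_more" "$open_count" "$star_count"
```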

() [] [::]?

• So in the context of a regex
  – Parentheses ( ) are used for grouping, to treat a series of characters as one for repetition
  – Square brackets [ ] define a character class, which matches any one character in that class
  – Square brackets with colons [: :] define a POSIX bracket expression

?*+{}?

• And following any kind of grouping, character class, or bracket expression
  – ? makes a group repeated zero or one time (optional)
  – + makes a group repeated one or more times
  – * makes a group repeated zero or more times
  – Curly brackets { } are used for controlling repetition by giving min and max limits

Searching for special characters

• To match a ], put it as the first character after the opening [ or the negating ^

• To match a -, put it right before the closing ]

• To match a ^, put it before the final literal - or the closing ]

• Put together, []\d^-] matches ], \, d, ^ or -
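
The combined bracket expression can be checked one character at a time. This is a sketch with invented inputs; it relies on the POSIX rule that a backslash inside brackets is an ordinary character.

```shell
# Hypothetical inputs, one character per line: ] \ d ^ - x
chars=$(printf ']\n\\\nd\n^\n-\nx')

matched=$(printf '%s\n' "$chars" | grep -x '[]\d^-]')      # lines the class accepts
unmatched=$(printf '%s\n' "$chars" | grep -v -x '[]\d^-]') # lines it rejects

printf 'matched:\n%s\nunmatched:\n%s\n' "$matched" "$unmatched"
```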

Applications

From http://www.regular-expressions.info/examples.html

Ok, now what?

• Given this terribly complex set of rules for defining a regular expression … so what?

• Regexes are very handy for searching for specific terms, or validating inputs

• Here we’ll review a few cookbook examples

Trimming Whitespace

• A mundane example is to use regular expressions to get rid of spaces at the start and end of lines
  – Search for ^[ \t]+ and replace with nothing to delete leading whitespace
  – Search for [ \t]+$ and replace with nothing to trim trailing whitespace
  – [ \t] matches a space or tab (in grep, where \t is not special inside brackets, use [[:blank:]] instead)
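
One way to do the search-and-replace is with sed. This sketch swaps in [[:blank:]] for [ \t] and the BRE interval \{1,\} for +, so it does not depend on how a particular sed treats \t; the sample lines are made up.

```shell
# Hypothetical lines with leading and trailing spaces and tabs.
raw=$(printf '  \tleading\ntrailing \t \n\t both \t\nnone')

# First expression deletes leading blanks, second deletes trailing blanks.
trimmed=$(printf '%s\n' "$raw" | sed -e 's/^[[:blank:]]\{1,\}//' -e 's/[[:blank:]]\{1,\}$//')

printf '[%s]\n' "$trimmed"
```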

Match IP addresses

• A simplified version is \b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b

• But that will catch illegal IP addresses above 255; to fix that use
  – \b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b
  – Ok, matching numbers is tough in a text world
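
The long pattern is easier to read when the octet is written once and reused. This sketch makes two substitutions that are assumptions of the example, not part of the recipe: it anchors with ^ and $ instead of \b (each candidate is on its own line), and it is run with grep -E, which does not understand the Perl-style \d of the simplified version.

```shell
# One octet: 250-255, 200-249, or 0-199.
octet='(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'
# Four octets separated by literal dots, anchored to the whole line.
ip_regex="^$octet\.$octet\.$octet\.$octet\$"

# Hypothetical candidates, one per line.
candidates=$(printf '192.168.1.1\n255.255.255.255\n256.1.1.1\n10.0.0\n1.2.3.999')

valid=$(printf '%s\n' "$candidates" | grep -E "$ip_regex")
printf '%s\n' "$valid"
```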

Numbers are challenging

• To get a real number
  – [-+]?[0-9]*\.?[0-9]+
• But if you might need exponential notation
  – [-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?
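
Both number patterns can be compared on a few inputs. In this sketch the ^ and $ anchors are added so each whole line must be a number; the inputs are invented.

```shell
num='^[-+]?[0-9]*\.?[0-9]+$'
sci='^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$'

# Hypothetical inputs, one per line.
inputs=$(printf '42\n-3.14\n.5\n6.02e23\n1e-9\nabc\n7.')

plain=$(printf '%s\n' "$inputs" | grep -E "$num")      # no exponent allowed
with_exp=$(printf '%s\n' "$inputs" | grep -E "$sci")   # optional exponent

printf 'plain:\n%s\nwith exponent:\n%s\n' "$plain" "$with_exp"
```

Neither pattern accepts a trailing dot such as 7. because at least one digit must follow the optional decimal point.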

Validate email addresses

• If you get a string and want to see if it’s an email address, could try
  – ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$
  – What assumption is made here about case?
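
A sketch of the pattern with grep -E, run with and without -i. The addresses are made up, and LC_ALL=C is an added assumption so that [A-Z] means exactly the ASCII uppercase letters.

```shell
email='^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$'

# Hypothetical candidate strings, one per line.
addresses=$(printf 'USER@EXAMPLE.COM\nuser@example.com\nnot-an-email\nA.B@C.MUSEUM')

strict=$(printf '%s\n' "$addresses" | LC_ALL=C grep -E "$email")       # pattern as written
relaxed=$(printf '%s\n' "$addresses" | LC_ALL=C grep -E -i "$email")   # ignoring case

printf 'as written:\n%s\nwith -i:\n%s\n' "$strict" "$relaxed"
```

The {2,4} limit also rejects longer top-level domains such as the one in the last sample.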

Validate a date

• (19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])

• Matches a date in yyyy-mm-dd format from between 1900-01-01 and 2099-12-31
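
A sketch of the date pattern under grep -E. Two changes are assumptions of the example: \d is written as [0-9], and ^ and $ are added so the whole line must be a date. The sample dates are invented.

```shell
date_re='^(19|20)[0-9][0-9][- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$'

# Hypothetical candidates, one per line.
dates=$(printf '1999-12-31\n2016/01/12\n2100-01-01\n2016-13-01\n2016-02-31')

ok=$(printf '%s\n' "$dates" | grep -E "$date_re")
printf '%s\n' "$ok"
```

The pattern checks only the shape of the date: 2016-02-31 passes even though February has no 31st, so a real validator would need a further check.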

Validate credit cards

• To validate a credit card, need their format, and first strip out spaces & dashes
• Visa: ^4[0-9]{12}(?:[0-9]{3})?$
  – All Visa card numbers start with a 4; new cards have 16 digits, old cards have 13
• MasterCard: ^5[1-5][0-9]{14}$
  – All MasterCard numbers start with the numbers 51 through 55; all have 16 digits
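
The strip-then-match steps can be sketched with tr and grep -E. The (?: ) in the Visa pattern is a Perl-style non-capturing group that ERE does not support, so this example assumes a plain ( ) group instead; the card string is a sample value, not a real account.

```shell
visa='^4[0-9]{12}([0-9]{3})?$'
mastercard='^5[1-5][0-9]{14}$'

# Hypothetical user input with spaces and a dash.
raw='4111 1111-1111 1111'

# Step 1: strip spaces and dashes.
digits=$(printf '%s' "$raw" | tr -d ' -')

# Step 2: test each format; grep -c prints 1 for a match, 0 otherwise.
# grep exits non-zero on no match, hence the "|| true".
is_visa=$(printf '%s\n' "$digits" | grep -E -c "$visa" || true)
is_mc=$(printf '%s\n' "$digits" | grep -E -c "$mastercard" || true)

printf 'digits=%s visa=%s mastercard=%s\n' "$digits" "$is_visa" "$is_mc"
```

A format check like this says nothing about whether the number was actually issued.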

References

• Regular-expressions.info
  http://www.regular-expressions.info/
• Grep man page
  http://manpages.ubuntu.com/manpages/jaunty/en/man1/grep.1posix.html
• Lots of books are also available on regular expressions
