regular expression in java 101 comp204 source: sun tutorial, …

Regular Expressions in Java 101

COMP204

Source: Sun tutorial, …

What are they?

• a way to describe patterns in strings

• similar to regex in Perl

• cryptic syntax: “write once, ponder many times”

• used to search, parse, modify textual data

• Java: java.util.regex with the Pattern, Matcher, and PatternSyntaxException classes

• plus utility methods in the String class

String constants match

• regex: foo

string: foo

=> 0:3 "foo"

• regex: foo

string: foofoofoo

=> 0:3 "foo"

=> 3:6 "foo"

=> 6:9 "foo"
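The start:end listings on these slides can be reproduced with a small harness. The following is a minimal sketch, not the tutorial's own test harness; the class name `RegexHarness` and the method `findAll` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for this example only.
class RegexHarness {
    // Returns one entry per match, formatted as start:end "text".
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            // start() is inclusive, end() is exclusive.
            hits.add(m.start() + ":" + m.end() + " \"" + m.group() + "\"");
        }
        return hits;
    }

    public static void main(String[] args) {
        for (String hit : findAll("foo", "foofoofoo")) {
            System.out.println("=> " + hit);
        }
    }
}
```

Note that the end index is exclusive, which is why consecutive matches read 0:3, 3:6, 6:9.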

Meta characters

• Some characters are “special”, e.g. a single dot “.” matches any character:

• regex: cat.

string: cats

=> 0:4 "cats"

• Others are: ([{\^-$|]})?*+.

• Use a meta character literally: “escape” it with a backslash (e.g. \.), or “quote” it, e.g. \Q.\E
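A short sketch of escaping and quoting in Java source, where each regex backslash must itself be doubled inside a string literal. The class `MetaCharDemo` and the helper `matchesLiterally` are illustrative names; `Pattern.quote` is the library method that wraps a string so it is treated as a literal.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, named for this example only.
class MetaCharDemo {
    // Treats the whole of 'literal' as plain text, not as a regex.
    static boolean matchesLiterally(String literal, String input) {
        return Pattern.matches(Pattern.quote(literal), input);
    }

    public static void main(String[] args) {
        // Unescaped dot: matches any character, so "cats" matches.
        System.out.println(Pattern.matches("cat.", "cats"));          // true
        // Escaped dot: "\\." in Java source is the regex \.
        System.out.println(Pattern.matches("cat\\.", "cats"));        // false
        System.out.println(Pattern.matches("cat\\.", "cat."));        // true
        // Unquoted, + is a quantifier, so the pattern does not match itself.
        System.out.println(Pattern.matches("1+1=2", "1+1=2"));        // false
        // Quoting with \Q...\E or with Pattern.quote.
        System.out.println(Pattern.matches("\\Q1+1=2\\E", "1+1=2"));  // true
        System.out.println(matchesLiterally("1+1=2", "1+1=2"));       // true
    }
}
```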

Character classes

[abc] a, b, or c (simple class)

[^abc] any character except a, b, or c (negation)

[a-zA-Z] a through z or A through Z, inclusive (range)

[a-d[m-p]] a through d, or m through p: [a-dm-p] (union)

[a-z&&[def]] d, e, or f (intersection)

[a-z&&[^bc]] a through z, except for b and c: [ad-z] (subtraction)

[a-z&&[^m-p]] a through z, and not m through p: [a-lq-z] (subtraction)
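To see which characters a class accepts, one option is to collect every single-character match from a test string. This is only a sketch; `CharClassDemo` and `charsMatching` are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, named for this example only.
class CharClassDemo {
    // Concatenates all characters of 'input' that the character class matches.
    static String charsMatching(String charClass, String input) {
        StringBuilder kept = new StringBuilder();
        Matcher m = Pattern.compile(charClass).matcher(input);
        while (m.find()) {
            kept.append(m.group());
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        System.out.println(charsMatching("[^abc]", "abcdef"));            // def
        System.out.println(charsMatching("[a-d[m-p]]", "abcxyzmnop"));    // abcmnop
        System.out.println(charsMatching("[a-z&&[def]]", "abcdefg"));     // def
        System.out.println(charsMatching("[a-z&&[^bc]]", "abcdef"));      // adef
        System.out.println(charsMatching("[a-z&&[^m-p]]", "klmnopqr"));   // klqr
    }
}
```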

Predefined classes (see Pattern)

. Any character (may or may not match line terminators)

\d digit: [0-9]

\D non-digit: [^0-9]

\s whitespace character: [ \t\n\x0B\f\r]

\S non-whitespace character: [^\s]

\w word character: [a-zA-Z_0-9]

\W non-word character: [^\w]
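The predefined classes combine naturally with the String utility methods. A small sketch follows, with made-up helper names (`PredefinedClassDemo`, `digitsOnly`, `collapseWhitespace`, `isWord`); remember that `\d` is written `"\\d"` in Java source.

```java
// Hypothetical demo class, named for this example only.
class PredefinedClassDemo {
    // Removes every non-digit (\D) character.
    static String digitsOnly(String s) {
        return s.replaceAll("\\D", "");
    }

    // Replaces each run of whitespace (\s+) with a single space.
    static String collapseWhitespace(String s) {
        return s.trim().replaceAll("\\s+", " ");
    }

    // True if the whole string consists of one or more word characters (\w).
    static boolean isWord(String s) {
        return s.matches("\\w+");
    }

    public static void main(String[] args) {
        System.out.println(digitsOnly("COMP204, room 12"));              // 20412
        System.out.println(collapseWhitespace("  regex \t in\n Java ")); // regex in Java
        System.out.println(isWord("comp_204"));                          // true
        System.out.println(isWord("comp 204"));                          // false
    }
}
```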

Greedy Quantifiers

X? X, once or not at all

X* X, zero or more times

X+ X, one or more times

X{n} X, exactly n times

X{n,} X, at least n times

X{n,m} X, at least n but not more than m times

Reluctant quantifiers

X?? X, once or not at all

X*? X, zero or more times

X+? X, one or more times

X{n}? X, exactly n times

X{n,}? X, at least n times

X{n,m}? X, at least n but not more than m times

Possessive Quantifiers

X?+ X, once or not at all

X*+ X, zero or more times

X++ X, one or more times

X{n}+ X, exactly n times

X{n,}+ X, at least n times

X{n,m}+ X, at least n but not more than m times

What’s the difference?

// greedy quantifier

regex: .*foo

string: xfooxxxxxxfoo

=> 0:13 "xfooxxxxxxfoo"

// reluctant quantifier

regex: .*?foo

string: xfooxxxxxxfoo

=> 0:4 "xfoo"

=> 4:13 "xxxxxxfoo"

// possessive quantifier

regex: .*+foo

string: xfooxxxxxxfoo

No match found.
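The three results can be checked in code. In this sketch (`QuantifierDemo` and `findAllText` are illustrative names), the greedy `.*` takes the whole input and backtracks to the last foo, the reluctant `.*?` stops at the first foo, and the possessive `.*+` takes the whole input and never gives anything back, so foo can no longer match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, named for this example only.
class QuantifierDemo {
    // Returns the text of every successive match of 'regex' in 'input'.
    static List<String> findAllText(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        System.out.println(findAllText(".*foo", input));   // [xfooxxxxxxfoo]
        System.out.println(findAllText(".*?foo", input));  // [xfoo, xxxxxxfoo]
        System.out.println(findAllText(".*+foo", input));  // []
    }
}
```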

Capturing groups

• Quantifiers apply to single characters (e.g. a*, which matches every string, why?), character classes (e.g. \s+) or groups (e.g. (dog){2} )

• Groups are numbered left-to-right by their opening parenthesis: ((A)(B(C))) =>

1 ((A)(B(C)))

2 (A)

3 (B(C))

4 (C)

• Refer back to a group with e.g. \2 for group two:

regex: (\w)\1

string: hello

=> 2:4 "ll"
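Group numbering and backreferences can be inspected through `Matcher.group(int)` and `Matcher.groupCount()`. The sketch below uses hypothetical names (`GroupDemo`, `groups`, `firstDoubledChar`); group 0 always denotes the entire match and is not counted by `groupCount()`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, named for this example only.
class GroupDemo {
    // Returns the text captured by groups 1..n if the whole input matches.
    static List<String> groups(String regex, String input) {
        List<String> captured = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        if (m.matches()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                captured.add(m.group(i));
            }
        }
        return captured;
    }

    // Finds the first word character that is immediately repeated.
    static String firstDoubledChar(String input) {
        Matcher m = Pattern.compile("(\\w)\\1").matcher(input);
        return m.find()
                ? m.start() + ":" + m.end() + " \"" + m.group() + "\""
                : "No match found.";
    }

    public static void main(String[] args) {
        System.out.println(groups("((A)(B(C)))", "ABC")); // [ABC, A, BC, C]
        System.out.println(firstDoubledChar("hello"));    // 2:4 "ll"
    }
}
```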

Boundaries

^ The beginning of a line

$ The end of a line

\b A word boundary

\B A non-word boundary

\A The beginning of the input

\G The end of the previous match

\Z The end of the input but for the final terminator, if any

\z The end of the input
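A sketch of how boundaries restrict where a match may occur. `BoundaryDemo` and `count` are made-up names; note that by default `^` and `$` anchor at the start and end of the whole input, and only match at line breaks when the `Pattern.MULTILINE` flag is set.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, named for this example only.
class BoundaryDemo {
    // Counts the matches of 'regex' in 'input' using the given Pattern flags.
    static int count(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "dog doggie hotdog";
        // Only the standalone word "dog" has a boundary on both sides.
        System.out.println(count("\\bdog\\b", 0, text));                    // 1
        // Only the "dog" in "hotdog" starts at a non-boundary.
        System.out.println(count("\\Bdog", 0, text));                       // 1
        String lines = "dog\ndog";
        System.out.println(count("^dog$", 0, lines));                       // 0
        System.out.println(count("^dog$", Pattern.MULTILINE, lines));       // 2
        // \A still means the start of the input, even in MULTILINE mode.
        System.out.println(count("\\Adog", Pattern.MULTILINE, lines));      // 1
    }
}
```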

Pattern class

boolean b = Pattern.matches("a*b", "aaaaab");

or

Pattern p = Pattern.compile("a*b");

Matcher m = p.matcher("aaaaab");

boolean b = m.matches();

The latter allows efficient reuse of the compiled pattern.
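One way to get that reuse is to compile the pattern once into a constant and create a new Matcher per input. This is a sketch with hypothetical names (`PatternReuseDemo`, `countFullMatches`); it relies on Pattern instances being immutable.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class, named for this example only.
class PatternReuseDemo {
    // Compiled once; Pattern instances are immutable and can be shared.
    private static final Pattern A_STAR_B = Pattern.compile("a*b");

    // Counts the inputs that match a*b in their entirety.
    static int countFullMatches(List<String> inputs) {
        int count = 0;
        for (String input : inputs) {
            // A fresh Matcher per input; the compiled Pattern is reused.
            if (A_STAR_B.matcher(input).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> inputs = Arrays.asList("aaaaab", "b", "ab", "abc", "");
        System.out.println(countFullMatches(inputs)); // 3
    }
}
```

By contrast, `Pattern.matches(regex, input)` compiles the regex again on every call.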

Splitting a string using a regex

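Pattern p = Pattern.compile("a*b");

String[] items = p.split("aabbab");

for(String s : items) System.out.println(s);

// prints nothing: this input consists only of separators, so every piece is empty, and trailing empty strings are dropped

similar to split(regex) method in class String

String[] items = "aabbab".split("a*b");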
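How split treats empty pieces is easy to get wrong, so here is a small sketch. The class `SplitDemo`, the comma-separated inputs, and the helper names are invented for illustration; `Pattern.split(input, limit)` with a negative limit keeps trailing empty strings.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class, named for this example only.
class SplitDemo {
    // A comma with optional surrounding whitespace acts as the separator.
    private static final Pattern COMMA = Pattern.compile("\\s*,\\s*");

    // Default behaviour: trailing empty strings are removed.
    static String[] fields(String line) {
        return COMMA.split(line);
    }

    // A negative limit keeps trailing empty strings.
    static String[] fieldsKeepingEmpty(String line) {
        return COMMA.split(line, -1);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(fields("a , b,c")));          // [a, b, c]
        System.out.println(Arrays.toString(fields("a,b,,")));            // [a, b]
        System.out.println(Arrays.toString(fieldsKeepingEmpty("a,b,,"))); // [a, b, , ]
        // A separator at the very start yields a leading empty string.
        System.out.println(Arrays.toString(fields(",a")));               // [, a]
        // An input made up only of separators yields an empty array.
        System.out.println(Pattern.compile("a*b").split("aabbab").length); // 0
    }
}
```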

Matcher class

• loads of methods, e.g. to access groups (see test harness) or replace expressions:

Pattern p = Pattern.compile("dog");

Matcher m = p.matcher("the dog runs");

String result = m.replaceAll("cat");

System.out.println(result);

=> "the cat runs"
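Replacement strings may refer to captured groups with `$1`, `$2`, and so on, which also means a literal `$` needs care. The sketch below uses hypothetical names (`ReplaceDemo`, `swapPairs`, `replaceDogLiterally`); `Matcher.quoteReplacement` is the library method that makes a replacement string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, named for this example only.
class ReplaceDemo {
    private static final Pattern PAIR = Pattern.compile("(\\w+)=(\\w+)");
    private static final Pattern DOG = Pattern.compile("dog");

    // Swaps the two sides of every key=value pair using group references.
    static String swapPairs(String s) {
        return PAIR.matcher(s).replaceAll("$2=$1");
    }

    // Inserts 'replacement' verbatim, even if it contains $ or \.
    static String replaceDogLiterally(String s, String replacement) {
        return DOG.matcher(s).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(swapPairs("a=1, b=2"));                       // 1=a, 2=b
        System.out.println(replaceDogLiterally("the dog runs", "cat"));  // the cat runs
        System.out.println(replaceDogLiterally("the dog runs", "$1 cat")); // the $1 cat runs
    }
}
```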

String class has one-off methods

"the dog runs".replaceFirst("dog", "cat");

=> "the cat runs"

"aabcbdabe".split("a*b");

=> {"", "c", "d", "e"}

"xfooxxxxxxfoo".matches(".*foo");

=> true
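One point worth checking in code: `String.matches` succeeds only if the entire string matches, whereas `Matcher.find` looks for a match anywhere. A minimal sketch, with the illustrative names `StringMethodsDemo` and `containsMatch`:

```java
import java.util.regex.Pattern;

// Hypothetical demo class, named for this example only.
class StringMethodsDemo {
    // True if 'regex' matches some substring of 'input'.
    static boolean containsMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // matches() requires the whole string to match.
        System.out.println("xfooxxxxxxfoo".matches(".*foo")); // true
        System.out.println("xfoo!".matches(".*foo"));         // false
        // find() accepts a match anywhere in the string.
        System.out.println(containsMatch(".*foo", "xfoo!"));  // true
        // replaceFirst() changes only the first occurrence.
        System.out.println("dog eat dog".replaceFirst("dog", "cat")); // cat eat dog
    }
}
```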
