regex basics
DESCRIPTION
Ciarán Walsh's PHPNW08 slides: In the right hands, regular expressions can be a powerful tool, but it’s also far too easy for them to be used badly, or in the wrong situations. This talk will kick off with a look at alternatives to regular expressions, for when the power of pattern matching is not required, and will also go over some cases when there are better alternatives available. Then there will be a brief refresher on pattern syntax and some general tips and tricks to help when constructing regular expressions, before we go on to look at some situations where the use of pattern matching is a good fit, how to solve some common problems, and some common pitfalls when writing patterns.
TRANSCRIPT
![Page 1: Regex Basics](https://reader036.vdocument.in/reader036/viewer/2022081505/55593321d8b42a4f3d8b4a03/html5/thumbnails/1.jpg)
Regular Expression Basics
PHPNW 2008
Ciarán Walsh
![Page 2: Regex Basics](https://reader036.vdocument.in/reader036/viewer/2022081505/55593321d8b42a4f3d8b4a03/html5/thumbnails/2.jpg)
What are regular expressions?
•Regular expressions allow matching and manipulation of textual data.
•Abbreviated as regex or regexp, or alternatively just “patterns”.
![Page 3: Regex Basics](https://reader036.vdocument.in/reader036/viewer/2022081505/55593321d8b42a4f3d8b4a03/html5/thumbnails/3.jpg)
Literals
bus Matches a ‘b’, followed by a ‘u’, followed by an ‘s’
![Page 4: Regex Basics](https://reader036.vdocument.in/reader036/viewer/2022081505/55593321d8b42a4f3d8b4a03/html5/thumbnails/4.jpg)
Anchors
^ Matches at the beginning of a line
$ Matches at the end of a line
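A minimal sketch of how the anchors behave in PHP's preg functions. The two-line subject string is made up for illustration. By default the anchors apply to the subject as a whole; the `m` modifier (an addition not shown on the slide) makes them apply to each line.

```php
<?php
// Hypothetical two-line subject string, for illustration only.
$text = "first line\nsecond line";

// Without /m, ^ and $ anchor to the subject as a whole.
$startDefault = preg_match('/^second/', $text);      // 0
$endDefault   = preg_match('/first line$/', $text);  // 0

// With /m, ^ and $ also match next to each line break.
$startMulti = preg_match('/^second/m', $text);       // 1
$endMulti   = preg_match('/first line$/m', $text);   // 1

var_dump($startDefault, $endDefault, $startMulti, $endMulti);
```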
![Page 5: Regex Basics](https://reader036.vdocument.in/reader036/viewer/2022081505/55593321d8b42a4f3d8b4a03/html5/thumbnails/5.jpg)
Character Classes
[abc] Matches one of ‘a’, ‘b’ or ‘c’
[a-c] Same as above (character range)
[^abc] Matches one character that is not listed
. Matches any single character
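The character classes above can be tried out with a few short checks. This is only a sketch with made-up subject strings. Note that in PHP's default mode `.` does not match a newline unless the `s` modifier is added.

```php
<?php
// Each call returns 1 on a match and 0 otherwise.
$listed     = preg_match('/[abc]/', 'xyzb');   // 1: 'b' is in the class
$range      = preg_match('/[a-c]/', 'xyzb');   // 1: same class written as a range
$negated    = preg_match('/[^abc]/', 'abc');   // 0: every character is listed
$dot        = preg_match('/b.s/', 'bus');      // 1: '.' matches 'u'
$dotNewline = preg_match('/b.s/', "b\ns");     // 0: '.' skips newlines by default
$dotAll     = preg_match('/b.s/s', "b\ns");    // 1: the s modifier lifts that limit

var_dump($listed, $range, $negated, $dot, $dotNewline, $dotAll);
```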
![Page 6: Regex Basics](https://reader036.vdocument.in/reader036/viewer/2022081505/55593321d8b42a4f3d8b4a03/html5/thumbnails/6.jpg)
Alternation
a|b Matches one of ‘a’ or ‘b’
dog|cat Matches one of “dog” or “cat”
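A small illustration of alternation, using invented subject strings. The match is the leftmost one in the subject, whichever alternative it comes from.

```php
<?php
$found   = preg_match('/dog|cat/', 'my cat');   // 1
$missing = preg_match('/dog|cat/', 'my bird');  // 0

// Both words are present; the match that starts first in the subject wins.
preg_match('/dog|cat/', 'cat and dog', $m);
$first = $m[0];                                  // 'cat'

var_dump($found, $missing, $first);
```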
![Page 7: Regex Basics](https://reader036.vdocument.in/reader036/viewer/2022081505/55593321d8b42a4f3d8b4a03/html5/thumbnails/7.jpg)
Quantifiers (repetition)
{x,y} Matches a minimum of x and a maximum of y occurrences; y can be omitted ({x,}), while omitting x ({,y}) is only understood by recent PCRE versions
* Matches zero or more occurrences (any amount). Same as {0,}
+ Matches one or more occurrences. Same as {1,}
? Matches zero or one occurrence. Same as {0,1}
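The quantifiers can be compared side by side. The sketch below anchors every pattern with ^ and $ so that the counts are checked strictly; the subject strings are made up.

```php
<?php
// {2,3}: between two and three 'b's.
$tooFew  = preg_match('/^ab{2,3}c$/', 'abc');     // 0
$inRange = preg_match('/^ab{2,3}c$/', 'abbc');    // 1
$tooMany = preg_match('/^ab{2,3}c$/', 'abbbbc');  // 0

// *, + and ? applied to the same character.
$starZero     = preg_match('/^ab*c$/', 'ac');     // 1: zero occurrences allowed
$plusZero     = preg_match('/^ab+c$/', 'ac');     // 0: at least one required
$questionOne  = preg_match('/^ab?c$/', 'abc');    // 1
$questionTwo  = preg_match('/^ab?c$/', 'abbc');   // 0: at most one allowed

var_dump($tooFew, $inRange, $tooMany, $starZero, $plusZero, $questionOne, $questionTwo);
```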
![Page 8: Regex Basics](https://reader036.vdocument.in/reader036/viewer/2022081505/55593321d8b42a4f3d8b4a03/html5/thumbnails/8.jpg)
Grouping
(…)
Groups the contents of the parentheses.
Affects alternation and quantifiers.
Allows parts of the match to be captured.
for|backward “for” or “backward”
(for|back)ward “forward” or “backward”
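To see how grouping changes what the alternation covers, here is a short sketch that runs both patterns from the slide against the word "forward".

```php
<?php
// Ungrouped: the alternatives are "for" and "backward".
preg_match('/for|backward/', 'forward', $plain);
$plainMatch = $plain[0];        // 'for'

// Grouped: the alternation only covers the prefix, and it is captured.
preg_match('/(for|back)ward/', 'forward', $grouped);
$wholeMatch = $grouped[0];      // 'forward'
$captured   = $grouped[1];      // 'for'

var_dump($plainMatch, $wholeMatch, $captured);
```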
![Page 9: Regex Basics](https://reader036.vdocument.in/reader036/viewer/2022081505/55593321d8b42a4f3d8b4a03/html5/thumbnails/9.jpg)
Delimiters
/pattern/modifiers
/i Makes match case-insensitive
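A brief sketch of the effect of the `i` modifier, placed after the closing delimiter. The subject string is invented.

```php
<?php
$caseSensitive   = preg_match('/bus/', 'BUS stop');   // 0
$caseInsensitive = preg_match('/bus/i', 'BUS stop');  // 1

var_dump($caseSensitive, $caseInsensitive);
```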
![Page 10: Regex Basics](https://reader036.vdocument.in/reader036/viewer/2022081505/55593321d8b42a4f3d8b4a03/html5/thumbnails/10.jpg)
Performing a Match
preg_match('/Te(.)f?/i','text',$matches);
•Returns number of matches (0 or 1), or false on error
•$matches will contain captured groups
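Running the call from the slide shows what ends up in `$matches`: index 0 holds the whole match and index 1 the first captured group. The second subject string ('toast') is an invented non-matching example.

```php
<?php
$count = preg_match('/Te(.)f?/i', 'text', $matches);
// $count is 1; $matches is ['tex', 'x'] because f? matched zero times.

$noMatch = preg_match('/Te(.)f?/i', 'toast', $none);
// $noMatch is 0 and $none is an empty array.

var_dump($count, $matches, $noMatch, $none);
```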
![Page 11: Regex Basics](https://reader036.vdocument.in/reader036/viewer/2022081505/55593321d8b42a4f3d8b4a03/html5/thumbnails/11.jpg)
Performing a Replacement
preg_replace('/some(text)/','\1',$text)
•Returns string after replacement
•Can use backreferences \0 through \9 in the replacement (\0 is the whole match)
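A minimal sketch of the replacement call with an invented input string. It also shows the `$1` form, which PHP accepts as an equivalent of `\1`.

```php
<?php
$text = 'this is sometext here';

// \1 refers to the first captured group, so "some" is dropped.
$kept = preg_replace('/some(text)/', '\1', $text);        // 'this is text here'

// $1 is an equivalent spelling of the same backreference.
$keptDollar = preg_replace('/some(text)/', '$1', $text);  // 'this is text here'

// \0 refers to the whole match.
$wrapped = preg_replace('/some(text)/', '[\0]', $text);   // 'this is [sometext] here'

var_dump($kept, $keptDollar, $wrapped);
```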
![Page 12: Regex Basics](https://reader036.vdocument.in/reader036/viewer/2022081505/55593321d8b42a4f3d8b4a03/html5/thumbnails/12.jpg)
(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:
(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\
]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n
)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)
*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0
31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\
](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\
\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\"
.\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?
:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(
?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]
))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]
+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:
\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?
:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:
\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?
:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]
|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@
,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[
\t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[
\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(
?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|
\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\
.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|
\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[
\t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\
031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*))*)?;\s*)
Don’t Use Regular Expressions!
Don’t Abuse Regular Expressions!
Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.
- Jamie Zawinski
![Page 13: Regex Basics](https://reader036.vdocument.in/reader036/viewer/2022081505/55593321d8b42a4f3d8b4a03/html5/thumbnails/13.jpg)
Testing for a Substring
if (preg_match('/foo/', $var))
if (strpos($var, 'foo') !== false)
if (preg_match('/foo/i', $var))
if (stripos($var, 'foo') !== false)
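The strict `!== false` comparison in the string-function versions matters. This sketch, using a made-up value, shows why a loose truthiness check would give the wrong answer when the substring sits at position 0.

```php
<?php
$var = 'foobar';

$pos = strpos($var, 'foo');          // 0: found at the very start
$looseCheck  = (bool) $pos;          // false, which is the wrong answer
$strictCheck = ($pos !== false);     // true

$missing = strpos($var, 'xyz');      // false: not found
$anyCase = (stripos('FOObar', 'foo') !== false);  // true

var_dump($pos, $looseCheck, $strictCheck, $missing, $anyCase);
```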
![Page 14: Regex Basics](https://reader036.vdocument.in/reader036/viewer/2022081505/55593321d8b42a4f3d8b4a03/html5/thumbnails/14.jpg)
Validating an Integer
Regular Expression
if (preg_match('/^\d+$/', $value)) {
    // $value consists only of digits (a non-negative integer)
}
•Intention is not immediately obvious
•Not efficient
![Page 15: Regex Basics](https://reader036.vdocument.in/reader036/viewer/2022081505/55593321d8b42a4f3d8b4a03/html5/thumbnails/15.jpg)
Validating an Integer
ctype (Character Type)
if (ctype_digit($value)) {
    // $value consists only of digits (a non-negative integer)
}
• Native C library (fast)
• Makes the intention obvious
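A few sample inputs make the behaviour of `ctype_digit` concrete. This is a sketch that assumes the input is always a string (as form data would be); the sample values are invented.

```php
<?php
// ctype_digit() is true only for non-empty strings made up entirely of digits.
$plain    = ctype_digit('123');   // true
$zero     = ctype_digit('0');     // true
$empty    = ctype_digit('');      // false
$negative = ctype_digit('-3');    // false: '-' is not a digit
$decimal  = ctype_digit('12.5');  // false: '.' is not a digit
$padded   = ctype_digit(' 42');   // false: leading space

var_dump($plain, $zero, $empty, $negative, $decimal, $padded);
```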
![Page 16: Regex Basics](https://reader036.vdocument.in/reader036/viewer/2022081505/55593321d8b42a4f3d8b4a03/html5/thumbnails/16.jpg)
Validating an Integer
Casting
$casted_value = intval($value);
if ($casted_value > 0) {
    // $casted_value is a positive (non-zero) integer
}
•Intention is fairly clear
•Casting is safe practice
•Invalid values that do not start with a number result in zero
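The casting approach can be explored with a handful of invented inputs. One thing worth noticing in this sketch is that a string with leading digits is truncated rather than rejected, so casting sanitises the value but does not tell you whether the original input was well formed.

```php
<?php
$number   = intval('42');      // 42
$garbage  = intval('abc');     // 0: no leading digits
$empty    = intval('');        // 0
$mixed    = intval('12abc');   // 12: leading digits are kept
$negative = intval('-5');      // -5: fails a "> 0" check

$isPositive = ($mixed > 0);    // true, even though the input was not purely numeric

var_dump($number, $garbage, $empty, $mixed, $negative, $isPositive);
```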
![Page 17: Regex Basics](https://reader036.vdocument.in/reader036/viewer/2022081505/55593321d8b42a4f3d8b4a03/html5/thumbnails/17.jpg)
HTML Parsing
![Page 18: Regex Basics](https://reader036.vdocument.in/reader036/viewer/2022081505/55593321d8b42a4f3d8b4a03/html5/thumbnails/18.jpg)
Using Regular Expressions
![Page 19: Regex Basics](https://reader036.vdocument.in/reader036/viewer/2022081505/55593321d8b42a4f3d8b4a03/html5/thumbnails/19.jpg)
Using Regular Expressions
Postcodes
/[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][A-Z]{2}/
IP Addresses
Dates
@^(\d{1,2})/(\d{1,2})/(\d{4})$@
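A sketch of the two patterns from this slide in use. The second pattern matches slash-separated strings such as 15/08/2008; treating its groups as day, month and year is an assumption, as are the sample values. The pattern only checks the shape, so `checkdate()` is used afterwards to reject impossible dates.

```php
<?php
$postcodePattern = '/[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][A-Z]{2}/';
$datePattern     = '@^(\d{1,2})/(\d{1,2})/(\d{4})$@';

$isPostcode  = preg_match($postcodePattern, 'M1 1AA');  // 1
$notPostcode = preg_match($postcodePattern, 'hello');   // 0

// Hypothetical helper: shape check by regex, calendar check by checkdate().
function is_real_date(string $input, string $pattern): bool
{
    if (!preg_match($pattern, $input, $m)) {
        return false;
    }
    // Assumed group order: day, month, year.
    return checkdate((int) $m[2], (int) $m[1], (int) $m[3]);
}

$goodDate = is_real_date('15/08/2008', $datePattern);   // true
$badDate  = is_real_date('31/02/2008', $datePattern);   // false: matches the pattern, but no such day

var_dump($isPostcode, $notPostcode, $goodDate, $badDate);
```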
![Page 20: Regex Basics](https://reader036.vdocument.in/reader036/viewer/2022081505/55593321d8b42a4f3d8b4a03/html5/thumbnails/20.jpg)
Constructing Patterns
Writing patterns is a balance between matching what you do want and not matching what you don’t want.
![Page 21: Regex Basics](https://reader036.vdocument.in/reader036/viewer/2022081505/55593321d8b42a4f3d8b4a03/html5/thumbnails/21.jpg)
preg_match('/<b><s>.+<\/s>.+<\/b>/', $html)
preg_match('@<b><s>.+</s>.+</b>@', $html)
You don’t need to use /…/ to denote a pattern!
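The same pattern written with three different delimiters behaves identically; only the escaping changes. The HTML fragment below is invented, and the brace-delimited form is an extra variant not shown on the slide.

```php
<?php
$html = '<b><s>struck</s> bold</b>';

// With / as the delimiter, every literal slash needs a backslash.
$withSlashes = preg_match('/<b><s>.+<\/s>.+<\/b>/', $html);  // 1

// With @ as the delimiter, the slashes can stay as they are.
$withAt = preg_match('@<b><s>.+</s>.+</b>@', $html);         // 1

// Bracket-style delimiters such as {…} are accepted too.
$withBraces = preg_match('{<b><s>.+</s>.+</b>}', $html);     // 1

var_dump($withSlashes, $withAt, $withBraces);
```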
![Page 22: Regex Basics](https://reader036.vdocument.in/reader036/viewer/2022081505/55593321d8b42a4f3d8b4a03/html5/thumbnails/22.jpg)
Greediness
$html = <<<HTML
<span>some text</span><span>some more text!</span>
HTML;
preg_match("@<span>(.+)</span>@", $html, $matches);
echo $matches[0];
preg_match("@<span>(.+?)</span>@", $html, $matches);
echo $matches[0];
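The difference between the greedy and the lazy quantifier shows up clearly in the captured group. This sketch reuses the markup from the slide and adds `preg_match_all` (not on the slide) to collect every span.

```php
<?php
$html = '<span>some text</span><span>some more text!</span>';

// Greedy: .+ runs to the last </span> it can reach.
preg_match('@<span>(.+)</span>@', $html, $greedy);
$greedyCapture = $greedy[1];  // 'some text</span><span>some more text!'

// Lazy: .+? stops at the first </span>.
preg_match('@<span>(.+?)</span>@', $html, $lazy);
$lazyCapture = $lazy[1];      // 'some text'

// Collecting every span with the lazy form.
preg_match_all('@<span>(.+?)</span>@', $html, $all);
$allCaptures = $all[1];       // ['some text', 'some more text!']

var_dump($greedyCapture, $lazyCapture, $allCaptures);
```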
![Page 23: Regex Basics](https://reader036.vdocument.in/reader036/viewer/2022081505/55593321d8b42a4f3d8b4a03/html5/thumbnails/23.jpg)
preg_match('`^(\w+)://(?:(.+?):(.+?)@)?(.+?)\.(\w+)$`', $s, $matches)
preg_match('`
    ^
    (\w+)://        # Protocol
    (?:
        (.+?)       # Username
        :           # :
        (.+?)       # Password
        @           # @
    )?              # Username/password are optional
    (.+?)           # Hostname
    \.(\w+)         # Top-level domain
    $
`x', $s, $matches);
You can make your pattern readable!
![Page 24: Regex Basics](https://reader036.vdocument.in/reader036/viewer/2022081505/55593321d8b42a4f3d8b4a03/html5/thumbnails/24.jpg)
Extracting Captures
preg_match('`^
    (?P<protocol>\w+)://
    (?:
        (?P<user>.+?) :
        (?P<pass>.+?) @
    )?
    (?P<host>.+?)
    \.(?P<tld>\w+)
$`x', $s, $matches);
Array
(
    [0] => http://foo:bar@baz.example.com
    [protocol] => http
    [1] => http
    [user] => foo
    [2] => foo
    [pass] => bar
    [3] => bar
    [host] => baz.example
    [4] => baz.example
    [tld] => com
    [5] => com
)
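A runnable sketch of the named-capture pattern. The first URL is the one implied by the captured values on the slide; the second, without credentials, is an invented example showing that the optional groups come back as empty strings.

```php
<?php
// The x modifier lets the pattern be laid out over several lines.
$pattern = '`^
    (?P<protocol>\w+)://
    (?:
        (?P<user>.+?) :
        (?P<pass>.+?) @
    )?
    (?P<host>.+?)
    \.(?P<tld>\w+)
$`x';

preg_match($pattern, 'http://foo:bar@baz.example.com', $full);
$user = $full['user'];   // 'foo'
$host = $full['host'];   // 'baz.example'
$tld  = $full['tld'];    // 'com'

// Without credentials the optional groups are present but empty.
preg_match($pattern, 'http://example.com', $bare);
$bareUser = $bare['user'];  // ''
$bareHost = $bare['host'];  // 'example'

var_dump($user, $host, $tld, $bareUser, $bareHost);
```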
![Page 25: Regex Basics](https://reader036.vdocument.in/reader036/viewer/2022081505/55593321d8b42a4f3d8b4a03/html5/thumbnails/25.jpg)
Variable Data
$value = preg_quote($value, '!');
if (preg_match("!>$value</(?:div|span)>!", $text))
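Why the quoting step matters can be seen with a value that contains pattern metacharacters. The value and the markup in this sketch are invented: without `preg_quote` the dot and the parentheses are interpreted as pattern syntax, so unrelated text matches.

```php
<?php
$value = 'a.b (c)';                 // hypothetical user-supplied value
$other = '<div>axb c</div>';        // does NOT contain the literal value
$exact = '<div>a.b (c)</div>';      // does contain it

// Unquoted: '.' matches any character and '(c)' becomes a group.
$unquotedHit = preg_match("!>$value</(?:div|span)>!", $other);   // 1 (false positive)

// Quoted: every metacharacter, and the ! delimiter, is escaped.
$quoted     = preg_quote($value, '!');
$quotedMiss = preg_match("!>$quoted</(?:div|span)>!", $other);   // 0
$quotedHit  = preg_match("!>$quoted</(?:div|span)>!", $exact);   // 1

var_dump($unquotedHit, $quotedMiss, $quotedHit);
```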
![Page 26: Regex Basics](https://reader036.vdocument.in/reader036/viewer/2022081505/55593321d8b42a4f3d8b4a03/html5/thumbnails/26.jpg)
Performing Logic on Replacements
// The /e modifier was deprecated in PHP 5.5 and removed in PHP 7.0; use preg_replace_callback instead
preg_replace('/\w+/e', 'strtoupper("\0")', 'foo bar baz')
function upper_case_match($matches) {
    return strtoupper($matches[0]);
}
preg_replace_callback('/\w+/', 'upper_case_match', 'foo bar baz')
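The callback approach also works with an anonymous function, which avoids defining a named helper. This is a sketch of that variant rather than the form used on the slide; the second call is an invented extra showing different logic per match.

```php
<?php
// The callback receives the match array; index 0 is the whole match.
$upper = preg_replace_callback(
    '/\w+/',
    function (array $matches) {
        return strtoupper($matches[0]);
    },
    'foo bar baz'
);  // 'FOO BAR BAZ'

// Any logic can go in the callback, for example capitalising each word.
$capitalised = preg_replace_callback(
    '/\w+/',
    function (array $matches) {
        return ucfirst($matches[0]);
    },
    'foo bar baz'
);  // 'Foo Bar Baz'

var_dump($upper, $capitalised);
```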
![Page 27: Regex Basics](https://reader036.vdocument.in/reader036/viewer/2022081505/55593321d8b42a4f3d8b4a03/html5/thumbnails/27.jpg)
Testing Tools
•RegexBuddy
•Reggy
•http://rubular.com
![Page 28: Regex Basics](https://reader036.vdocument.in/reader036/viewer/2022081505/55593321d8b42a4f3d8b4a03/html5/thumbnails/28.jpg)
Any Questions?