regular expressions: javascript and beyond

Post on 03-Jul-2015

Category:

Technology

DESCRIPTION

Regular expressions are a powerful tool for text and data processing. What kind of support do browsers provide for them? What are the little misconceptions that prevent people from using regular expressions effectively? The talk gives an overview of regular expression syntax and typical usage examples.

TRANSCRIPT

Regular Expressions: JavaScript And Beyond

Max Shirshin, Frontend Team Lead

deltamethod

Introduction

Types of regular expressions

• POSIX (BRE, ERE)

• PCRE = Perl-Compatible Regular Expressions

From the JavaScript language specification:

"The form and functionality of regular expressions is modelled after the regular expression facility in the Perl 5 programming language".

JS syntax (overview only)

var re = /^foo/;

// boolean
re.test('string');

// null or Array
re.exec('string');
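A minimal sketch of what the two methods return. The sample strings are made up for illustration.

```javascript
// Sketch: test() answers yes/no, exec() returns match details or null.
const re = /^foo/;

const hit = re.test('foobar');      // true: the string starts with "foo"
const miss = re.test('a foo');      // false: "foo" is not at the start

const found = re.exec('foobar');    // array with the match and its position
const notFound = re.exec('a foo');  // null when nothing matches

console.log(hit, miss);             // true false
console.log(found[0], found.index); // foo 0
console.log(notFound);              // null
```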

Regular expressions consist of...

● Tokens
  - common characters
  - special characters (metacharacters)

● Operations
  - quantification
  - enumeration
  - grouping

Tokens and metacharacters

Any character

/./.test('foo'); // true

/./.test('\r\n'); // false

What you need instead:

/[\s\S]/ for JavaScript, or
/./s (works in Perl/PCRE; not in JS at the time of this talk, the s flag was added later in ES2018)
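A short sketch of the difference on a multi-line string. The sample text is an assumption chosen for illustration.

```javascript
// Sketch: the dot skips line terminators, [\s\S] does not.
const text = 'first line\nsecond line';

const dotMatch = /first.*second/.test(text);       // "." cannot cross \n
const anyMatch = /first[\s\S]*second/.test(text);  // [\s\S] matches \n too

console.log(dotMatch, anyMatch); // false true
```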

String boundaries

>>> /^something$/.test('something')
true

>>> /^something$/.test('something\nbad')
false

>>> /^something$/m.test('something\nbad')
true
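The m flag is handy for picking one line out of a block of text. A small sketch, assuming a made-up key=value input:

```javascript
// Sketch: with the m flag, ^ and $ also match at line breaks.
const config = 'name=regex\nlang=js\nyear=2015';

const withoutM = /^lang=(\w+)$/.exec(config);  // ^ only matches at the string start
const withM = /^lang=(\w+)$/m.exec(config);    // ^ and $ match around the second line

console.log(withoutM);  // null
console.log(withM[1]);  // js
```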

Word boundaries

>>> /\ba/.test('alabama')
true

>>> /a\b/.test('alabama')
true

>>> /a\b/.test('naïve')
true

\B matches where there is NOT a word boundary:

/\Ba/.test('alabama');
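The 'naïve' case follows from \b being defined through \w, which covers only ASCII letters, digits and the underscore (without the Unicode-aware flags of newer engines). A sketch with sample words chosen for illustration:

```javascript
// Sketch: \b is defined in terms of \w, which is ASCII-only here.
const ascii = /\bcat\b/.test('concatenate');  // "cat" sits inside a word
const accented = /\bve\b/.test('na\u00efve'); // "ï" (\u00ef) is not a \w character

console.log(ascii, accented); // false true
```

So "ve" is reported as a separate word inside "naïve", which is rarely what you want for non-English text.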

Character classes

Whitespace

/\s/ (inverted version: /\S/)

FF:
\t \n \v \f \r \u0020 \u00a0 \u1680 \u180e \u2000 \u2001 \u2002 \u2003 \u2004 \u2005 \u2006 \u2007 \u2008 \u2009 \u200a \u2028 \u2029 \u202f \u205f \u3000

Chrome, IE 9:
as in FF plus \ufeff

IE 7, 8 :-(
only: \t \n \v \f \r \u0020

Alphanumeric characters

/\d/ ~ digits from 0 to 9

/\w/ ~ Latin letters, digits, underscore
Does not work for Cyrillic, Greek etc.

Inverted forms:
/\D/ ~ anything but digits
/\W/ ~ anything but Latin letters, digits and underscore

Custom character classes

Example:
/[abc123]/

Metacharacters and ranges supported:
/[A-F\d]/

More than one range is okay:
/[a-cG-M0-7]/

IMPORTANT: ranges come from Unicode, not from national alphabets!

"dot" means just dot!
/[.]/.test('anything') // false

adding \ ] -
/[\\\]-]/
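The warning about ranges coming from Unicode can be seen with Cyrillic letters: the letter "ё" (U+0451) lies outside the range "а-я" (U+0430 to U+044F). The sketch below writes the characters as \u escapes so it does not depend on file encoding; the sample words are assumptions for illustration.

```javascript
// Sketch: a range follows code point order, not alphabet order.
// \u0430-\u044f is the Cyrillic range "а-я"; \u0451 is "ё".
const lower = /^[\u0430-\u044f]+$/;

const plain = lower.test('\u043c\u0438\u0440');  // "мир": all letters inside the range
const withYo = lower.test('\u0451\u0436');       // "ёж": "ё" is outside the range
const fixed = /^[\u0430-\u044f\u0451]+$/.test('\u0451\u0436'); // "ё" listed explicitly

console.log(plain, withYo, fixed); // true false true
```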

Inverted character classes

anything except a, b, c:
/[^abc]/

^ as a character:
/[abc^]/

/[^]/ matches ANY character; could be a nice alternative to /[\s\S]/

Chrome, FF:
>>> /([^])/.exec('a');
['a', 'a']

IE:
>>> /([^])/.exec('a');
['a', '']

IE:
>>> /([\s\S])/.exec('a');
['a', 'a']

Quantifiers

Zero or more, one or more

/bo*/.test('b') // true

/.*/.test('') // true

/bo+/.test('b') // false

Zero or one

/colou?r/.test('color');
/colou?r/.test('colour');

How many?

/bo{7}/ exactly 7

/bo{2,5}/ from 2 to 5 (in {x,y}, x must not be greater than y)

/bo{5,}/ 5 or more

This does not work in JS:
/b{,5}/.test('bbbbb')
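What happens with {,5} instead: in the default (non-`u`) mode, browsers and Node.js treat the braces as literal text. A sketch, assuming that mode:

```javascript
// Sketch: {,5} is not a quantifier in JS; it is matched as literal text.
const literal = /^b{,5}$/.test('b{,5}');  // the braces are taken literally
const noUpTo = /^b{,5}$/.test('bbb');     // so this is not "up to five b"
const upTo = /^b{0,5}$/.test('bbb');      // spell out the lower bound instead

console.log(literal, noUpTo, upTo); // true false true
```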

Greedy quantifiers

var r = /a+/.exec('aaaaa');

>>> r[0]
"aaaaa"

Lazy quantifiers

var r = /a+?/.exec('aaaaa');
>>> r[0]
"a"

r = /a*?/.exec('aaaaa');
>>> r[0]
""
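The difference matters most when a delimiter occurs more than once. A sketch on a made-up HTML fragment (regular expressions are used here only to illustrate greediness, not as a recommended way to parse HTML):

```javascript
// Sketch: greedy takes as much as possible, lazy as little as possible.
const html = '<b>one</b> and <b>two</b>';

const greedy = /<b>.*<\/b>/.exec(html)[0];  // runs to the LAST </b>
const lazy = /<b>.*?<\/b>/.exec(html)[0];   // stops at the FIRST </b>

console.log(greedy); // <b>one</b> and <b>two</b>
console.log(lazy);   // <b>one</b>
```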

Groups

capturing
/(boo)/.test("boo");

non-capturing
/(?:boo)/.test("boo");

Grouping and the RegExp constructor

var result = /(bo)o+(b)/.exec('the booooob');

>>> RegExp.$1
"bo"
>>> RegExp.$2
"b"
>>> RegExp.$9
""
>>> RegExp.$10
undefined
>>> RegExp.$0
undefined
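The same captures are also available on the array returned by exec(), which avoids relying on the shared RegExp.$1 to $9 properties (they are legacy and overwritten by every successful match). A minimal sketch:

```javascript
// Sketch: read captures from the result array instead of RegExp.$n.
const result = /(bo)o+(b)/.exec('the booooob');

const whole = result[0];   // the entire match
const first = result[1];   // capturing group 1
const second = result[2];  // capturing group 2
const at = result.index;   // where the match starts in the input

console.log(whole, first, second, at); // booooob bo b 4
```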

Numbering of capturing groups

/((foo) (b(a)r))/

$1   ((foo) (b(a)r))   foo bar
$2   (foo)             foo
$3   (b(a)r)           bar
$4   (a)               a

Lookahead

var r = /best(?= match)/.exec('best match');

>>> !!r
true

>>> r[0]
"best"

>>> /best(?! match)/.test('best match')
false
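A lookahead only checks what follows; the checked text never becomes part of the match. A sketch with a second, made-up input for the negative form:

```javascript
// Sketch: a lookahead checks what follows without consuming it.
const r = /best(?= match)/.exec('best match');
const matched = r[0];        // " match" is checked but not included
const length = r[0].length;

// Negative lookahead: "best" NOT followed by " match".
const neg = /best(?! match)/.exec('best effort, best match');
const negIndex = neg.index;  // only the first "best" qualifies

console.log(matched, length); // best 4
console.log(negIndex);        // 0
```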

Lookbehind

NOT supported in JavaScript at all at the time of this talk (lookbehind was added later, in ES2018)

/(?<=text)match/ positive lookbehind

/(?<!text)match/ negative lookbehind

Enumerations

Logical "or"

/red|green|blue light/
/(red|green|blue) light/

>>> /var a(;|$)/.test('var a')
true

Backreferences

/(red|green) apple is \1/.test('red apple is red') // true

/(red|green) apple is \1/.test('green apple is green') // true
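A typical use of a backreference is spotting an accidentally repeated word. A sketch; the pattern and the sample sentences are assumptions for illustration:

```javascript
// Sketch: \1 must repeat exactly what group 1 captured.
const doubledWord = /\b(\w+) \1\b/;

const found = doubledWord.exec('this is is a test');
const word = found[1];                            // the repeated word
const clean = doubledWord.test('this is a test'); // no word is repeated here

console.log(found[0]); // is is
console.log(word);     // is
console.log(clean);    // false
```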

Alternative character representations

Representing a character

\x09 === \t (not Unicode but ASCII/ANSI)
\u20AC === € (in Unicode)

backslash takes away special character meaning:

/\(\)/.test('()') // true
/\\n/.test('\\n') // true

...or vice versa!
/\f/.test('f') // false!

Flags

Regular expression flags

g i m s x y

g  global match
i  ignore case
m  multiline matching for ^ and $

JavaScript does NOT provide support for (at the time of this talk; the s flag was added later in ES2018):
s  string as single line
x  extend pattern

Mozilla-only, non-standard (at the time of this talk; the y flag was standardized in ES2015):
y  sticky
Match only from the .lastIndex index (a regexp instance property). Thus, ^ can match at a predefined position.
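The g flag changes how exec() behaves: each call resumes from the regexp's lastIndex property. A sketch that collects every match with its position; the input string is made up for illustration:

```javascript
// Sketch: with g, exec() resumes from lastIndex on every call.
const re = /\d+/g;
const input = 'a1 b22 c333';
const numbers = [];

let m;
while ((m = re.exec(input)) !== null) {
  numbers.push(m[0] + '@' + m.index); // matched text and its position
}

const caseInsensitive = /foo/i.test('FOO'); // i ignores letter case

console.log(numbers);         // [ '1@1', '22@4', '333@8' ]
console.log(caseInsensitive); // true
```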

Alternative syntax for flags

/(?i)foo/
/(?i-m)bar$/
/(?i-sm).x$/
/(?i)foo(?-i)bar/

Some implementations do NOT support flag switching on-the-go.

In JS, flags are set for the whole regexp instance and you can't change them.

RegExp in JavaScript

Methods

RegExp instances:

/regexp/.exec('string')
    null or array ['whole match', $1, $2, ...]

/regexp/.test('string')
    false or true

String instances:

'str'.match(/regexp/)
'str'.match('\\w{1,3}')
    - same as /regexp/.exec if no 'g' flag used;
    - array of all matches if 'g' flag used (internal capturing groups ignored)

'str'.search(/regexp/)
'str'.search('\\w{1,3}')
    first match index, or -1

'str'.replace(/old/, 'new');

WARNING: special magic supported in the replacement string:
    $$ inserts a dollar sign "$"
    $& substring that matches the regexp
    $` substring before $&
    $' substring after $&
    $1, $2, $3 etc.: string that matches n-th capturing group

'str'.replace(/(r)(e)gexp/g, function(matched, $1, $2, offset, sourceString) {
    // what should replace the matched part on this iteration?
    return 'replacement';
});
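A few of these forms in action. The sketch below uses made-up inputs to show the replacement patterns, a replacement function, and match() with the g flag:

```javascript
// Sketch: replacement patterns and a replacement function.
const swapped = 'John Smith'.replace(/(\w+) (\w+)/, '$2, $1'); // reorder groups
const priced = 'price: 10'.replace(/\d+/, '$$$&');             // "$$" is "$", "$&" is the match
const doubled = 'a1b2'.replace(/\d/g, function (matched) {
  return String(Number(matched) * 2);                          // computed per match
});
const allDigits = 'a1b2'.match(/\d/g);                         // every match, no groups

console.log(swapped);   // Smith, John
console.log(priced);    // price: $10
console.log(doubled);   // a2b4
console.log(allDigits); // [ '1', '2' ]
```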

RegExp injection

// BAD CODE
var re = new RegExp('^' + userInput + '$');
// ...
var userInput = '[abc]'; // oops!

// GOOD, DO IT AT HOME
RegExp.escape = function(text) {
    return text.replace(/[-[\]{}()*+?.,\\^$|#\s]/g, "\\$&");
};

var re = new RegExp('^' + RegExp.escape(userInput) + '$');
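To see what the escaping buys you, the sketch below compares the two regexps on the same input. It uses a standalone function with the hypothetical name escapeRegExp instead of extending RegExp, but the same escaping pattern as above.

```javascript
// Sketch: escapeRegExp is a hypothetical helper name, not a built-in.
function escapeRegExp(text) {
  // Prefix every regexp metacharacter with a backslash.
  return text.replace(/[-[\]{}()*+?.,\\^$|#\s]/g, '\\$&');
}

const userInput = '[abc]';
const unsafe = new RegExp('^' + userInput + '$');             // becomes /^[abc]$/
const safe = new RegExp('^' + escapeRegExp(userInput) + '$'); // becomes /^\[abc\]$/

console.log(unsafe.test('a'));     // true: input was treated as a character class
console.log(unsafe.test('[abc]')); // false
console.log(safe.test('[abc]'));   // true: input is matched literally
console.log(safe.test('a'));       // false
```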

Recommended reading

Online, just google it:
MDN Guide on Regular Expressions

The Book:
Mastering Regular Expressions
O'Reilly Media

Thank you!
