Copyright © 2019 Oracle and/or its affiliates. All rights reserved.
Regular Expressions with full Unicode support
Martin Hansson
Software Development
MySQL Optimizer Team
The ins and outs of the new regular expression functions and the ICU library
Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for
information purposes only, and may not be incorporated into any contract. It is not a
commitment to deliver any material, code, or functionality, and should not be relied
upon in making purchasing decisions. The development, release, and timing of any
features or functionality described for Oracle’s products remains at the sole discretion of
Oracle.
What Happened?
Old regexp library (Henry Spencer)
• Does not support Unicode
• Limited Features
• No resource control
• Only Boolean Search
https://mysqlserverteam.com/new-regular-expression-functions-in-mysql-8-0/
Not some niche feature
Feature Requests for Extracting Substring:
Bug#79428 No way to extract a substring matching a regex
Bug#29781 Adding in Pattern Replace (RegExp) for MySQL Engine
Bug#16357 add in functions to do regular expression replacements in a select query
Bug#9105 Regular expression support for Search & Replace
51 “affects me” total
CTE had 59 “affects me”
New Regular Expression Functions
REGEXP_INSTR
REGEXP_LIKE
REGEXP_REPLACE
REGEXP_SUBSTR
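As a rough orientation, the four functions can be mimicked with Python's `re` module. This is only an analogy: Python's engine is not ICU, the helper names are made up for illustration, and Python matches case-sensitively by default. The return conventions (1-based position, 0 for no match, NULL for no substring) follow the MySQL 8.0 documentation.

```python
import re

def regexp_like(subject, pattern):
    """True if the pattern matches anywhere in the subject."""
    return re.search(pattern, subject) is not None

def regexp_instr(subject, pattern):
    """1-based position of the first match, or 0 if there is none."""
    m = re.search(pattern, subject)
    return m.start() + 1 if m else 0

def regexp_substr(subject, pattern):
    """The first matching substring, or None (NULL in SQL) if there is none."""
    m = re.search(pattern, subject)
    return m.group(0) if m else None

def regexp_replace(subject, pattern, replacement):
    """Replace every match of the pattern with the replacement text."""
    return re.sub(pattern, replacement, subject)
```

For example, `regexp_instr('dog cat dog', 'cat')` gives 5 and `regexp_substr('abc def', '[a-z]+')` gives `'abc'`.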
Program Agenda
Security
ICU library
Unicode
Working with Unicode in Regular Expressions
Two Security Concerns
• Memory
• Runtime
Security
Cap on runtime
mysql> SELECT regexp_instr( 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC', '(A+)+B');
ERROR 3699 (HY000): Timeout exceeded in regular expression match.
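Why a pattern like `(A+)+B` needs a runtime cap: when no `B` follows, a backtracking engine may try every way of cutting the run of A's into groups before giving up. The sketch below is a simplified model that only counts those ways; it is not how ICU or MySQL measure work, and `count_splits` is a hypothetical helper. In MySQL 8.0 the cap itself is controlled by the `regexp_time_limit` system variable.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_splits(n: int) -> int:
    """Number of ways (A+)+ can cut a run of n 'A's into consecutive non-empty groups."""
    if n == 0:
        return 1  # nothing left to cut: exactly one (empty) way
    # The first group takes k characters; the remainder is cut recursively.
    return sum(count_splits(n - k) for k in range(1, n + 1))

# The count doubles with every extra 'A': 2 ** (n - 1) for n >= 1.
growth = [count_splits(n) for n in (5, 10, 20, 30)]
```

Under this model a run of 30 A's already allows 2 ** 29 different groupings, which is why the server stops the match instead of finishing it.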
Security
Cap on memory
mysql> SELECT regexp_instr( '', '(((((((){120}){11}){11}){11}){80}){11}){4}' );
ERROR 3699 (HY000): Timeout exceeded in regular expression match.
mysql> SET GLOBAL regexp_stack_limit = 239;
mysql> SELECT regexp_instr( '', '(((((((){120}){11}){11}){11}){80}){11}){4}' );
ERROR 3698 (HY000): Overflow in the regular expression backtrack stack.
ICU library
Building ICU
Need three libraries
• i18n library
– Regular expressions
– Character sets
• Common library
• Data Library
UTF-32
ab🍺d
0x00000061 0x00000062 0x0001f37a 0x00000064
UTF-8
ab🍺d
0x61 0x62 0xF09F8DBA 0x64
UTF-16
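ab🍺d
0x0061 0x0062 0x3CD87ADF 0x0064
(0x3CD87ADF is the surrogate pair 0xD83C 0xDF7A with the bytes of each unit in little-endian order)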
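The three encodings can be compared side by side in a few lines of Python. The sample string is assumed to be a, b, U+1F37A (beer mug), d, based on the slide labels; `code_units` is a helper written for this illustration. Big-endian output is used so each code unit reads as its numeric value.

```python
text = "ab\U0001F37Ad"  # a, b, BEER MUG (U+1F37A), d

def code_units(s, encoding, width):
    """Split the encoded bytes into code units of `width` bytes, as hex strings."""
    raw = s.encode(encoding)
    return [raw[i:i + width].hex() for i in range(0, len(raw), width)]

utf32 = code_units(text, "utf-32-be", 4)  # one 4-byte unit per code point
utf16 = code_units(text, "utf-16-be", 2)  # the emoji needs two units (a surrogate pair)
utf8 = [ch.encode("utf-8").hex() for ch in text]  # grouped per character; the emoji needs 4 bytes
```

Only UTF-32 has one unit per character; in UTF-16 and UTF-8 the position of `d` in units differs from its position in characters.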
Under the Hood
• Count codepoints
• Convert to UTF-16
• Use the C API
• Convert back if needed
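One consequence of the steps above, sketched under the assumption that the regex library reports match positions in UTF-16 code units: an offset has to be mapped back to a character position before it is returned to the user. The function names are hypothetical and this is not the server's actual code.

```python
def utf16_length(s):
    """Number of UTF-16 code units needed to hold s."""
    return len(s.encode("utf-16-le")) // 2

def code_unit_to_code_point_index(s, unit_index):
    """Map a 0-based offset in UTF-16 code units to a 0-based offset in code points."""
    units = 0
    for cp_index, ch in enumerate(s):
        if units >= unit_index:
            return cp_index
        # Characters outside the Basic Multilingual Plane occupy two code units.
        units += 2 if ord(ch) > 0xFFFF else 1
    return len(s)

sample = "ab\U0001F37Ad"
```

In `sample`, the letter `d` sits at code unit offset 4 but at character offset 3, because the emoji takes two code units.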
Case folding
Simple case sensitivity
mysql> SELECT regexp_like( 'a', '(?i)A' ); # mode modifier
→ 1
mysql> SELECT regexp_like( 'a', 'A', 'i' ); # match_parameter
→ 1
mysql> SELECT regexp_like( 'a' COLLATE utf8mb4_0900_as_cs, 'A' ); # collation
→ 0
Case folding
Simple case sensitivity
mysql> SELECT regexp_like( 'Abc', 'abC', 'c' );
→ 0
mysql> SELECT regexp_like( 'Abc', 'abC', 'i' );
→ 1
Case folding
Case-mapping process
A → a B → b C → c
Case folding
Full Case Folding
ß → ss
mysql> SELECT regexp_like( 'ß', '^ss$', 'c' );
→ 0
mysql> SELECT regexp_like( 'ß', '^ss$', 'i' );
→ 1
Case folding
Full Case Folding
ᾛ ⇒ ἣι
U+1F9B GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI
U+1F23 GREEK SMALL LETTER ETA WITH DASIA AND VARIA
U+03B9 GREEK SMALL LETTER IOTA
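Full case folding can be observed outside the server as well. Python's `str.casefold()` applies the Unicode full case folding mappings, so it serves as a stand-in here; it is not the ICU code path MySQL uses.

```python
sharp_s = "\u00df"  # ß  LATIN SMALL LETTER SHARP S
eta = "\u1f9b"      # ᾛ  GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI

folded_sharp_s = sharp_s.casefold()  # one character folds to two: "ss"
folded_eta = eta.casefold()          # one character folds to two: U+1F23 U+03B9

# A case-insensitive comparison has to compare the folded forms, not single characters.
matches_ss = folded_sharp_s == "SS".casefold()
```

Both inputs are one character long, and both folded results are two characters long, which is what distinguishes full case folding from a simple one-to-one mapping such as A → a.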
Case folding
Has to Look Like a String in Order to Match
mysql> SELECT regexp_like( 'ß', '^ss$' );
→ 1
mysql> SELECT regexp_like( 'ß', '^s+$' );
→ 0
mysql> SELECT regexp_like( 'ß', '^s{2}$' );
→ 0
Case folding
Can't Start a Match Within an Expanded Character
mysql> SELECT regexp_like( 'ß', 's$' );
→ 0
mysql> SELECT regexp_like( 'ß', '^s' );
→ 0
Case folding
Collations
mysql> select 'ß' collate utf8mb4_de_pb_0900_ai_ci = 'ss'\G
*************************** 1. row ***************************
'ß' collate utf8mb4_de_pb_0900_ai_ci = 'ss': 1
mysql> select 'ß' collate utf8mb4_de_pb_0900_as_cs = 'ss'\G
*************************** 1. row ***************************
'ß' collate utf8mb4_de_pb_0900_as_cs = 'ss': 0
Case folding
Language Dependent Case Folding
mysql> SELECT regexp_like( 'I', 'i' );
→ 1
mysql> SELECT regexp_like( 'İ', 'i' );
→ 0
mysql> SELECT regexp_like( 'I', 'ı' );
→ 0
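The same three comparisons, sketched with Python's default (language-independent) case folding. Python is used only as an illustration of the Unicode default rules; the Turkish-specific mappings that would pair İ with i and I with ı are not applied. `same_ignoring_case` is a hypothetical helper.

```python
capital_i = "I"              # U+0049
dotted_capital_i = "\u0130"  # İ  LATIN CAPITAL LETTER I WITH DOT ABOVE
dotless_small_i = "\u0131"   # ı  LATIN SMALL LETTER DOTLESS I

def same_ignoring_case(a, b):
    """Compare two strings after default Unicode case folding."""
    return a.casefold() == b.casefold()

plain_pair = same_ignoring_case(capital_i, "i")          # I vs i
dotted_pair = same_ignoring_case(dotted_capital_i, "i")  # İ vs i
dotless_pair = same_ignoring_case(capital_i, dotless_small_i)  # I vs ı
```

Only the plain pair compares equal, matching the 1, 0, 0 results in the queries above.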
Beware of Conversion!
mysql> set names latin1;
mysql> create table t1 ( a char ( 10 ) );
mysql> insert into t1 values ( 'å' );
mysql> select a from t1\G
*************************** 1. row ***************************
a: å
mysql> select regexp_like( a, 'å' ) from t1\G
*************************** 1. row ***************************
regexp_like( a, 'å' ): 1
Beware of Conversion!
Use Hex Codes!
mysql> select hex( a ) from t1;
+----------+
| hex( a ) |
+----------+
| C383C2A5 |
+----------+
å is encoded as:
Latin-1: 0xe5
UTF-8: 0xc3a5
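Conversion flow
• Terminal (UTF-8) sends å as c3a5
• Server reads the bytes as Latin-1 and converts Latin-1 → UTF-8 for the UTF-8 table: C383C2A5 = Ã¥
• Server converts UTF-8 → Latin-1 on the way back: c3a5, which the UTF-8 terminal shows as å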
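The round trip can be replayed byte by byte in Python. This is a sketch of the flow as described on the slides (UTF-8 terminal, connection declared as Latin-1, UTF-8 table); the variable names are illustrative.

```python
original = "å"

# The UTF-8 terminal sends two bytes: c3 a5.
sent_by_terminal = original.encode("utf-8")

# The connection was declared Latin-1, so the server reads two characters: 'Ã' and '¥'.
misread_by_server = sent_by_terminal.decode("latin-1")

# The table is UTF-8, so those two characters are stored as four bytes.
stored_in_table = misread_by_server.encode("utf-8")

# On SELECT the server converts back to Latin-1, and the terminal decodes as UTF-8.
returned_to_client = stored_in_table.decode("utf-8").encode("latin-1")
displayed = returned_to_client.decode("utf-8")
```

The value looks correct on screen (`displayed` is `å`) while the table holds `C383C2A5`, which is why checking `hex( a )` reveals the problem and a plain `SELECT` does not.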
Power Tip
Use Hex Codes and Character set Introducers!
mysql> set global character_set_client = utf8mb4;
mysql> select _utf8mb4 0xc3a5, _latin1 0xe5;
+-----------------+--------------+
| _utf8mb4 0xc3a5 | _latin1 0xe5 |
+-----------------+--------------+
| å               | å            |
+-----------------+--------------+
mysql> set global character_set_client = latin1;
mysql> select _utf8mb4 0xc3a5, _latin1 0xe5;
+-----------------+--------------+
| _utf8mb4 0xc3a5 | _latin1 0xe5 |
+-----------------+--------------+
| å               | å            |
+-----------------+--------------+
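The idea behind an introducer, sketched in Python as an analogy: the bytes are written out explicitly and the encoding is named right next to them, so nothing depends on how the client or terminal is configured.

```python
# Same character, two different byte sequences, each tagged with its own encoding.
from_utf8 = bytes.fromhex("c3a5").decode("utf-8")   # like _utf8mb4 0xc3a5
from_latin1 = bytes.fromhex("e5").decode("latin-1")  # like _latin1 0xe5

same_character = from_utf8 == from_latin1
```

Both expressions yield the single character å (U+00E5), mirroring the two identical result rows above.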
Questions?