IBM Software Group

© 2003 IBM Corporation

Unicode Text and Regular Expression

Andy Heninger

9/9/2004

Overview

Regular expressions have long been used for:

Searching text data

Parsing, extracting fields

Text manipulation, find & replace

Regular Expressions and Unicode Text data are a good Match.

Regular Expression Languages have evolved new features to work more conveniently and powerfully with Unicode data.

This talk focuses on these Unicode-related features.

What Are Regular Expressions

Think of Wildcards

Select or match text

Available in editors, languages, tools, databases

Not the topic today

Literal text Matches itself

* Match 0 or more times

+ Match one or more times

[a-z] Character range. Match any one character in the range

(whatever) grouping
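
As a quick refresher, the operators listed above can be tried out in a few lines. This is a minimal sketch using Python's standard `re` module; the sample strings are made up for illustration.

```python
import re

# Literal text matches itself.
literal = re.search(r"cat", "concatenate")

# '*' matches the preceding item zero or more times.
star_zero = re.fullmatch(r"ab*", "a")
star_many = re.fullmatch(r"ab*", "abbb")

# '+' matches the preceding item one or more times.
plus_many = re.fullmatch(r"ab+", "abbb")
plus_zero = re.fullmatch(r"ab+", "a")      # None: at least one 'b' is required

# [a-z] matches any one character in the range.
in_range = re.fullmatch(r"[a-z]", "q")
out_of_range = re.fullmatch(r"[a-z]", "Q")  # None: 'Q' is not between a and z

# (...) groups a sub-pattern so that an operator applies to the whole group.
grouped = re.fullmatch(r"(ab)+", "ababab")
```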

Character Ranges

[a-z] Match any one character falling in the specified range

Relies on the existence of some ordering of characters, to determine what falls between a and z. Typically charset order.

Only works for English

No accented characters

No letters from other alphabets (Greek, Arabic, etc.)

Still widely used.
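
A small illustration of the limitation described above, assuming Python's standard `re` module. The sample words are made up; the accented and Greek letters are written as escapes so that the code points are unambiguous.

```python
import re

ascii_word = re.compile(r"[a-z]+")

english = ascii_word.fullmatch("resume")

# "résumé" with precomposed é (U+00E9): é lies outside the range a-z.
accented = ascii_word.fullmatch("r\u00e9sum\u00e9")

# "λόγος": Greek letters lie outside the range a-z as well.
greek = ascii_word.fullmatch("\u03bb\u03cc\u03b3\u03bf\u03c2")

# Searching instead of full matching silently splits the accented word.
pieces = ascii_word.findall("r\u00e9sum\u00e9")
```

The range still "works" in the sense that no error is raised; it simply drops the characters it does not know about.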

POSIX Character Classes

Remove dependency on charset ordering

Convenient, more likely to be correct than [a-z]

[:alnum:] [:cntrl:] [:lower:]

[:space:] [:alpha:] [:digit:]

[:xdigit:] [:print:] [:upper:]

[:blank:] [:graph:] [:punct:]

Implementers must provide definitions for different charsets

POSIX -> Unicode

Unicode has a very rich character property system

Unicode TR 18 defines POSIX classes in terms of properties

[:alpha:] Alphabetic = TRUE

[:digit:] General Category = Decimal Number

[:space:] White Space = TRUE

[:upper:] Uppercase = TRUE

Direct access to Unicode properties in character set expressions is a key feature for Unicode regular expressions.
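
To see property-backed classes in action, here is a minimal sketch in Python. Python's standard `re` module does not accept the `\p{...}` or `[:name:]` syntax, so this uses its built-in `\d` and `\s` shorthands, which for text patterns are defined in terms of Unicode character data. It illustrates the idea and is not a TR 18 implementation.

```python
import re

# ARABIC-INDIC DIGIT THREE, FOUR, FIVE: General Category = Decimal Number.
arabic_indic = "\u0663\u0664\u0665"
digits_match = re.fullmatch(r"\d+", arabic_indic)

# EM SPACE (U+2003) is white space but is not an ASCII character.
space_match = re.fullmatch(r"\s", "\u2003")

# With re.ASCII the same shorthands fall back to ASCII-only definitions.
ascii_digits_match = re.fullmatch(r"\d+", arabic_indic, re.ASCII)
```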

A Quick Look at the Unicode General Category

Central to Regular Expressions with Unicode Text

Categorizes every character as one of:

Letter

Number

Separator

Punctuation

Marks

Symbols

Others

Subcategories within each. Examples:

Letter: Uppercase, Lowercase, Other, …

Symbol: Math, Currency, Modifier, …

Mark: Spacing, Non-spacing, Enclosing
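
The General Category of any character can be inspected directly. A minimal sketch using Python's standard `unicodedata` module, which reports the two-letter category abbreviation (major category, then subcategory); the sample characters are chosen for illustration.

```python
import unicodedata

# Expected two-letter General Category values for a few sample characters.
samples = {
    "A": "Lu",       # Letter, uppercase
    "a": "Ll",       # Letter, lowercase
    "5": "Nd",       # Number, decimal digit
    " ": "Zs",       # Separator, space
    ",": "Po",       # Punctuation, other
    "+": "Sm",       # Symbol, math
    "$": "Sc",       # Symbol, currency
    "\u0303": "Mn",  # Mark, non-spacing (COMBINING TILDE)
}

# Look up what the Unicode database bundled with Python actually reports.
categories = {ch: unicodedata.category(ch) for ch in samples}
```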

Unicode Property Based Character Classes

TR 18 recommended properties for Basic Unicode support include:

General Category

Script

Alphabetic

Uppercase

Lowercase

White Space

Examples:

[:Script=Greek:] POSIX syntax

[\p{Script=Greek}] Perl syntax

[\p{Alphabetic}]

Set Operations

[^\p{Letter}] Negation

[\p{Letter}\p{Number}] Union

[\p{Letter}&\p{script=Cyrillic}] Intersection

[\p{Letter}-\p{Latin}] Difference

Important for a character set the size of Unicode.
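
Engines without dedicated intersection and difference operators can often emulate them with lookaheads. The sketch below assumes Python's standard `re` module, which supports negation and union inside a character class but has neither `\p{...}` nor set operators. The built-in classes `\w`, `\d`, and `\s` stand in for Unicode properties, so the sets are only rough approximations of `\p{Letter}` and friends.

```python
import re

# Sample text: ASCII letters, Cyrillic letters, digits, precomposed ñ (U+00F1).
text = "abc Привет 123 \u00f1"

# Negation: runs of characters that are not decimal digits.
non_digits = re.findall(r"[^\d]+", text)

# Union: runs of characters that are digits or white space.
digits_or_space = re.findall(r"[\d\s]+", text)

# Difference, emulated with a negative lookahead: word characters
# (excluding digits and underscore) minus the ASCII letters.
non_ascii_letters = re.findall(r"(?:(?![A-Za-z])[^\W\d_])+", text)

# Intersection, emulated with a positive lookahead: word characters
# that are also in the ASCII range.
ascii_word_chars = re.findall(r"(?:(?=[\x00-\x7f])\w)+", text)
```

The lookahead is applied per character, which is why it sits inside the repeated group rather than in front of it.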

Script and Block Properties

[\p{script=Thai}]

[\p{block=Thai}]

Unicode Script property: categorizes each character by script – Latin, Cyrillic, Arabic, etc.

Shared characters classified as “Common”. Numbers, punctuation, etc.

Not the same as Language.

Unicode Block property: categorizes each character by block – a contiguous range of code points.

Basic Latin, Latin-1 Supplement, Latin Extended A, Latin Extended B

Greek, Hebrew, and more.

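Has limitations: a script's characters may be spread across several blocks, and a block may contain unassigned code points or characters that are not letters of that script.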
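
Because a block is just a code point range, block membership says little about what a character is. A minimal sketch in Python, assuming the Thai block range U+0E00..U+0E7F; the helper `in_thai_block` is hypothetical and exists only for this illustration.

```python
import unicodedata

THAI_BLOCK = range(0x0E00, 0x0E80)  # Thai block: U+0E00..U+0E7F


def in_thai_block(ch: str) -> bool:
    """Block membership is a plain code point range test (hypothetical helper)."""
    return ord(ch) in THAI_BLOCK


ko_kai = "\u0e01"      # THAI CHARACTER KO KAI, a letter
baht = "\u0e3f"        # THAI CURRENCY SYMBOL BAHT, a currency symbol
unassigned = "\u0e00"  # inside the block, but no character is assigned here

# All three are "in the Thai block", yet their General Categories differ.
block_members = [in_thai_block(ch) for ch in (ko_kai, baht, unassigned)]
block_categories = [unicodedata.category(ch) for ch in (ko_kai, baht, unassigned)]
```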

Code Points, Code Units, UTF 8/16/32

Matching happens on code points (U+0000 – U+10FFFF)

UTF-8 bytes or UTF-16 Surrogate Halves not visible

Match results independent of encoding form.

Glitches:

Implementations without surrogate support

Perl’s \x
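
The difference between code points and code units is easy to see with a supplementary character. A minimal sketch in Python, whose strings are sequences of code points, so its `re` module never sees UTF-8 bytes or surrogate halves:

```python
import re

clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, a supplementary character

# The same single code point needs a different number of code units
# in each encoding form.
utf8_units = len(clef.encode("utf-8"))            # 8-bit code units (bytes)
utf16_units = len(clef.encode("utf-16-le")) // 2  # 16-bit code units
utf32_units = len(clef.encode("utf-32-le")) // 4  # 32-bit code units

# '.' consumes the whole code point, never half of a surrogate pair.
whole = re.fullmatch(r".", clef)
```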

Normalization

[Figure: the pattern elements \p{Alphabetic}, n, and \p{Non Spacing Mark} matched against the precomposed text "n i ñ a" and the decomposed text "n i n ˜ a".]

Normalization

Approaches to the problem:

Data may be pre-normalized, nothing extra needed.

Use Normalization option, if available.

Application Normalizes the data first
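
The last approach, where the application normalizes the data first, might look like the following sketch. It assumes Python's standard `re` and `unicodedata` modules; the pattern and the helper `matches_normalized` are hypothetical and chosen only to show the effect.

```python
import re
import unicodedata

composed = "ni\u00f1a"     # ñ as a single code point
decomposed = "nin\u0303a"  # n followed by COMBINING TILDE

# One word character is expected between "ni" and "a".
pattern = re.compile(r"ni\wa")

# Without normalization the two canonically equivalent strings behave differently.
raw_composed = pattern.fullmatch(composed)
raw_decomposed = pattern.fullmatch(decomposed)


def matches_normalized(text: str) -> bool:
    """Normalize to NFC before matching (hypothetical helper)."""
    return pattern.fullmatch(unicodedata.normalize("NFC", text)) is not None
```

Normalizing to NFC suits a pattern written with precomposed characters in mind; a pattern written against decomposed text would call for NFD instead.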

Line Endings

Unicode has more:

\u000A Line Feed

\u000C Form Feed

\u000D Carriage Return

\u0085 Next Line (NEL)

\u2028 Line Separator

\u2029 Paragraph Separator

\u000D \u000A CR/LF sequence

Matches normally stop at line ends, but this can be overridden.

Line endings always match as a single character, including the CR/LF sequence

No \n sequence to match any line ending
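
Where an engine recognizes only some of these line endings, a pattern can spell them out. A minimal sketch in Python, whose `re` module treats only `\n` as a line end for `.`, `^`, and `$`; the name `ANY_LINE_END` is made up for this example.

```python
import re

# CR/LF is listed first so that the pair is consumed as one line ending.
ANY_LINE_END = re.compile(r"\r\n|[\n\x0c\r\x85\u2028\u2029]")

text = "one\ntwo\r\nthree\u2028four\x85five\u2029six"
lines = ANY_LINE_END.split(text)

# For comparison: in multiline mode, '^' and '$' here still see only '\n',
# so text separated by U+2028 is treated as a single line.
single_line = re.findall(r"^.*$", "a\u2028b", re.MULTILINE)
```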

Caseless Matching

Simple – one to one character relation between pattern and text being matched.

Full – one to many

German sharp s (ß) uppercases to ‘SS’

Expensive in complexity of implementation, speed.

Existing implementations provide simple form only.
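
The gap between simple and full caseless matching shows up with the sharp s. A minimal sketch in Python: the `re.IGNORECASE` flag compares character by character, while `str.casefold()` applies full case folding to a whole string. The folded comparison is a workaround outside the regular expression, not a matching feature.

```python
import re

# Simple (one-to-one) caseless matching cannot pair 'ß' with 'SS'.
simple = re.fullmatch(r"STRASSE", "stra\u00dfe", re.IGNORECASE)

# Full case folding maps 'ß' to 'ss', so the folded strings compare equal.
full = "STRASSE".casefold() == "stra\u00dfe".casefold()

# The one-to-many relation itself: one character uppercases to two.
upper_sharp_s = "\u00df".upper()
```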

Grapheme Clusters

Definition: what a user would consider a character, or what would display as a single character.

Multi-codepoint clusters:

Base char + combining marks

Example: decomposed form of Ň

Hangul (Korean) syllables

Unicode-enabled regular expressions should provide:

A way to match a grapheme cluster

A test for whether the match position is on a cluster boundary.
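
The base-plus-combining-marks case can be approximated in a few lines. This is a deliberately simplified sketch in Python, not the default grapheme cluster algorithm: it only attaches characters with a Mark General Category to the preceding character and ignores Hangul syllables and every other rule. The function name `simple_clusters` is hypothetical.

```python
import unicodedata


def simple_clusters(text: str) -> list:
    """Split text into base character + combining mark groups (simplified)."""
    clusters = []
    for ch in text:
        # Categories Mn, Mc, Me: attach the mark to the preceding cluster.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# Decomposed Ň is N (U+004E) followed by COMBINING CARON (U+030C).
caron_example = simple_clusters("N\u030ca")
tilde_example = simple_clusters("nin\u0303a")
```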

Word Boundaries, \b

Classic RE feature:

Boundaries between “word” and “non-word” characters

“Word” characters include all Alphabetic.

Non-spacing marks never separated from base, otherwise ignored.

UAX 29 boundaries:

Better, but different, results.

Classic RE: Hello There. G’day 123.456

Unicode Word Boundaries: Hello There. G’day 123.456
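
The sample text can be run through a classic `\b` to see where it splits. The sketch below assumes Python's standard `re` module. The second pattern is only a rough imitation of the Unicode word boundary behavior for this one sample (it lets a single apostrophe or period sit between word characters); it is not an implementation of UAX 29.

```python
import re

text = "Hello There. G\u2019day 123.456"

# Classic word boundaries: the apostrophe and the period both split words.
classic = re.findall(r"\b\w+\b", text)

# Rough imitation for this sample only: keep an apostrophe or period
# when it has word characters on both sides.
approx = re.findall(r"\w+(?:['\u2019.]\w+)*", text)
```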

Unicode TR 18

Unicode Technical Standard #18, Regular Expressions

Guidelines for how to adapt RE implementations to Unicode

Three levels of support: Basic, Extended, Tailored

Basic support requires:

Access to common Unicode character properties

Set (character class) Operations – Union, Intersection, Subtraction

Simple Unicode Loose (caseless) matching

Unicode Line separator characters

Supplementary Character support

Hex notation for Unicode code points

Unicode TR 18

Extended Unicode support:

More properties, characters by name (GREEK CAPITAL LETTER EPSILON)

Canonical Equivalents (normalization)

Unicode style word boundaries

Full case insensitive matching

Matching default grapheme clusters and boundaries

Tailored Support. Language or Locale specific behavior for a number of matching constructs. No implementations available yet.
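
Two of the features listed for TR 18, hex notation for code points and characters by name, can be tried in Python as a small sketch. The name lookup here goes through the standard `unicodedata` module and the string-literal `\N{...}` escape rather than through a regular expression feature.

```python
import re
import unicodedata

# Characters by name, via the Unicode character database.
epsilon = unicodedata.lookup("GREEK CAPITAL LETTER EPSILON")

# The same character written with a named escape in a string literal.
named_literal = "\N{GREEK CAPITAL LETTER EPSILON}"

# Hex notation inside the pattern itself: the re module interprets \u0395.
by_hex = re.fullmatch(r"\u0395", epsilon)
```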

Implementations

Implementations providing significant Unicode support:

Perl

Major innovations to regular expressions

Early adopter of Unicode

Perl features and syntax widely adopted.

Java JDK 1.4

Microsoft .NET

IBM ICU4C

Conclusion

Regular expressions provide a great way to analyze and manipulate Unicode data.

Mainstream implementations are readily available.

Questions
