Collation in ICU 1.8 Mark Davis Chief SW Globalization Architect IBM

Post on 15-Mar-2016

Page 1: Collation in ICU 1.8

Collation in ICU 1.8

Mark Davis
Chief SW Globalization Architect

IBM

Page 2: Collation in ICU 1.8

Agenda
  What is Collation?
    Features
    Mechanisms
    Warnings
  ICU 1.8 Collation

Note: Slides differ from printouts

Page 3: Collation in ICU 1.8

Collation = Sorting Order

How hard can it be?
  A < B < C < …
Complications
  Languages are complex and varied
  Unicode is a big set of characters
  Performance is crucial

Page 4: Collation in ICU 1.8

Varies By:

Language
  Swedish: z < ö
  German: ö < z
Usage
  Dictionary: of < öf
  Telephone: öf < of
Customizations
  A < a
  a < A
Versioning
  Fixes
  New Gov. Stds
  New Characters

Page 5: Collation in ICU 1.8

Levels
  1. Base characters: a < b
  2. Accents: as < às < at
       ignored if there is an L1 difference
  3. Case: ao < Ao < aò
       ignored if there is an L1 or L2 difference
  4. Punctuation: ab < a-b < aB
       ignored* if there is an L1, L2, or L3 difference
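The four levels can be tried out with a toy comparison key. This is only a sketch for simple Latin text, not the UCA or ICU's implementation: the name `level_keys` and the way each level is derived (NFD decomposition for accents, `isupper` for case, non-alphanumerics as punctuation) are assumptions made for illustration.

```python
import unicodedata

def level_keys(s):
    """Toy 4-level key: (base letters, accents, case, punctuation)."""
    primary, secondary, tertiary, quaternary = [], [], [], []
    for ch in unicodedata.normalize("NFD", s):
        if unicodedata.combining(ch):
            # An accent attaches to the preceding base character (level 2).
            if secondary:
                secondary[-1] += (ord(ch),)
        elif ch.isalnum():
            primary.append(ch.lower())     # level 1: base character
            secondary.append(())           # level 2: no accent yet
            tertiary.append(ch.isupper())  # level 3: lowercase before uppercase
        else:
            quaternary.append(ch)          # level 4: punctuation only
    # Tuples compare left to right, so a lower level only matters
    # when every higher level is equal.
    return (primary, secondary, tertiary, quaternary)

print(sorted(["at", "às", "as"], key=level_keys))
print(sorted(["aò", "Ao", "ao"], key=level_keys))
print(sorted(["aB", "a-b", "ab"], key=level_keys))
```

Because the key is a tuple ordered from level 1 to level 4, the "ignored if there is a higher-level difference" rule falls out of ordinary tuple comparison.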

Page 6: Collation in ICU 1.8

Context Sensitivity
  Contractions
    H < Z, but CZ < CH
  Expansions
    OE < Œ < OF
  Both
    カー < カイ
    キー > キイ
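A minimal sketch of contractions and expansions, assuming a hypothetical tailoring in which "CH" is a single collation element placed right after "H", and "Œ" expands to the elements of "OE" with a lower-level tie-breaker. The tables `ORDER` and `EXPANSIONS` are invented for this example and only cover uppercase A-Z.

```python
# Hypothetical alphabet: "CH" is one unit sorting between "H" and "I".
ORDER = list("ABCDEFGH") + ["CH"] + list("IJKLMNOPQRSTUVWXYZ")
WEIGHT = {unit: i for i, unit in enumerate(ORDER)}
EXPANSIONS = {"Œ": "OE"}  # one character, two collation elements

def sort_key(s):
    primary, variant = [], []
    i = 0
    while i < len(s):
        if s[i] in EXPANSIONS:
            for part in EXPANSIONS[s[i]]:
                primary.append(WEIGHT[part])
                variant.append(1)   # marks "came from an expansion"
            i += 1
        elif s[i:i + 2] == "CH":
            primary.append(WEIGHT["CH"])  # contraction: two chars, one element
            variant.append(0)
            i += 2
        else:
            primary.append(WEIGHT[s[i]])
            variant.append(0)
            i += 1
    return (primary, variant)

print(sort_key("H") < sort_key("Z"))    # single letters
print(sort_key("CZ") < sort_key("CH"))  # contraction changes the order
print(sorted(["OF", "Œ", "OE"], key=sort_key))
```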

Page 7: Collation in ICU 1.8

Canonical Equivalence

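Å (U+00C5) ≡ Å (U+212B) ≡ A + ˚

x + . + ^ ≡ x + ^ + .

ự ≡ ư + .
  ≡ ụ + ’
  ≡ u + . + ’
  ≡ u + ’ + .

(here . is the combining dot below, ^ the combining circumflex, ’ the combining horn)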
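These equivalences can be checked with Python's standard `unicodedata` module: two strings are canonically equivalent when their NFD forms are identical. The helper name `canonically_equivalent` is just for illustration.

```python
import unicodedata

def canonically_equivalent(a, b):
    """True if a and b have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# A-ring, Angstrom sign, and A + combining ring above
print(canonically_equivalent("\u00C5", "\u212B"))
print(canonically_equivalent("\u00C5", "A\u030A"))

# dot below (U+0323) and circumflex (U+0302) may appear in either order
print(canonically_equivalent("x\u0323\u0302", "x\u0302\u0323"))

# five spellings of u with horn (U+031B) and dot below (U+0323)
forms = ["\u1EF1", "\u01B0\u0323", "\u1EE5\u031B", "u\u0323\u031B", "u\u031B\u0323"]
print(all(canonically_equivalent(forms[0], f) for f in forms))
```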

Page 8: Collation in ICU 1.8

Oddities
  Normal accents
    cote < coté < côte < côté
    • first accent difference determines order
  French accents
    cote < côte < coté < côté
    • last accent difference determines order
  Il-logical Order (Thai, Lao)
    เ ก sorts like ก เ
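The difference between the two accent orders can be sketched by building the accent (level 2) key forwards or backwards. This is a toy for simple Latin words, assuming accents can be isolated via NFD; `accent_key` and its `french` flag are names chosen for this example.

```python
import unicodedata

def accent_key(word, french=False):
    """Toy (base, accent) key; french=True compares accents from the end."""
    bases, accents = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            if accents:
                accents[-1] += (ord(ch),)
        else:
            bases.append(ch)
            accents.append(())
    if french:
        accents.reverse()  # last accent difference now decides
    return (bases, accents)

words = ["côté", "coté", "côte", "cote"]
print(sorted(words, key=accent_key))
print(sorted(words, key=lambda w: accent_key(w, french=True)))
```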

Page 9: Collation in ICU 1.8

Merging Database Fields

F1 = LastName, F2 = FirstName

Sequential         Weak 1st           Merged
F1, then F2        F1 (L1), F2        L1, L2, L3

diSilva, John      diSilva, John      diSilva, John
diSilva, Fred      dísilva, John      di Silva, John
di Silva, John     di Silva, John     dísilva, John
di Silva, Fred     di Silva, Fred     diSilva, Fred
dísilva, John      diSilva, Fred      di Silva, Fred
dísilva, Fred      dísilva, Fred      dísilva, Fred

Page 10: Collation in ICU 1.8

Customizations

Parameters that change collation behavior

  Choice of language (locale)
  Runtime choices

Examples to follow

Page 11: Collation in ICU 1.8

Parametric Customizations

Strength
  Base
  Base + Accent
  Base + Accent + Case
Case:
  A < a
  a < A
Punctuation:
  di Silva < diSilva
  diSilva < di Silva

Page 12: Collation in ICU 1.8

Punctuation (Alternates)

Base Character     Ignorable

di silva           Dickens
di Silva           di silva
Di silva           disilva
Di Silva           di Silva
Dickens            diSilva
disilva            Di silva
diSilva            Disilva
Disilva            Di Silva
DiSilva            DiSilva

Page 13: Collation in ICU 1.8

Extended Customizations

User-defined
  “&” ≡ “ampersand”
Merging tailorings
  Iranian + French
Script Order
  b < ב < β < б
  β < b < б < ב
Numbers
  A-1 < A-234
  A-234 < A-1
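One reading of the Numbers item is that runs of digits can be compared by numeric value instead of character by character. The sketch below shows that idea only; `numeric_key` is a hypothetical helper, not an ICU API, and it handles ASCII digits only.

```python
import re

def numeric_key(s):
    """Split into text and digit runs; digit runs compare as integers."""
    parts = re.split(r"([0-9]+)", s)
    # With one capture group, odd indexes are always the digit runs.
    return [(1, int(p), "") if i % 2 else (0, 0, p) for i, p in enumerate(parts)]

labels = ["A-234", "A-10", "A-2"]
print(sorted(labels))                   # character order
print(sorted(labels, key=numeric_key))  # numeric order
```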

Page 14: Collation in ICU 1.8

Collation also used for:
  Searching
    ignore case, accent options
  Selection
    Return all records where
    • Jones ≤ name < Smith
  Graphemes
    What a user considers a “character”
    Regular expressions (Level 3)
    • UTR #18
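A range selection such as Jones ≤ name < Smith can be sketched with precomputed keys and a binary search. The key used here (`toy_key`, which strips accents and case) is a stand-in for a real collation sort key and is an assumption of this example.

```python
import bisect
import unicodedata

def toy_key(name):
    """Accent- and case-insensitive key (base characters only)."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).casefold()

def select_range(names, low, high):
    """Return names with low <= name < high under toy_key ordering."""
    keyed = sorted((toy_key(n), n) for n in names)  # index once
    keys = [k for k, _ in keyed]
    start = bisect.bisect_left(keys, toy_key(low))  # binary compare many times
    end = bisect.bisect_left(keys, toy_key(high))
    return [n for _, n in keyed[start:end]]

names = ["Smith", "jones", "Müller", "Adams", "JONES", "Nuñez", "Smythe"]
print(select_range(names, "Jones", "Smith"))
```

The bounds are converted with the same key function as the data, which is what makes the comparison meaningful.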

Page 15: Collation in ICU 1.8

UCA
  UTS #10: Unicode Collation Algorithm
    Levels, Expansions, Contractions, Punctuation, Canonical Equivalence, etc.
    Default ordering: all Unicode code points
    Provides for tailoring to given languages
  Also see: The Unicode Standard, §5.17: Sorting and Searching
  Aligned with ISO 14651

Page 16: Collation in ICU 1.8

APIs

String Compare
Sort Keys
String Search

Page 17: Collation in ICU 1.8

Sort Keys

Transform string into series of bytes which will binary-compare

      Level 1        Level 2      Level 3
a:    06 C3          01 20        01 02       00
A:    06 C3          01 20        01 08       00
á:    06 C3          01 20 32     01 02 02    00
ab:   06 C3 06 D7    01 20 20     01 02 02    00
b:    06 D7          01 20        01 02       00
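The layout of these keys can be reproduced with a tiny table. The weights below are copied from the slide's five examples; the table `CE`, the function name, and the assumption that 01 separates levels and 00 terminates the key are inferred from those examples, not taken from ICU's data files.

```python
import unicodedata

# (primary bytes, secondary weight, tertiary weight) per character,
# read off the example keys above. The acute accent has no primary weight.
CE = {
    "a": (b"\x06\xC3", 0x20, 0x02),
    "A": (b"\x06\xC3", 0x20, 0x08),
    "b": (b"\x06\xD7", 0x20, 0x02),
    "\u0301": (b"", 0x32, 0x02),
}

def sort_key(s):
    ces = [CE[ch] for ch in unicodedata.normalize("NFD", s)]
    primary = b"".join(p for p, _, _ in ces)
    secondary = bytes(sec for _, sec, _ in ces)
    tertiary = bytes(ter for _, _, ter in ces)
    # All primaries, then all secondaries, then all tertiaries.
    return primary + b"\x01" + secondary + b"\x01" + tertiary + b"\x00"

for s in ["a", "A", "á", "ab", "b"]:
    print(s, sort_key(s).hex())
```

Sorting the raw bytes gives a < A < á < ab < b, because every primary weight is compared before any accent or case weight.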

Page 18: Collation in ICU 1.8

String Compare vs. Sort Keys
  Same results in either case
  SC faster for single comparisons
    average 5 to 10 times!
  SK faster for multiple comparisons
    index once
    binary compare many times

Page 19: Collation in ICU 1.8

String Search
  Naïve Approach
    key matches in target at <x, y>
    iff target.substring(x, y) ≡ key
  Boundary Complications
    Ignorables: “a” matches in “(a)”?
    • at <0,2> & <1,2> & <0,3> & <1,3>?
    Contractions: “c” matches in “churo”?
    Normalization: “å” matches in “a¸˚”?

Page 20: Collation in ICU 1.8

WARNING 1: Basics
  Not aligned with character set or repertoire
    Latin-1: Swedish and German sorting differ
  Not code point (binary) order
    Binary: Z < a < v < w
    English: Z > a
    Swedish: v ≡ w
  Not a property of strings
    With same database
    • Swedish user: view/select
    • German user: view/select

Page 21: Collation in ICU 1.8

WARNING 2: Operations

Order not preserved under concatenation / substringing

x < y ↛ xz < yz
x < y ↛ zx < zy
xz < yz ↛ x < y
zx < zy ↛ x < y
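A contraction is enough to produce such counterexamples. The sketch assumes a hypothetical alphabet in which "CH" is one element sorting right after "H"; the names `ORDER` and `key` are invented here.

```python
# Hypothetical tailoring: the pair "CH" is a single unit between "H" and "I".
ORDER = list("ABCDEFGH") + ["CH"] + list("IJKLMNOPQRSTUVWXYZ")
WEIGHT = {unit: i for i, unit in enumerate(ORDER)}

def key(s):
    out, i = [], 0
    while i < len(s):
        unit = "CH" if s[i:i + 2] == "CH" else s[i]
        out.append(WEIGHT[unit])
        i += len(unit)
    return out

# x < y, but appending z = "H" reverses the order.
print(key("C") < key("D"), key("CH") < key("DH"))
# x < y, but prepending z = "C" reverses the order.
print(key("H") < key("I"), key("CH") < key("CI"))
```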

Page 22: Collation in ICU 1.8

WARNING 3: Dependence
  Collation is a relation over strings
    Sort keys embody part of that relation
    Thus, comparing sort keys from different tailorings (or parameters) gives undefined results.
    C < CH < D
    May move binary value for D

Page 23: Collation in ICU 1.8

WARNING 4: Stability
  Stable Sort
    Records with equal comparison come out in original order
    Property of algorithm, not comparison
  Semi-Stable Comparison
    x ≠ y → x ≢ y
    Property of comparison, not algorithm
    Degrades performance
    Doesn’t do what people think (or really want)!
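The distinction can be seen with Python's `sorted`, which is documented to be a stable sort. The case-insensitive `primary_key` is a toy comparison chosen so that all three records compare equal; the semi-stable variant adds the raw string as a final tie-breaker.

```python
records = [("diSilva", "John"), ("Disilva", "Fred"), ("disilva", "Anna")]

def primary_key(name):
    return name.casefold()  # toy comparison: ignores case entirely

# Stable sort: equal records keep their original order
# (a property of the sorting algorithm).
stable = sorted(records, key=lambda r: primary_key(r[0]))

# Semi-stable comparison: strings that differ in any way never compare equal
# (a property of the comparison); ties are broken by code point order.
semi_stable = sorted(records, key=lambda r: (primary_key(r[0]), r[0]))

print(stable)
print(semi_stable)
```

The semi-stable result is deterministic but does not preserve the input order, which is usually what people expected "stable" to mean.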

Page 24: Collation in ICU 1.8

ICU (Int’l Components for Unicode)

Open-source: C, C++, Java, JNI
Charset Conversions, Locales, Resources, Collation, Calendars, Time zones (daylight), Transliteration, Normalization, Boundaries (grapheme, word, line, sentence), Format/Parse (numbers, currencies, dates, times, messages)
Cross-Platform: Windows, Unix, 390, …
Architecture ≡ Java
http://oss.software.ibm.com/icu/

Page 25: Collation in ICU 1.8

ICU/Java Collation Architecture
  L1-3, contractions, expansions, …
  Locale tailorings
  Fully rule-based specification
  Arbitrary runtime user customizations
    & ‘?’ = ‘question mark’
    & ‘$’ = ‘dollar sign’
    & z < ‘george’

Page 26: Collation in ICU 1.8

ICU 1.8.1 Collation Revision
  full UCA compliance
  full supplementary character support
  much better performance
  much smaller sort-keys
  smaller memory footprint
  smaller disk footprint
  additional parametric control
  additional tailoring control

Page 27: Collation in ICU 1.8

Coding Style for Performance
  Avoided unnecessary function calls.
    Example: strlen too expensive!
  Avoided use of objects
    Rewrote core code in C
    C++ API wraps the C core code.
  Fast-pathed common cases
  Used stack memory buffers
    (with expansion if necessary)
  Made inner loops as tight as possible

Page 28: Collation in ICU 1.8

Fractional UCA
  Fractional weights for compression
  Gaps for tailoring, future UCA additions
  Only stores differences in tailoring file
  Reduces memory footprint

             UCA                       Frac. UCA
             a     æ     ɒ     b       a    æ      ɒ      b
primary      0861  0865  0871  0875    17   18 60  18 66  19
secondary    20    20    20    20      03   03     03     03
tertiary     02    02    02    02      03   03     03     03

Page 29: Collation in ICU 1.8

Flat File I

Flat-file (memory mapped)
  speeds initialization
  reduces memory footprint
  (next slide)

Page 30: Collation in ICU 1.8

Flat-File II

Old: separate allocations

New: offsets within mem-map

Page 31: Collation in ICU 1.8

Delta Tailoring II

[Diagram: lookup of “a”. FR tailoring: found, or not found → UCA: found, or not found → code: synthesized]

Page 32: Collation in ICU 1.8

Processing Overview

Checks for identical prefixes
Tolerant of most unnormalized text
  invokes normalization rarely
Uses “exceptional values”
Compresses sort keys
Incremental length/normalization

Page 33: Collation in ICU 1.8

Identical Prefixes

Sorting / Searching Databases
  Many comparisons to “close” strings
  Check initial prefixes with binary compare
  Drop into collation loop at first difference
  Complication…

Page 34: Collation in ICU 1.8

Initial Prefix Complication

Need to back up if in “bad” position:

Type                     Example
Contraction (Spanish)    c h
Normalization            a °
Surrogate Pair           <L> <T>
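The prefix check with backup can be sketched as follows. This is a simplification under stated assumptions: the set of contraction-starting characters is passed in by the caller (here just "c" for the Spanish example), combining marks stand in for the normalization case, and the surrogate case does not arise because Python strings are sequences of code points. The function names are hypothetical.

```python
import unicodedata

def is_unsafe_boundary(s, i, contraction_starts):
    """True if comparison must not start at index i of s."""
    if i < len(s) and unicodedata.combining(s[i]):
        return True   # would split a base character from its accent
    if s[i - 1] in contraction_starts:
        return True   # previous character may begin a contraction
    return False

def safe_start(a, b, contraction_starts=frozenset("c")):
    """Length of the identical prefix, backed up to a safe position."""
    i = 0
    limit = min(len(a), len(b))
    while i < limit and a[i] == b[i]:   # plain binary comparison
        i += 1
    while i > 0 and (is_unsafe_boundary(a, i, contraction_starts)
                     or is_unsafe_boundary(b, i, contraction_starts)):
        i -= 1
    return i

print(safe_start("abc", "abd"))
print(safe_start("chico", "cima"))           # "c" may contract with "h"
print(safe_start("a\u030Ab", "a\u0301b"))    # accents differ after shared "a"
```

The collation loop would then start at the returned index instead of at 0.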

Page 35: Collation in ICU 1.8

Fast C or D (FCD)

Accepts all NFD, most NFC, without normalization

X                     FCD   NFC   NFD
A-ring                Y     Y
Angstrom              Y
A + ring              Y           Y
A + grave             Y           Y
A-ring + grave        Y
A + cedilla + ring    Y           Y
A + ring + cedilla
A-ring + cedilla            Y
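An FCD test can be sketched with the standard `unicodedata` module: for each character, take its canonical decomposition and require that its leading combining class is either zero or not lower than the trailing combining class of the previous character's decomposition. The function name `is_fcd` is chosen for this example; it is not ICU's implementation.

```python
import unicodedata

def is_fcd(s):
    """True if s is in 'Fast C or D' form."""
    prev_trail = 0
    for ch in s:
        decomposed = unicodedata.normalize("NFD", ch)
        lead = unicodedata.combining(decomposed[0])
        trail = unicodedata.combining(decomposed[-1])
        if lead != 0 and lead < prev_trail:
            return False  # marks would need reordering: normalize first
        prev_trail = trail
    return True

RING, GRAVE, CEDILLA = "\u030A", "\u0300", "\u0327"
samples = {
    "A-ring": "\u00C5",
    "Angstrom": "\u212B",
    "A + ring": "A" + RING,
    "A + grave": "A" + GRAVE,
    "A-ring + grave": "\u00C5" + GRAVE,
    "A + cedilla + ring": "A" + CEDILLA + RING,
    "A + ring + cedilla": "A" + RING + CEDILLA,
    "A-ring + cedilla": "\u00C5" + CEDILLA,
}
for name, text in samples.items():
    print(name, is_fcd(text))
```

Only the last two samples fail, since a cedilla (class 202) follows a ring (class 230) either explicitly or inside the decomposition of A-ring.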

Page 36: Collation in ICU 1.8

Exceptional Values

Normal weight storage
  P P P P P P P P P P P P P P P P   S S S S S S S S   C C   T T T T T T
  16b                               8b                1 1   6b

Special Weight Storage
  F F F F   T T T T   d d d d d d d d d d d d d d d d d d d d d d d d
  4b        4b Tag    24 bit data
  NOT_FOUND, EXPANSION, CONTRACTION, THAI, …
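The two 32-bit layouts can be sketched with shifts and masks. Several details are assumptions made for this example: that P, S, and T are the primary, secondary, and tertiary weights, that the fields are packed from the most significant bit down in the order drawn, that the leading F nibble is the literal value 0xF, and the numeric tag values, which are hypothetical.

```python
# Hypothetical tag numbers; the slide only names the tags.
NOT_FOUND, EXPANSION, CONTRACTION, THAI = range(4)

def pack_normal(primary, secondary, flags, tertiary):
    """16-bit primary | 8-bit secondary | 2 flag bits (C C) | 6-bit tertiary."""
    assert primary < 1 << 16 and secondary < 1 << 8
    assert flags < 1 << 2 and tertiary < 1 << 6
    return (primary << 16) | (secondary << 8) | (flags << 6) | tertiary

def unpack_normal(ce):
    return (ce >> 16, (ce >> 8) & 0xFF, (ce >> 6) & 0x3, ce & 0x3F)

def pack_special(tag, data):
    """0xF marker | 4-bit tag | 24-bit data (for example, a table offset)."""
    assert tag < 1 << 4 and data < 1 << 24
    return (0xF << 28) | (tag << 24) | data

def is_special(ce):
    # Assumes normal primaries never start with the nibble 0xF.
    return ce >> 28 == 0xF

ce = pack_normal(0x0861, 0x20, 0, 0x02)
print(hex(ce), unpack_normal(ce), is_special(ce))
sp = pack_special(EXPANSION, 0x123456)
print(hex(sp), is_special(sp))
```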

Page 37: Collation in ICU 1.8

Sort Key Compression
  Common weights are 1-byte
    Primary, secondary, tertiary, quaternary
  Sequences are compressed
  UTF-16 Values for “Märk Davis” (22 bytes)
    004D 00E4 0072 006B 0020 0044 0061 0076 0069 0073 0000
  Sort Key (L3, ignorable punctuation - 19 bytes)
    2F 17 39 2B 1D 17 41 27 3B 01 77 96 0A 01 8F 80 8F 07 00

Page 38: Collation in ICU 1.8

ICU 1.8 vs. Windows, glibc
  Full UCA
  Warning: perf. comparisons approx.
    Depends on data, parameters, features
    glibc - UTF-8 locales
  String comparison: comparable
    ≈ 20% worse to 400% better
  Sort keys: shorter
    ≈ half as long

Page 40: Collation in ICU 1.8

Backup Slides

Page 41: Collation in ICU 1.8

WARNING 5: Math. Relation
  S = {Unicode Strings}
  Reflexive
    ∀a ∊ S: a ≤ a
  Antisymmetric
    ∀a, b ∊ S: a ≤ b & b ≤ a → a = b
  Transitive
    ∀a, b, c ∊ S: a ≤ b & b ≤ c → a ≤ c
  Total
    ∀a, b ∊ S: a ≤ b ∨ b ≤ a