Collation in ICU 1.8
Mark Davis
Chief SW Globalization Architect
IBM
Agenda
What is Collation?
Features
Mechanisms
Warnings
ICU 1.8 Collation
Note: Slides differ from printouts
Collation = Sorting Order
How hard can it be?
  A < B < C < …
Complications
  Languages are complex and varied
  Unicode is a big set of characters
  Performance is crucial
Varies By:
Language: Swedish: z < ö; German: ö < z
Usage: Dictionary: of < öf; Telephone: öf < of
Customizations: A < a; a < A
Versioning: Fixes; New Gov. Stds; New Characters
Levels
1. Base characters: a < b
2. Accents: as < às < at
   ignored if there is an L1 character difference
3. Case: ao < Ao < aò
   ignored if there is an L1 or L2 difference
4. Punctuation: ab < a-b < aB
   ignored* if there is an L1, L2, or L3 difference
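The level rules above can be sketched as a tuple of per-level lists, where a later level is consulted only when all earlier levels tie. This is a toy illustration, not UCA data: `toy_key` is a hypothetical helper, and its weights (lowercased base letters, raw accent code points, an uppercase flag, raw punctuation) are assumptions chosen only to reproduce the slide's examples.

```python
import unicodedata

def toy_key(text):
    """Build a four-level comparison key (toy weights, not real UCA weights)."""
    primary, secondary, tertiary, quaternary = [], [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            if secondary:
                # Level 2: attach the accent to the preceding base character.
                secondary[-1] += (ord(ch),)
        elif ch.isalnum():
            primary.append(ch.lower())      # Level 1: base character
            secondary.append(())            # Level 2: no accent so far
            tertiary.append(ch.isupper())   # Level 3: case
        else:
            quaternary.append(ch)           # Level 4: punctuation
    # Tuple comparison looks at a later level only if all earlier levels tie.
    return (primary, secondary, tertiary, quaternary)

print(sorted(["at", "\u00e0s", "as"], key=toy_key))   # accents example
print(sorted(["a\u00f2", "Ao", "ao"], key=toy_key))   # case example
print(sorted(["aB", "a-b", "ab"], key=toy_key))       # punctuation example
```

Each call returns the words in the order shown on the slide for that level.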
Context Sensitivity
Contractions
  H < Z, but CZ < CH
Expansions
  OE < Œ < OF
Both
  カー < カイ
  キー > キイ
Canonical Equivalence
Å ≡ Å ≡ A + º
x + . + ^ ≡ x + ^ + .
ự ≡ ư + .
  ≡ ụ + ’
  ≡ u + . + ’
  ≡ u + ̛ + .
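The equivalences on this slide can be checked with Python's standard `unicodedata` module. This is a minimal sketch: `canonically_equivalent` is a hypothetical helper name, and it assumes the slide's characters are U+00C5, U+212B (Angstrom sign), U+030A (ring), U+0323 (dot below), U+0302 (circumflex), and U+031B (horn).

```python
import unicodedata

def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# A-ring: precomposed letter, Angstrom sign, and A + combining ring above.
print(canonically_equivalent("\u00C5", "\u212B"))
print(canonically_equivalent("\u00C5", "A\u030A"))

# Dot below and circumflex have different combining classes,
# so either typing order normalizes to the same sequence.
print(canonically_equivalent("x\u0323\u0302", "x\u0302\u0323"))

# u with horn and dot below, spelled five ways.
forms = ["\u1EF1", "\u01B0\u0323", "\u1EE5\u031B", "u\u0323\u031B", "u\u031B\u0323"]
print(all(canonically_equivalent(forms[0], f) for f in forms))
```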
Oddities
Normal accents
  cote < coté < côte < côté
  • first accent difference determines order
French accents
  cote < côte < coté < côté
  • last accent difference determines order
Il-logical Order (Thai, Lao)
  เ ก sorts like ก เ
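The two accent orders above differ only in the direction in which the accent level is scanned. A rough sketch, assuming toy accent weights (raw combining code points) and a hypothetical `accent_key` helper; real collators use tailored secondary weights.

```python
import unicodedata

def accent_key(word, french=False):
    """Key of (base letters, accents); French mode compares accents from the end."""
    base, accents = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            if accents:
                accents[-1] += (ord(ch),)   # accent belongs to the previous base letter
        else:
            base.append(ch)
            accents.append(())
    if french:
        accents.reverse()                   # last accent difference now decides
    return (base, accents)

words = ["c\u00f4t\u00e9", "cot\u00e9", "c\u00f4te", "cote"]
print(sorted(words, key=accent_key))
print(sorted(words, key=lambda w: accent_key(w, french=True)))
```

The first call yields the "normal" order from the slide, the second the French order.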
Merging Database Fields
F1 = LastName, F2 = FirstName

Sequential        Weak 1st          Merged
F1, then F2       F1 (L1), F2       L1, L2, L3
diSilva, John     diSilva, John     diSilva, John
diSilva, Fred     dísilva, John     di Silva, John
di Silva, John    di Silva, John    dísilva, John
di Silva, Fred    di Silva, Fred    diSilva, Fred
dísilva, John     diSilva, Fred     di Silva, Fred
dísilva, Fred     dísilva, Fred     dísilva, Fred
Customizations
Parameters that change collation behavior
Choice of language (locale)
Runtime choices
Examples to follow
Parametric Customizations
Strength: Base; Base + Accent; Base + Accent + Case
Case: A < a; a < A
Punctuation: di Silva < diSilva; diSilva < di Silva
Punctuation (Alternates)
Base Character
  di silva
  di Silva
  Di silva
  Di Silva
  Dickens
  disilva
  diSilva
  Disilva
  DiSilva
Ignorable
  Dickens
  di silva
  disilva
  di Silva
  diSilva
  Di silva
  Disilva
  Di Silva
  DiSilva
Extended Customizations
User-defined
  “&” ≡ “ampersand”
Merging tailorings
  Iranian + French
Script Order
  b < ב < β < б
  β < b < б < ב
Numbers
  A-1 < A-234
  A-234 < A-1
Collation also used for:
Searching
  ignore case, accent options
Selection
  Return all records where
  • Jones ≤ name < Smith
Graphemes
  What a user considers a “character”
Regular expressions (Level 3)
  • UTR #18
UCA
UTS #10: Unicode Collation Algorithm
  Levels, Expansions, Contractions, Punctuation, Canonical Equivalence, etc.
  Default ordering: all Unicode code points
  Provides for tailoring to given languages
  Also see: The Unicode Standard, §5.17: Sorting and Searching
Aligned with ISO 14651
APIs
String Compare
Sort Keys
String Search
Sort Keys
Transform string into series of bytes which will binary-compare
a: 06 C3 01 20 01 02 00
A: 06 C3 01 20 01 08 00
á: 06 C3 01 20 32 01 02 02 00
ab: 06 C3 06 D7 01 20 20 01 02 02 00
b: 06 D7 01 20 01 02 00
Level 1 Level 2 Level 3
String Compare vs. Sort Keys
Same results in either case
SC faster for single comparisons
  average 5 to 10 times!
SK faster for multiple comparisons
  index once
  binary compare many times
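The trade-off between the two APIs can be sketched with a toy collator that offers both a direct comparison and a byte-string sort key. Everything here is an assumption for illustration: the helper names (`levels`, `sort_key`, `compare`) are hypothetical, and the weights are made up, borrowing only the slide's layout of levels separated by 01 and terminated by 00. It handles ASCII letters with at most one accent each.

```python
import unicodedata
from functools import cmp_to_key

def levels(text):
    """Toy per-level weights for ASCII letters plus combining accents."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            if secondary:
                secondary[-1] = min(0xFF, 0x21 + abs(ord(ch) - 0x300))
        elif ch.isascii() and ch.isalpha():
            primary.append(ord(ch.lower()) - ord("a") + 0x06)
            secondary.append(0x20)                           # "no accent" weight
            tertiary.append(0x08 if ch.isupper() else 0x02)  # case weight
    return primary, secondary, tertiary

def sort_key(text):
    """Sort key: levels joined by 01 and ended by 00, so a plain bytes compare works."""
    primary, secondary, tertiary = levels(text)
    return bytes(primary) + b"\x01" + bytes(secondary) + b"\x01" + bytes(tertiary) + b"\x00"

def compare(a, b):
    """String compare: returns -1, 0, or 1 without building a key."""
    la, lb = levels(a), levels(b)
    return (la > lb) - (la < lb)

words = ["b", "ab", "\u00e1", "A", "a"]
print(sorted(words, key=sort_key))             # index once, binary compare many times
print(sorted(words, key=cmp_to_key(compare)))  # same order via direct comparison
```

Both calls produce the same order, matching "Same results in either case"; which one is faster depends on how many comparisons each string takes part in.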
String Search
Naïve Approach
  key matches in target at <x, y> iff target.substring(x, y) ≡ key
Boundary Complications
  Ignorables: “a” matches in “(a)”?
  • at <0,2> & <1, 2> & <0,3> & <1,3>?
  Contractions: “c” matches in “churo”?
  Normalization: “å” matches in “a¸˚”?
WARNING 1: Basics
Not aligned with character set or repertoire
  Latin-1: Swedish and German sorting differ
Not code point (binary) order
  Binary: Z < a < v < w
  English: Z > a
  Swedish: v ≡ w
Not a property of strings
  With same database
  • Swedish user: view/select
  • German user: view/select
WARNING 2: Operations
Order not preserved under concatenation / substringing
  x < y ↛ xz < yz
  x < y ↛ zx < zy
  xz < yz ↛ x < y
  zx < zy ↛ x < y
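A contraction is enough to break these implications. The sketch below assumes a toy traditional-Spanish rule in which "ch" is a single unit sorting after every other "c…" sequence; `spanish_key` and `less` are hypothetical helpers, not a real tailoring.

```python
def spanish_key(word):
    """Toy key: 'ch' is one collation unit that sorts between c and d."""
    units, i = [], 0
    while i < len(word):
        if word.startswith("ch", i):
            units.append(("c", 1))      # contraction: after plain c, before d
            i += 2
        else:
            units.append((word[i], 0))  # ordinary single letter
            i += 1
    return units

def less(a, b):
    return spanish_key(a) < spanish_key(b)

# Prefixing: x < y does not imply zx < zy.
x, y, z = "h", "i", "c"
print(less(x, y), less(z + x, z + y))

# Suffixing: x < y does not imply xz < yz.
x, y, z = "c", "cg", "h"
print(less(x, y), less(x + z, y + z))
```

In both cases the first comparison holds and the second does not, because concatenation creates the contraction "ch".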
WARNING 3: Dependence
Collation is a relation over strings
Sort keys embody part of that relation
Thus, comparing sort keys from different tailorings (or parameters) gives undefined results.
  C < CH < D
  May move binary value for D
WARNING 4: Stability
Stable Sort
  Records with equal comparison come out in original order
  Property of algorithm, not comparison
Semi-Stable Comparison
  x ≠ y → x ≢ y
  Property of comparison, not algorithm
  Degrades performance
  Doesn’t do what people think (or really want)!
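The distinction can be seen with Python's built-in sort, which is guaranteed stable. This sketch assumes a deliberately weak key (`weak_key`, hypothetical) that ignores case and spaces, and models the semi-stable comparison by adding the raw string as a final tie-breaker.

```python
records = [
    ("di Silva", "John"),
    ("Dickens", "Ann"),
    ("DI SILVA", "Fred"),
    ("di silva", "Bea"),
]

def weak_key(record):
    """Weak comparison: last names differing only in case or spaces compare equal."""
    return record[0].replace(" ", "").lower()

# Stable sort: a property of the algorithm. Equal records keep their input order.
stable = sorted(records, key=weak_key)

# Semi-stable comparison: a property of the comparison. Unequal strings never tie,
# because the raw code points break the tie, so the input order no longer matters.
semi_stable = sorted(records, key=lambda r: (weak_key(r), r[0]))

print([r[1] for r in stable])
print([r[1] for r in semi_stable])
```

The second result reorders the "di Silva" records by code point, which is rarely the order a user expects.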
ICU (Int’l Components for Unicode)
Open-source: C, C++, Java, JNI
Charset Conversions, Locales, Resources, Collation, Calendars, Time zones (daylight), Transliteration, Normalization, Boundaries (grapheme, word, line, sentence), Format/Parse (numbers, currencies, dates, times, messages)
Cross-Platform: Windows, Unix, 390, …
Architecture ≡ Java
http://oss.software.ibm.com/icu/
ICU/Java Collation Architecture
L1-3, contractions, expansions, …
Locale tailorings
Fully rule-based specification
Arbitrary runtime user customizations
  & ‘?’ = ‘question mark’
  & ‘$’ = ‘dollar sign’
  & z < ‘george’
ICU 1.8.1 Collation Revision
full UCA compliance
full supplementary character support
much better performance
much smaller sort-keys
smaller memory footprint
smaller disk footprint
additional parametric control
additional tailoring control
Coding Style for Performance
Avoided unnecessary function calls.
  Example: strlen too expensive!
Avoided use of objects
  Rewrote core code in C
  C++ API wraps the C core code.
Fast-pathed common cases
Used stack memory buffers
  (with expansion if necessary)
Made inner loops as tight as possible
Fractional UCA
Fractional weights for compression
Gaps for tailoring, future UCA additions
Only stores differences in tailoring file
Reduces memory footprint

            UCA                         Frac. UCA
            a     æ     ɒ     b         a     æ      ɒ      b
primary     0861  0865  0871  0875      17    18 60  18 66  19
secondary   20    20    20    20        03    03     03     03
tertiary    02    02    02    02        03    03     03     03
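The fractional primaries in the table are variable-length byte strings that still order correctly under a plain byte comparison, and the gap between 18 60 and 18 66 leaves room for a tailoring. A small sketch using the table's values; the inserted letter "x" and its weight 18 63 are hypothetical.

```python
# Fractional primary weights from the table: one byte for common letters,
# two bytes for less common ones.
frac_primary = {
    "a": bytes([0x17]),
    "\u00e6": bytes([0x18, 0x60]),   # æ
    "\u0252": bytes([0x18, 0x66]),   # ɒ
    "b": bytes([0x19]),
}

# Byte-wise comparison of the weights reproduces the intended order.
print(sorted(frac_primary, key=frac_primary.get))

# A tailoring can place a new element in the gap without renumbering anything.
frac_primary["x"] = bytes([0x18, 0x63])  # hypothetical tailored weight
print(sorted(frac_primary, key=frac_primary.get))
```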
Flat File I
Flat-file (memory mapped)
  speeds initialization
  reduces memory footprint
  (next slide)
Flat-File II
Old: separate allocations
New: offsets within mem-map
Delta Tailoring II
[Diagram: lookup of “a” goes first to the FR tailoring (found / not found), then to the UCA (found / not found), then to code (synthesized)]
Processing Overview
Checks for identical prefixes
Tolerant of most unnormalized text
  invokes normalization rarely
Uses “exceptional values”
Compresses sort keys
Incremental length/normalization
Identical Prefixes
Sorting / Searching Databases
Many comparisons to “close” strings
Check initial prefixes with binary compare
Drop into collation loop at first difference
Complication…
Initial Prefix Complication
Need to back up if in “bad” position:

Type                    Example
Contraction (Spanish)   c h
Normalization           a °
Surrogate Pair          <L> <T>
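The prefix check with back-up might look roughly like this. It is a sketch under stated assumptions, not ICU's actual logic: `CONTRACTION_TAILS`, `is_unsafe_start`, and `safe_common_prefix` are hypothetical names, and the only "bad" positions modeled are combining marks and the second letter of the Spanish "ch". Python strings hold code points, so the surrogate-pair case does not arise here.

```python
import unicodedata

CONTRACTION_TAILS = frozenset("h")  # hypothetical: second letter of Spanish "ch"

def is_unsafe_start(text, index):
    """True if collation cannot safely start at text[index]."""
    if index >= len(text):
        return False
    ch = text[index]
    return unicodedata.combining(ch) != 0 or ch in CONTRACTION_TAILS

def safe_common_prefix(a, b):
    """Length of the identical prefix that can be skipped before collating."""
    limit = min(len(a), len(b))
    i = 0
    while i < limit and a[i] == b[i]:   # cheap binary comparison
        i += 1
    # Back up while the first difference sits in a "bad" position.
    while i > 0 and (is_unsafe_start(a, i) or is_unsafe_start(b, i)):
        i -= 1
    return i

print(safe_common_prefix("abc", "abd"))            # difference at a safe position
print(safe_common_prefix("ch", "ci"))              # 'h' may be part of a contraction
print(safe_common_prefix("a\u030a", "a\u0301"))    # accents belong to the shared 'a'
```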
Fast C or D (FCD)
Accepts all NFD, most NFC, without normalization
X                     FCD   NFC   NFD
A-ring                Y     Y
Angstrom              Y
A + ring              Y           Y
A + grave             Y           Y
A-ring + grave        Y
A + cedilla + ring    Y           Y
A + ring + cedilla
A-ring + cedilla            Y
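One way to state the FCD condition is that decomposing each character separately, with no reordering, already yields canonically ordered text. The check below is a sketch of that idea using `unicodedata`; `is_fcd` is a hypothetical helper, not the ICU implementation.

```python
import unicodedata

def is_fcd(text):
    """True if per-character decomposition leaves combining classes in order."""
    prev_trail = 0
    for ch in text:
        decomposed = unicodedata.normalize("NFD", ch)
        lead = unicodedata.combining(decomposed[0])    # class of first code point
        trail = unicodedata.combining(decomposed[-1])  # class of last code point
        if lead != 0 and prev_trail > lead:
            return False   # a mark would have to move left: normalization needed
        prev_trail = trail
    return True

samples = {
    "A-ring": "\u00C5",
    "Angstrom": "\u212B",
    "A + ring": "A\u030A",
    "A-ring + grave": "\u00C5\u0300",
    "A + cedilla + ring": "A\u0327\u030A",
    "A + ring + cedilla": "A\u030A\u0327",
    "A-ring + cedilla": "\u00C5\u0327",
}
for name, text in samples.items():
    print(name, is_fcd(text))
```

Only the last two samples fail the check, which agrees with the rows of the table that have no entry in the FCD column.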
Exceptional Values
Normal weight storage
  P P P P P P P P P P P P P P P P | S S S S S S S S | C C | T T T T T T
  16b                             | 8b              | 1 1 | 6b
Special Weight Storage
  F F F F | T T T T | d d d d d d d d d d d d d d d d d d d d d d d d
  4b      | 4b Tag  | 24 bit data
  NOT_FOUND, EXPANSION, CONTRACTION, THAI, …
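The two 32-bit layouts can be sketched with plain bit operations. The field widths come from the slide; the tag numbering in `TAGS`, the helper names, and the assumption that normal primaries never start with the nibble F are hypothetical.

```python
SPECIAL_FLAG = 0xF  # top four bits all set marks a special value (slide layout)
TAGS = {"NOT_FOUND": 0, "EXPANSION": 1, "CONTRACTION": 2, "THAI": 3}  # hypothetical numbering

def pack_normal(primary, secondary, case_bits, tertiary):
    """16-bit primary | 8-bit secondary | 2 case bits | 6-bit tertiary."""
    assert primary < (1 << 16) and secondary < (1 << 8)
    assert case_bits < (1 << 2) and tertiary < (1 << 6)
    return (primary << 16) | (secondary << 8) | (case_bits << 6) | tertiary

def pack_special(tag, data):
    """4 flag bits | 4-bit tag | 24 bits of data (for example a table offset)."""
    assert tag < (1 << 4) and data < (1 << 24)
    return (SPECIAL_FLAG << 28) | (tag << 24) | data

def unpack(value):
    # Assumes normal primaries never begin with the nibble F.
    if value >> 28 == SPECIAL_FLAG:
        return ("special", (value >> 24) & 0xF, value & 0xFFFFFF)
    return ("normal", value >> 16, (value >> 8) & 0xFF, (value >> 6) & 0x3, value & 0x3F)

print(unpack(pack_normal(0x0861, 0x20, 0, 0x02)))
print(unpack(pack_special(TAGS["CONTRACTION"], 0x001234)))
```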
Sort Key Compression
Common weights are 1-byte
  Primary, secondary, tertiary, quaternary
Sequences are compressed
UTF-16 Values for “Märk Davis” (22 bytes)
  004D 00E4 0072 006B 0020 0044 0061 0076 0069 0073 0000
Sort Key (L3, ignorable punctuation - 19 bytes)
  2F 17 39 2B 1D 17 41 27 3B 01 77 96 0A 01 8F 80 8F 07 00
ICU 1.8 vs. Windows, glibc
Full UCA
Warning: perf. comparisons approx.
  Depends on data, parameters, features
  glibc - UTF-8 locales
String comparison: comparable
  ≈ 20% worse to 400% better
Sort keys: shorter
  ≈ half as long
More Information
ICU
  http://oss.software.ibm.com/icu/
Design Document
  http://oss.software.ibm.com/cvs/icu/icuhtml/design/collation/
These Slides
  http://www.macchiato.com
Q & A
Backup Slides
WARNING 5: Math. Relation
S = {Unicode Strings}
Reflexive
  ∀a ∊ S: a ≤ a
Antisymmetric
  ∀a, b ∊ S: a ≤ b & b ≤ a → a = b
Transitive
  ∀a, b, c ∊ S: a ≤ b & b ≤ c → a ≤ c
Total
  ∀a, b ∊ S: a ≤ b ∨ b ≤ a