
Unicode Support for Mathematics

Murray Sargent III

Microsoft

Overview

• Unicode math characters

• Semantics of math characters

• Unicode and markup

• Multiple ways of encoding math characters

• Not yet standardized math characters

• Inputting math symbols

Unicode Math Characters

• 340 math chars exist in ASCII, U+2200 – U+22FF, arrows, combining marks of Unicode 3.0

• 996 math alphanumeric characters are proposed for Plane 1, as requested by the STIX project

• 951 new math symbols and operators are proposed for the BMP

• One math variant code

• One new combining character (reverse solidus)

Math Alphanumeric Characters

• Math needs various Latin and Greek alphabets like normal, bold, italic, script, Fraktur, and open-face

• May appear to be font variations, but have distinct semantics

• Without these distinctions, you get gibberish, violating Unicode rule: plain text must contain enough info to permit the text to be rendered legibly, and nothing more

• Plain-text searches should distinguish between alphabets, e.g., search for script H shouldn’t match H, etc.

• Reduces markup verbosity

Legibility Loss

Without math alphabets, the Hamiltonian formula

ℋ = ∫ dτ [εE² + μH²]

becomes an integral equation

H = ∫ dτ [εE² + μH²]

Math Alphanumeric Chars (cont)

• Bold a-z, A-Z, 0-9, α-ω, Α-Ω

• Italic a-z, A-Z, α-ω, Α-Ω

• Bold italic a-z, A-Z, α-ω, Α-Ω

• Script a-z, A-Z

• Bold script a-z, A-Z

• Fraktur a-z, A-Z

• Bold Fraktur a-z, A-Z

• Open-face a-z, A-Z, 0-9

• Sans-serif a-z, A-Z, 0-9

• Sans-serif bold a-z, A-Z, 0-9, α-ω, Α-Ω

• Sans-serif italic a-z, A-Z

• Sans-serif bold italic a-z, A-Z, α-ω, Α-Ω

• Monospace a-z, A-Z, 0-9

How to Display Math Alphabets?

Can use Unicode surrogate pair mechanisms available on OS

Alternatively, bind to standard fonts and use corresponding BMP characters.

The second approach is probably faster, and displaying Unicode requires font binding in any event.

A single math font may look more consistent.
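The surrogate-pair route can be sketched in a few lines of Python. The helper name `utf16_surrogates` is hypothetical, and U+1D400 (MATHEMATICAL BOLD CAPITAL A) is assumed here as a sample Plane 1 math alphanumeric; the arithmetic is the standard UTF-16 split of a code point above U+FFFF.

```python
def utf16_surrogates(code_point):
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need surrogates")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits go into the low surrogate
    return high, low


# Sample Plane 1 character (an assumption for illustration): bold capital A.
high, low = utf16_surrogates(0x1D400)
print(f"U+1D400 -> {high:04X} {low:04X}")
```

A renderer that binds to standard fonts instead would map the same code point back to a BMP letter plus a font choice, which is the alternative described above.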

Multiple Character Encodings

As with nonmath characters, math symbols can often be encoded in multiple ways, composed and decomposed

E.g., ≠ can be U+003D, U+0338 or U+2260

Recommendation: use the fully composed symbol, e.g., U+2260 for ≠

For alphabetic characters, use the fully decomposed sequence, e.g., use U+0061, U+0308 for ä, not U+00E4

Some representations use markup for the alphabetic cases.

This allows multicharacter combining marks.
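One way to try out the composed and decomposed forms is with Python's standard `unicodedata` module. Using the Unicode normalization forms NFC and NFD for this is an assumption of this sketch; the slides do not name a normalization form.

```python
import unicodedata


def code_points(text):
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# "=" followed by COMBINING LONG SOLIDUS OVERLAY, composed into one symbol.
not_equal_decomposed = "\u003D\u0338"
not_equal_composed = unicodedata.normalize("NFC", not_equal_decomposed)

# Precomposed a-with-diaeresis, decomposed into base letter plus combining mark.
a_umlaut_composed = "\u00E4"
a_umlaut_decomposed = unicodedata.normalize("NFD", a_umlaut_composed)

print(code_points(not_equal_composed))
print(code_points(a_umlaut_decomposed))
```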

Compatibility Holes

Compatibility holes (reserved positions) exist in some Unicode sequences to avoid duplicate encodings (ugh!)

E.g., U+2071-U+2073 are holes for ¹²³, which are U+00B9, U+00B2, and U+00B3, respectively

Math alphanumerics have holes corresponding to Letterlike symbols.

Recommendation: you can use the hole codes internally, but should import and export the standard codes.
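The import/export recommendation can be sketched as a small mapping table. The example below assumes the script-capital block as later standardized (starting at U+1D49C), where the holes correspond to Letterlike Symbols; the function name `script_capital` is hypothetical.

```python
import unicodedata

SCRIPT_CAPITAL_A = 0x1D49C  # MATHEMATICAL SCRIPT CAPITAL A

# Hole positions in the script-capital run, mapped to the standard
# Letterlike Symbols codes that should be imported and exported.
SCRIPT_CAPITAL_HOLES = {
    0x1D49D: 0x212C,  # B
    0x1D4A0: 0x2130,  # E
    0x1D4A1: 0x2131,  # F
    0x1D4A3: 0x210B,  # H
    0x1D4A4: 0x2110,  # I
    0x1D4A7: 0x2112,  # L
    0x1D4A8: 0x2133,  # M
    0x1D4AD: 0x211B,  # R
}


def script_capital(letter):
    """Return the standard script capital for an ASCII letter A-Z."""
    if len(letter) != 1 or not "A" <= letter <= "Z":
        raise ValueError("expected a single ASCII capital letter")
    internal = SCRIPT_CAPITAL_A + ord(letter) - ord("A")  # may be a hole code
    return chr(SCRIPT_CAPITAL_HOLES.get(internal, internal))


print(unicodedata.name(script_capital("H")))
print(unicodedata.name(script_capital("A")))
```

Internally a program can index by the contiguous hole codes, as the recommendation allows, and apply the table only at the import/export boundary.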

Math Glyph Variants

One approach to the math alphanumerics was to use a set of math glyph variant tags

Such a tag follows a base character imparting a math style

Approach was dropped since it seemed likely to be abused

One math variant tag does exist for purposes of offering a different line slant for some composite symbols.

Nonstandard Characters

People will always invent new math characters that aren’t yet standardized.

Use private use area for these with a higher-level marking that these are for math.

This approach can lead to collisions in the math community (unless a standard is maintained)

Cut/copy in plain text can have collisions with other uses of the private use area

Unicode and Markup

• Unicode was never intended to represent all aspects of text

• Language attribute: sort order, word breaks

• Rich (fancy) text formatting: built-up fractions

• Content tags: headings, abstract, author, figure

• Glyph variants: Poetica font: 58 ampersands; Mantinia font: novel ligatures (TT, TE, etc.)

• MathML adds XML tags for math constructs, but seems awfully wordy

Unicode Plain Text

• Can do a lot with plain text, e.g., BiDi

• Grey zone: use of embedded codes

• Unicode ascribes semantics to characters, e.g., paragraph mark, right-to-left mark

• Lots of interesting punctuation characters in range U+2000 to U+204F

• Extensive character semantics/properties tables, including mathematical, numerical

Unicode Character Semantics

Math characters have math property

Math characters are numeric, variable, or operator, but not a combination

Properties are useful in parsing math plain text

MathML doesn’t use these properties: every quantity is explicitly tagged

Properties still can be useful for inputting text for MathML (no one wants to type all those tags!)

Sometimes default properties need to be overruled

Might be useful to have more math properties
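A rough sketch of how character properties can drive a plain-text parser follows. It uses the General_Category values exposed by Python's `unicodedata` module as a stand-in for the math property, which is an assumption: that module does not expose the Unicode Math property itself, and the function name `classify` and the small set of extra ASCII operators are hypothetical.

```python
import unicodedata


def classify(ch):
    """Classify a character as 'numeric', 'operator', 'variable', or 'other'."""
    category = unicodedata.category(ch)
    if category.startswith("N"):      # Nd, Nl, No: digits and other numbers
        return "numeric"
    # Sm covers math symbols such as + = and U+2200-U+22FF operators.
    # "-", "*", "/" are punctuation by category, so they are listed explicitly.
    if category == "Sm" or ch.isspace() or ch in "-*/":
        return "operator"
    if category.startswith("L"):      # letters, including Greek
        return "variable"
    return "other"


for ch in "a+2":
    print(ch, classify(ch))
```

Each character falls into exactly one class, matching the "not a combination" rule above; a real implementation would also need a way to overrule the defaults.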

Plain Text Encoding

• TeX fraction numerator is what follows a { up to keyword \over

• Denominator is what follows the \over up to the matching }

• { } are not printed

• Simple rules give unambiguous “plain text”, but results don’t look like math

• How to make a plain text that looks like math?

Simple plain text encoding

• Simple operand is a span of non-operator characters

• E.g., simple numerator or denominator is terminated by any operator

• Operators include arithmetic operators, whitespace character, all U+22xx, an argument “break” operator (displayed as small raised dot), sub/superscript operators

• Fraction operator is given by the Unicode fraction slash operator U+2044

Fractions

• abc/d gives $\frac{abc}{d}$

• More complicated operands use parentheses ( ), brackets [ ], or braces { }

• Outermost parens aren’t displayed in built-up form

• E.g., plain text (a + c)/d displays as $\frac{a + c}{d}$

• Easier to read than TeX’s, e.g., {a + c \over d}

• MathML: <mfrac><mrow><mi>a</mi><mo>+</mo><mi>c</mi></mrow><mrow><mi>d</mi></mrow></mfrac>

• Neat feature: plain text usually looks like math
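The fraction rules above can be sketched as a small Python converter from the plain-text form to a LaTeX `\frac`. This is a minimal illustration under stated assumptions, not the author's implementation: the operator set is a simplified guess, only the first fraction slash is handled, and all function names are hypothetical.

```python
FRACTION_SLASHES = {"/", "\u2044"}
OPEN = {"(": ")", "[": "]", "{": "}"}
CLOSE = set(OPEN.values())


def is_operator(ch):
    """Assumed operator set: whitespace, some ASCII signs, and U+2200-U+22FF."""
    return (ch.isspace() or ch in "+-*=<>,;" or ch in FRACTION_SLASHES
            or "\u2200" <= ch <= "\u22ff")


def is_simple(ch):
    """True for characters that may appear inside a simple operand."""
    return not is_operator(ch) and ch not in OPEN and ch not in CLOSE


def operand_before(text, end):
    """Return (operand, start) for the operand that ends just before `end`."""
    if end > 0 and text[end - 1] in CLOSE:
        depth = 0
        for j in range(end - 1, -1, -1):
            if text[j] in CLOSE:
                depth += 1
            elif text[j] in OPEN:
                depth -= 1
                if depth == 0:          # outermost brackets are dropped
                    return text[j + 1:end - 1], j
        raise ValueError("unbalanced bracket")
    start = end
    while start > 0 and is_simple(text[start - 1]):
        start -= 1
    return text[start:end], start


def operand_after(text, start):
    """Return (operand, end) for the operand that begins at `start`."""
    if start < len(text) and text[start] in OPEN:
        depth = 0
        for j in range(start, len(text)):
            if text[j] in OPEN:
                depth += 1
            elif text[j] in CLOSE:
                depth -= 1
                if depth == 0:
                    return text[start + 1:j], j + 1
        raise ValueError("unbalanced bracket")
    end = start
    while end < len(text) and is_simple(text[end]):
        end += 1
    return text[start:end], end


def build_up(text):
    """Rewrite the first plain-text fraction in `text` as a LaTeX \\frac."""
    for i, ch in enumerate(text):
        if ch in FRACTION_SLASHES:
            numerator, start = operand_before(text, i)
            denominator, end = operand_after(text, i + 1)
            return (text[:start] + "\\frac{" + numerator + "}{"
                    + denominator + "}" + text[end:])
    return text


print(build_up("abc/d"))
print(build_up("(a + c)/d"))
```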

Subscripts and Superscripts

Unicode has numeric subscripts and superscripts along with some operators (U+2070-U+208E).

Others need some kind of markup like <msup>…</msup>

With special subscript and superscript operators (not yet in Unicode), these scripts can be encoded in a nestable way.

Use parentheses as for fractions to overrule built-in precedence order.
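For the numeric scripts that Unicode already encodes, a translation table is enough. The sketch below assumes only digits are converted and uses the hypothetical helper names `superscript` and `subscript`; note that superscript 1, 2, and 3 come from U+00B9, U+00B2, and U+00B3 rather than the U+2070 block.

```python
# Superscript digits: 0 is U+2070, 1-3 are U+00B9, U+00B2, U+00B3, 4-9 are U+2074-U+2079.
SUPERSCRIPT_DIGITS = str.maketrans(
    "0123456789",
    "\u2070\u00B9\u00B2\u00B3\u2074\u2075\u2076\u2077\u2078\u2079",
)

# Subscript digits 0-9 are contiguous at U+2080-U+2089.
SUBSCRIPT_DIGITS = str.maketrans(
    "0123456789",
    "\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089",
)


def superscript(digits):
    """Replace ASCII digits with their Unicode superscript forms."""
    return digits.translate(SUPERSCRIPT_DIGITS)


def subscript(digits):
    """Replace ASCII digits with their Unicode subscript forms."""
    return digits.translate(SUBSCRIPT_DIGITS)


print("x" + superscript("10") + " + a" + subscript("12"))
```

Anything beyond these characters, such as a Greek superscript, still needs markup or the proposed script operators.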

Unicode TeX Example

In standard TeX, a formula reads as

${W^{3\beta}_{\delta_1\rho_1\sigma_2} = U^{3\beta}_{\delta_1\rho_1} + {1 \over 8\pi^2}\int_{\alpha_1}^{\alpha_2} d\alpha_2\prime \left[{U^{2\beta}_{\delta_1\rho_1} - \alpha_2\prime U^{1\beta}_{\rho_1\sigma_2} \over U^{0\beta}_{\rho_1\sigma_2}} \right] }$

In a “Unicoded” TeX, it could read as

${W^{3β}_{δ₁ρ₁σ₂} = U^{3β}_{δ₁ρ₁} + {1 / 8π²}∫_{α₁}^{α₂} dα₂' \left[{U^{2β}_{δ₁ρ₁} - α₂'U^{1β}_{ρ₁σ₂} / U^{0β}_{ρ₁σ₂}} \right] }$

Symbol Entry

GUI PCs can display myriad glyphs, mathematics symbols, and international characters

Hard to input special symbols. Menu methods are slow. Hot keys are great but hard to learn

Reexamine and improve symbol-input and storage methods

With left/right Ctrl/Alt keys, PC keyboard gives direct access to 600 symbols. Maximum possible = 2^100 ≈ 10^30

Use on-screen, customizable keyboards and symbol boxes

Drag & drop any symbol into apps or onto keyboards

Hex to Unicode Input Method

Type Unicode character hexadecimal code

Make corrections as need be

Type Alt+x to convert to character

Type Alt+x to convert back to hex (useful especially for “missing glyph” character)

Resolve ambiguities by selection

Input higher-plane chars using 5 or 6-digit code

New MS Office standard
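The two conversions behind the Alt+x toggle can be sketched as follows. The function names are hypothetical and this shows only the code-point arithmetic, not how any editor actually implements the key binding.

```python
def hex_to_char(hex_code):
    """Convert a typed hexadecimal code (4 to 6 digits) to its character."""
    return chr(int(hex_code, 16))   # chr() rejects values above 0x10FFFF


def char_to_hex(ch):
    """Convert a character back to its hexadecimal code, at least 4 digits."""
    return f"{ord(ch):04X}"


print(hex_to_char("2260"), char_to_hex("\u2260"))
print(char_to_hex(hex_to_char("1D400")))
```

Converting back to hex is what makes a “missing glyph” box identifiable: the code point is still recoverable even when no font can draw it.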

Built-Up Formula Heuristics

Math characters identify themselves and neighbors as math

E.g., fraction slash (U+2044), ASCII operators, U+2200–U+22FF, and U+20D0–U+20FF identify neighbors as mathematical

Math characters include various English and Greek alphabets

When heuristics fail, user can select math mode: WYSIWYG instead of visible math on/off codes

Operator Precedence

• Everyone knows that multiply takes precedence over add, e.g., 3+5×3 = 18, not 24

• C-language precedence is too intricate for most programmers to use extensively

• TeX doesn’t use precedence; relies on { } to define operator scope

• In general, ( ) can be used to clarify or overrule precedence

• Precedence reduces clutter, so some precedence is desirable (else things look like LISP!)

• But keep it simple enough to remember easily
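A two-level precedence table is already enough to reproduce the arithmetic example above. The following is a minimal precedence-climbing sketch, assuming only integers, + - × * and parentheses; the names `OPERATORS`, `tokenize`, and `evaluate` are hypothetical and the technique is a common textbook one, not something the slides prescribe.

```python
import operator
import re

# Assumed two-level table: a higher number binds tighter.
OPERATORS = {
    "+": (1, operator.add),
    "-": (1, operator.sub),
    "×": (2, operator.mul),
    "*": (2, operator.mul),
}


def tokenize(text):
    """Split the input into integers, operators, and parentheses."""
    return re.findall(r"\d+|[-+×*()]", text)


def evaluate(text):
    """Evaluate an arithmetic expression using precedence climbing."""
    tokens = tokenize(text)
    pos = 0

    def primary():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            value = expression(1)   # parentheses overrule precedence
            pos += 1                # skip the closing ")"
            return value
        return int(token)

    def expression(min_precedence):
        nonlocal pos
        left = primary()
        while pos < len(tokens) and tokens[pos] in OPERATORS:
            precedence, apply = OPERATORS[tokens[pos]]
            if precedence < min_precedence:
                break
            pos += 1
            # Left-associative: the right side may only use tighter operators.
            right = expression(precedence + 1)
            left = apply(left, right)
        return left

    return expression(1)


print(evaluate("3+5×3"))
print(evaluate("(3+5)×3"))
```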

Layout Operator Precedence

Subscript, superscript

Integral, sum

Functions

Times, divide        / * × · •

Other operators      Space " . , = - + LF Tab

Right brackets       ) ] } |

Left brackets        ( [ {

End of paragraph     FF CR EOP

Mathematics as a Programming Language

Fortran made great steps in getting computers to understand mathematics

Java accepts Unicode variable names

C++ has preprocessor and operator overloading, but needs extensions to be really powerful

Use Unicode characters including math alphanumerics

Use plain-text encoding of mathematical expressions

Can’t use all mathematical expressions as code, but can go much further than current languages go

When to multiply? In abstract, multiplication is infinitely fast and precise, but not on a computer

void IHBMWM(void)
{
    gammap = gamma*sqrt(1 + I2);
    upsilon = cmplx(gamma+gamma1, Delta);
    alphainc = alpha0*(1-(gamma*gamma*I2/gammap)/(gammap + upsilon));
    if (!gamma1 && fabs(Delta*T1) < 0.01)
        alphacoh = -half*alpha0*I2*pow(gamma/gammap, 3);
    else
    {
        Gamma = 1/T1 + gamma1;
        I2sF = (I2/T1)/cmplx(Gamma, Delta);
        betap2 = upsilon*(upsilon + gamma*I2sF);
        beta = sqrt(betap2);
        alphacoh = 0.5*gamma*alpha0*(I2sF*(gamma + upsilon)
                /(gammap*gammap - betap2))
            *((1+gamma/beta)*(beta - upsilon)/(beta + upsilon)
                - (1+gamma/gammap)*(gammap - upsilon)/
                (gammap + upsilon));
    }
    alpha1 = alphainc + alphacoh;
}

void IHBMWM(void)
{
    γ' = γ•√(1 + I₂);
    υ = γ + γ₁ + i•Δ;
    αinc = α₀•(1 - (γ•γ•I₂/γ')/(γ' + υ));
    if (!γ₁ && fabs(Δ•T₁) < 0.01)
        αcoh = -.5•α₀•I₂•pow(γ/γ', 3);
    else
    {
        Γ = 1/T₁ + γ₁;
        I₂ℱ = (I₂/T₁)/(Γ + i•Δ);
        β = √(β² = υ•(υ + γ•I₂ℱ));
        αcoh = .5•γ•α₀•(I₂ℱ•(γ + υ)/(γ'•γ' - β²))
            ×((1+γ/β)•(β - υ)/(β + υ) - (1+γ/γ')•(γ' - υ)/(γ' + υ));
    }
    α₁ = αinc + αcoh;
}

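void IHBMWM(void)
{
    $\gamma' = \gamma \cdot \sqrt{1 + I_2}$;
    $\upsilon = \gamma + \gamma_1 + i \cdot \Delta$;
    $\alpha_{inc} = \alpha_0 \cdot \left[1 - \frac{(\gamma \cdot \gamma \cdot I_2/\gamma')}{\gamma' + \upsilon}\right]$;
    if ($!\gamma_1$ && fabs($\Delta \cdot T_1$) < 0.01)
        $\alpha_{coh} = -.5 \cdot \alpha_0 \cdot I_2 \cdot$ pow($\gamma/\gamma'$, 3);
    else
    {
        $\Gamma = 1/T_1 + \gamma_1$;
        $I_2\mathcal{F} = \frac{I_2/T_1}{\Gamma + i \cdot \Delta}$;
        $\beta = \sqrt{\beta^2 = \upsilon \cdot (\upsilon + \gamma \cdot I_2\mathcal{F})}$;
        $\alpha_{coh} = .5 \cdot \gamma \cdot \alpha_0 \cdot \frac{I_2\mathcal{F} \cdot (\gamma + \upsilon)}{\gamma' \cdot \gamma' - \beta^2} \times \left[\left(1 + \frac{\gamma}{\beta}\right) \cdot \frac{\beta - \upsilon}{\beta + \upsilon} - \left(1 + \frac{\gamma}{\gamma'}\right) \cdot \frac{\gamma' - \upsilon}{\gamma' + \upsilon}\right]$;
    }
    $\alpha_1 = \alpha_{inc} + \alpha_{coh}$;
}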

Conclusions

Unicode provides great support for math in both marked up and plain text

Unicode character properties facilitate plain-text encoding of mathematics but aren’t used in MathML

Heuristics allow plain text to be built up

Need two more Unicode assignments: subscript and superscript operators

On-screen keyboards and symbol boxes aid formula entry

Unicode math characters could be useful for programming languages
