

Page 1: Hon Wah Chan Murray Sargent III Microsoft Corporation Text Services Group, Word Multilingual Editing using RichEdit 4+

Hon Wah Chan

Murray Sargent III

Microsoft Corporation

Text Services Group, Word

Multilingual Editing using RichEdit 4+


Introduction

RichEdit is a text engine with a hierarchy of presentation formats

Features such as automatic choice of fonts, rich text, 2D text objects

Handling non-Unicode documents in Unicode text engines

Describe interfaces and component usage

Ways to input Unicode text using IMEs, speech

Demo


What’s RichEdit?

RichEdit 4.x is a set of plain/rich-text, single/multiline, Unicode/ANSI edit controls and combo/list boxes in a single worldwide binary

Multilevel undo, message and COM interfaces, Word compatibility, pretty rich text

Outline view, zoom, font binding, latest in IME support, and rich complex-script support (BiDi, Indic, and Thai)


Clients include

Handheld PC PocketWord

eBooks

OE (for mail header)

Borland’s Delphi

SQL Server dev tools, RAID

MSN Companion chat

Via Win2k wrapper: cc:mail, WebEditPro, Eudora, Encarta, Money (US), Sibelius, Borland TRichedit class, apps created with VB, MFC…

Outlook mail note, post-it

Most Office dialogs

All OSes since Win98

WordPad, Charmap

Darwin installer

WebCalc

Project

Visual Studio, DaVinci

Publisher

FrontPage


Some Fancier Features

Features added for eBooks: pagination, hyphenation, kerning, ClearType support, text wrap around embedded objects

Multilevel tables

Autocorrect

AutoURL detection (improved from 3.0)


2D Text Objects

RichEdit 4.5 (in development) supports WYSIWYG editing of many 2D objects

Ruby, Tatenakayoko, Warichu, Kumimoji

Math: fractions, autosizing brackets, boxes, matrices, integrals

Demo will show some of these features


Backward Compatibility

Unicode text engines need to import/export text in other character sets

Given non-Unicode plain text, which code page should one use to convert to/from Unicode?

On localized systems, the system code page is a good bet

In multilingual text, you can enter text using keyboards in a variety of languages that need either Unicode or multiple code pages

For searching text, the best choice seems to be the current keyboard code page

If text begins with a BOM, it’s Unicode

If text begins with a rich-text header, e.g., “{\rtf” or “<html>”, use the appropriate conversion routine
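
The import decision described on this slide can be sketched roughly as follows. This is a minimal illustration, not RichEdit's actual logic: the function name `classify`, the return values, and the `cp1252` fallback (standing in for the system code page) are all assumptions made for the example.

```python
import codecs

# BOMs are checked longest-first so a UTF-32 LE BOM is not mistaken for UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def classify(data: bytes, fallback_codepage: str = "cp1252"):
    """Return (kind, text). kind is 'unicode', 'rtf', 'html', or 'codepage'.

    fallback_codepage is a hypothetical stand-in for the system code page.
    """
    # 1. A BOM means the text is Unicode; strip it and decode.
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return "unicode", data[len(bom):].decode(encoding)
    # 2. A rich-text header means the bytes belong to a converter (not shown).
    head = data[:16].lower()
    if head.startswith(b"{\\rtf"):
        return "rtf", None
    if head.startswith(b"<html>"):
        return "html", None
    # 3. Otherwise fall back to a code page and hope it is the right one.
    return "codepage", data.decode(fallback_codepage, errors="replace")
```

For example, `classify(b"caf\xe9")` takes the last branch and decodes the bytes with the fallback code page, while anything starting with `{\rtf` is only labeled and left for an RTF reader.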


Backward Compatibility (cont)

Need a little rich-text functionality to display Unicode plain text unambiguously in some CJK scenarios

This functionality handles font choices and language-dependent glyph variants

When a user types in text using a keyboard charset, edit engine knows charset and therefore can insert accurate Unicode text including which CJK glyph variant to use

Client gets text as pure ANSI (or Unicode) text without script clues

Would be handy to have script tags


Complex Scripts

Unicode covers many complex scripts, e.g., Arabic, Indic, Thai, ancient Korean

Complex scripts require a layout engine that translates character codes to glyph indices (often referencing ligatures)

RichEdit uses Uniscribe and the MS line-layout component for complex scripts


Font Binding

Most Unicode characters belong to scripts

Associate with each position in a document a “font bundle”

When inserting characters, assign each one to a script

For CJK, check surrounding characters for Kana and Hangul as clues to use Japanese or Korean fonts instead of Chinese

Assign scripts to neutrals and digits

Keyboard language, especially IMEs, provides strong binding clues

Format inserted characters with fonts assigned to scripts. Check current font to see if it supports the required script

RichEdit 4.0 has 50 scripts for Unicode 3.1. Client can specify what default font to use for a given script.
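
The Kana/Hangul clue for CJK font binding might look something like the sketch below. The function names, the eight-character window, and the "nearest neighbour wins" rule are assumptions for illustration; the slides do not say how RichEdit weighs the surrounding characters.

```python
def is_kana(ch):
    # Hiragana (U+3040..U+309F) and Katakana (U+30A0..U+30FF) blocks.
    return "\u3040" <= ch <= "\u30ff"


def is_hangul(ch):
    # Hangul syllables (U+AC00..U+D7A3) and Hangul Jamo (U+1100..U+11FF).
    return "\uac00" <= ch <= "\ud7a3" or "\u1100" <= ch <= "\u11ff"


def bind_han_script(text, index, window=8, default="chinese"):
    """Guess which font family to bind to the Han character at text[index].

    window and default are hypothetical tuning choices, not RichEdit values.
    """
    lo = max(0, index - window)
    hi = min(len(text), index + window + 1)
    # Visit neighbours nearest-first so the closest clue decides.
    for i in sorted(range(lo, hi), key=lambda i: abs(i - index)):
        if is_kana(text[i]):
            return "japanese"
        if is_hangul(text[i]):
            return "korean"
    return default
```

A keyboard or IME language, when known, would normally override this guess, since the slide calls it a strong binding clue.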


Language Detection & Font Binding

Korean and Japanese are often easy to spot because of Hangul and Kana characters, respectively

For CJK, one can convert back to a code page and see if errors occur (Ken Lunde’s suggestion)

For proofing purposes, accurate language identification is needed. For font binding, script identification is usually sufficient

Typically more than one language corresponds to a script, e.g., Latin script. Essentially only one language uses the Korean script

Natural language processing techniques allow good language identification if more than a few words are involved, e.g., a sentence
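
The code-page round-trip check credited to Ken Lunde above can be tried with Python's built-in codecs. This is only a sketch: the candidate list and the use of these particular legacy encodings are assumptions, and text made of characters shared by several code pages (ASCII, many common ideographs) will pass more than one test.

```python
# Hypothetical candidate list: (language label, Python codec name).
CANDIDATES = [
    ("japanese", "shift_jis"),
    ("korean", "euc_kr"),
    ("simplified chinese", "gb2312"),
    ("traditional chinese", "big5"),
]


def languages_that_round_trip(text):
    """Return the labels whose legacy code page can encode all of text."""
    survivors = []
    for language, codec in CANDIDATES:
        try:
            text.encode(codec)  # raises if any character is missing
        except UnicodeEncodeError:
            continue
        survivors.append(language)
    return survivors
```

When exactly one candidate survives, the text very likely belongs to that language; when several survive, other clues such as neighbouring Kana or the keyboard language are still needed.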


Font Sizing

In dialogs, 8-pt Latin characters are commonly used

8-pt Chinese characters are hard to read, so it is better to use 9 points in combination with 8-pt Latin characters

Latin characters have bigger descenders than Chinese characters, since the latter only need room for an underline

Combining 8-pt Latin characters with 9-pt Chinese characters and keeping the same baseline increases line height to 9 pts plus extra height for the Latin descender

Result is more like 10 points: shifts text too high in a dialog box originally designed to handle one language
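
The arithmetic behind the "more like 10 points" result can be shown with a small calculation. The ascent and descent values below are hypothetical, chosen only so that the two fonts total 8 and 9 points; real font metrics differ.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    ascent: float   # height above the baseline, in points
    descent: float  # depth below the baseline, in points


def line_height(runs):
    """Height of a line whose runs all sit on one shared baseline."""
    return max(r.ascent for r in runs) + max(r.descent for r in runs)


# Hypothetical metrics: Latin has the deeper descender, Chinese the taller ascent.
latin_8pt = Metrics(ascent=6.0, descent=2.0)
chinese_9pt = Metrics(ascent=8.0, descent=1.0)

mixed = line_height([latin_8pt, chinese_9pt])  # tallest ascent + deepest descent
```

Because the line takes the Chinese ascent and the Latin descent, the mixed line is taller than either font alone, which is the effect the slide describes.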


Unicode Surrogate Pairs

Using two 16-bit surrogates to represent a single character complicates more than measurement and display of characters:

Arrow-key handlers and other methods that change character position must avoid ending up in between lead and trail surrogates

Input methods need to map to surrogate pairs

Case changes, line-breaking rules, sorting, file formats, and backing-store manipulations in general have to recognize and deal with pairs

Surrogate code ranges make them easy to work with relative to multibyte encoding systems
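
A short sketch of why the surrogate ranges are easy to work with: a single range test tells a lead from a trail unit, so a caret handler can step off a pair boundary without scanning backward. The backing store is modeled here as a list of 16-bit integers, and the function names are made up for the example.

```python
def is_lead(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_trail(unit):
    return 0xDC00 <= unit <= 0xDFFF


def to_surrogates(code_point):
    """Split a supplementary-plane code point into (lead, trail)."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def move_caret(units, pos, step):
    """Move the caret one character (step is +1 or -1) over UTF-16 units."""
    pos = max(0, min(len(units), pos + step))
    # If the caret landed between a lead and its trail, keep going the same way.
    if 0 < pos < len(units) and is_lead(units[pos - 1]) and is_trail(units[pos]):
        pos += step
    return pos
```

In a multibyte code page, by contrast, the meaning of a byte can depend on the bytes before it, so the same check may require rescanning from the start of the line.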


Nonspacing Combining Marks

Multicode characters (surrogate pairs, CRLFs, combining-mark and variant-tag sequences) require special display/navigation handling

Render combining-mark sequences by standard system calls and fonts that support combining marks. Better display needs a layout engine that talks to OpenType

Simple caret movement across combining-mark sequences prevents stopping inside a sequence. Backspace key deletes one mark at a time

Mouse-cursor hit testing leaves selection at beginning/end of combining-mark sequence (more elegant model allows selection and editing of individual marks)

Cool thing: if you can navigate past CRLF combinations, you can modify corresponding code to handle surrogate pairs and combining-mark sequences quite easily
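
The point about reusing CRLF navigation code can be illustrated with one "does this character continue the previous cluster?" test shared by both arrow keys. This is a simplified sketch: it works on Python strings (indexed by code point, so surrogate pairs do not appear), and it approximates "combining mark" by a nonzero canonical combining class, which misses some marks.

```python
import unicodedata


def continues_cluster(prev, ch):
    """True if no caret stop is allowed between prev and ch."""
    if prev == "\r" and ch == "\n":
        return True                           # CRLF acts as one unit
    return unicodedata.combining(ch) != 0     # approximate combining-mark test


def next_caret(text, pos):
    """Position after the cluster that starts at pos (right-arrow)."""
    if pos >= len(text):
        return len(text)
    pos += 1
    while pos < len(text) and continues_cluster(text[pos - 1], text[pos]):
        pos += 1
    return pos


def prev_caret(text, pos):
    """Position at the start of the cluster that ends at pos (left-arrow)."""
    if pos <= 0:
        return 0
    pos -= 1
    while pos > 0 and continues_cluster(text[pos - 1], text[pos]):
        pos -= 1
    return pos
```

Adding another multicode sequence, such as a lead/trail surrogate pair in a UTF-16 backing store, would only require one more clause in `continues_cluster`.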


Interfaces

Messages and keyboard

File read/write (plain text or RTF)

TOM (Text Object Model)

ITextServices/ITextHost interfaces


RichEdit Message Interface

System messages: keyboard messages, mouse messages, clipboard messages

Edit messages: RichEdit supports all but four of the system edit messages

RichEdit messages: character/paragraph formatting, text input/query, notification


File Formats

Plain text can be saved/read encoded in any code page, including Unicode and UTF-8

RTF is the principal rich-text format

UTF-8 RTF is used preferentially for cut/copy/paste. Can be used in stream operations

Copying text to/from Word can be a handy way to get desired formatting into a RichEdit instance

HTML is available via system converters


TOM (Text Object Model)

A set of COM dual interfaces that allow Unicode rich/plain text to be manipulated by VB, C/C++, and Java clients

Access for spelling/grammar checkers

Accessibility

Powerful and efficient text-processing primitives

Embedded scripts


TOM (cont)

ITextDocument: top-level editing object

ITextStoryRanges: enumerator for stories in document

ITextRange: primary text interface, a range of text

ITextFont: character-attribute interface

ITextPara: paragraph-attribute interface

ITextTag: HTML tag interface

ITextAttributes: tag-attribute enumerator

ITextSelection: screen-highlighted text range; the selection inherits all range methods


ITextServices/ITextHost Interfaces

Windowless interfaces that go beyond the message interface

In-place active state: use window of the container

Fewer system resources

Faster activation and deactivation


Other Components Used

Uniscribe

MS line-layout component

Windows Text Services Framework

Callbacks for access to word-break, autocorrect, hyphenation, and ClearType libraries


Input methods

Support for the latest IMEs

Speech and handwriting input (Windows Text Services Framework)

Alt+x Unicode input method

Standard hot keys


IMEs

Support Level 2 and Level 3 IMEs

Support Active Input Method Manager (AIMM)

Reconversion - user can convert final string back to composition mode, allowing easy selection of a different candidate string.

Document feed - provides IME with text for current paragraph to increase conversion accuracy during typing.

Mouse Operation - gives user better control over candidate and UI windows

Caret position - gets current caret and line info, which IME98 uses to position UI windows (e.g., candidate list).


Windows Text Services Framework

Provide support for Far East input across Win32 language platforms to aware applications

Provide consistent UI for different input methods: speech, handwriting, IME

Coordinated input

Data persistence for dynamic text editing

RichEdit supports both the native mode and Active Input Method Manager (AIMM) mode


Hex to Unicode Input Method

Type Unicode character hexadecimal code

Make corrections as need be

Type Alt+x to convert to character

Type Alt+x to convert back to hex (useful especially for “missing glyph” character)

Resolve ambiguities by selection

Input higher-plane chars using 5- or 6-digit code

MS Word 2002 standard
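
The Alt+x toggle can be approximated in a few lines. This is a sketch under simplifying assumptions, not the RichEdit or Word algorithm: it takes the longest run of up to six hex digits before the caret that names a valid code point, it ignores the selection-based disambiguation mentioned on the slide, and the function name `alt_x` is invented for the example.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def alt_x(text, caret):
    """Toggle between a hex code and its character just before the caret.

    Returns (new_text, new_caret).
    """
    # Collect up to six hex digits immediately before the caret.
    start = caret
    while start > 0 and caret - start < 6 and text[start - 1] in HEX_DIGITS:
        start -= 1
    # Try the longest candidate first, then shorter ones.
    for begin in range(start, caret):
        value = int(text[begin:caret], 16)
        if 0 < value <= 0x10FFFF and not 0xD800 <= value <= 0xDFFF:
            return text[:begin] + chr(value) + text[caret:], begin + 1
    # No hex run before the caret: turn the previous character back into hex.
    if caret > 0:
        code = format(ord(text[caret - 1]), "X")
        return text[:caret - 1] + code + text[caret:], caret - 1 + len(code)
    return text, caret
```

Calling `alt_x` twice in a row on a non-hex character such as U+263A returns the original text, which mirrors the convert and convert-back behavior described on the slide. A word that happens to consist of hex digits is treated as a code, which is the kind of ambiguity the slide says is resolved by selection.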


Unicode combobox/listbox

Emulate the system combobox and listbox

Unicode support on all Win32 platforms

Allow mixed languages between items

Modified EM_SETTEXTEX for inserting items

Used in Office applications


Demo


Conclusions

Have described RichEdit, an engine for text display and editing with a hierarchy of presentation formats

Automatic choice of fonts for Unicode plain text including surrogate-pair characters, combining mark sequences

Handling non-Unicode documents in Unicode text engines

Described interfaces and component usage

Ways to input Unicode text using IMEs, speech

Clients include many Office and Windows apps

Able to display 2D Text Objects such as Ruby and Warichu