GeoName: a system for back-transliterating Pinyin place names
Kui-Lam Kwok & Qiang Deng
Computer Science Dept., Queens College, City University of New York
Or:
issues involving cross language referencing
of a Chinese place by name
Content:
1. Back-transliteration problem
2. GeoName system - a proposed approach
3. Evaluation
4. Observation/conclusion
Transliteration:
• ‘alphabet mismatch’ when expressing Chinese (place) names in English texts
• names represented by PRC Pinyin code:
e.g. Beijing, Shenzhen
Back-Transliteration:
given the Pinyin code,
what are the original Chinese characters?
Back-Transliteration:
Why are Chinese characters needed?
• remove ambiguity of referenced Pinyin place
• reconcile names in English & Chinese texts
• may assist alignment in E/C parallel texts
• necessary for E-C Cross Language IR (when translating English queries containing
Pinyin place, person, organization names)
4 Possible Ambiguities in
English–Chinese
cross language place name references
Ambiguity #3: Back-transliteration --> which character string is correct?
e.g.
• China’s capital in Chinese - 北京
• PRC Pinyin (1 char, 1 syllable) - 北 --> bei, 京 --> jing
• map back from Pinyin to characters –
  bei --> { 北, 贝, 被, 背, 碑, 杯, 备, 鐾, … } (total 23)
  jing --> { 京, 景, 井, 静, 敬, 竞, 精, 荆, … } (total 20)
• ambiguous candidates: 北井, 贝京, 贝荆, … 北京: which one?
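The candidate explosion can be illustrated with a minimal sketch. The table below is a hypothetical, heavily truncated stand-in for a full syllable-to-character mapping (the slide reports 23 characters for "bei" and 20 for "jing"); the function name `candidates` is also just illustrative.

```python
from itertools import product

# Hypothetical, truncated table: only four characters per syllable.
SYLLABLE_TO_CHARS = {
    "bei": ["北", "贝", "被", "背"],
    "jing": ["京", "景", "井", "荆"],
}

def candidates(syllables):
    """Return every string formed by picking one character per syllable."""
    choices = [SYLLABLE_TO_CHARS[s] for s in syllables]
    return ["".join(chars) for chars in product(*choices)]

cands = candidates(["bei", "jing"])
print(len(cands))       # 4 * 4 = 16 with this toy table
print("北京" in cands)   # the correct string is one candidate among many
print(23 * 20)          # 460 candidates with the full counts quoted on the slide
```

Even a two-syllable name yields hundreds of strings, so some ranking of candidates is needed.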
Ambiguity #4: Name Reference --> same name, different places
Suppose result of back-transliteration is:
beijing --> 贝荆 , then which 贝荆 ? (longitude, latitude)
Ambiguity #1: E/C Pinyin Systems --> which Pinyin system was used?
e.g. ‘Hong Kong’ in characters - 香港
PRC Pinyin: 香 -> xiang, 港 -> gang
Wade-Giles: 香 -> hsiang, 港 -> kang
Hong Kong: 香 -> hong, 港 -> kong …
‘hong kong’ back-transliterated using PRC Pinyin:
hong -> { 红洪鸿宏虹弘泓闳烘项黉哄 … } (19)
kong -> { 孔空恐崆控箜倥 } (7)
Original ‘香港’ is NOT one of these 7x19 combinations!
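A small sketch of why the choice of romanization system matters. The tables are hypothetical and truncated; the keys "PY" (PRC Pinyin) and "WG" (Wade-Giles) follow the abbreviations used on the slides, and `back_transliterate` is an illustrative name, not part of GeoName.

```python
from itertools import product

# Hypothetical, truncated per-system tables (three characters per syllable).
TABLES = {
    "PY": {
        "xiang": ["香", "乡", "湘"], "gang": ["港", "岗", "刚"],
        "hong": ["红", "洪", "鸿"], "kong": ["孔", "空", "恐"],
    },
    "WG": {
        "hsiang": ["香", "乡", "湘"], "kang": ["港", "岗", "刚"],
    },
}

def back_transliterate(syllables, system):
    """List candidate strings under one system; empty if a syllable is unknown."""
    table = TABLES[system]
    if any(s not in table for s in syllables):
        return []  # this spelling is not valid in the chosen system
    return ["".join(c) for c in product(*(table[s] for s in syllables))]

print("香港" in back_transliterate(["xiang", "gang"], "PY"))   # True
print("香港" in back_transliterate(["hsiang", "kang"], "WG"))  # True
print("香港" in back_transliterate(["hong", "kong"], "PY"))    # False
```

Reading the local spelling 'hong kong' as PRC Pinyin produces well-formed candidates, but none of them is the intended name.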
Ambiguity #2: Syllable Segmentation --> which segmentation is correct?
e.g. 秦皇岛 - possible pinyin writing styles:
• Qin Huang Dao
• QinHuangDao
• Qinhuangdao <-- most common, used in NYT
--> how many syllables?
Qin huang dao      3 char
Qin huang da o     4 char
Qin hu ang dao     4 char
Qin hu ang da o    5 char
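The four segmentations above can be reproduced with a short recursive sketch. The syllable inventory here is a hypothetical subset chosen to cover this one example; a real system would load the complete Pinyin syllable table.

```python
# Hypothetical, tiny syllable inventory covering only this example.
SYLLABLES = {"qin", "huang", "dao", "hu", "ang", "da", "o"}

def segmentations(name, syllables=SYLLABLES):
    """Return all splits of `name` into known syllables, fewest syllables first."""
    name = name.lower()
    results = []

    def extend(start, parts):
        if start == len(name):
            results.append(list(parts))
            return
        for end in range(start + 1, len(name) + 1):
            piece = name[start:end]
            if piece in syllables:
                parts.append(piece)
                extend(end, parts)
                parts.pop()

    extend(0, [])
    return sorted(results, key=len)

for seg in segmentations("Qinhuangdao"):
    print(len(seg), " ".join(seg))
```

Sorting by length puts the three-syllable reading first, which matches the 'fewest syllables ranked first' preference stated later in the talk.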
Summary: given a Pinyin geographic name
1. Pinyin system -- which?
2. segmentation -- how many syllables?
3. back-transliterate -- which candidate character string?
4. resolve same name, different places.
GeoName:
a system for back-transliterating Pinyin place names
GeoName: E-C cross language place reference
1. which Pinyin system? -- user chooses; or allow both PY & WG
2. how many segmented syllables? -- fewest syllables ranked first
3. back-transliterate: which candidate? -- a) bi-list; b) confirm by web/Chinese place list; c) rank candidates by frequency
4. resolve same name, different places -- not considered
GeoName –
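Given English Pinyin place $E = e_1 e_2 \ldots e_n$ ($n$ syllables), there are many possible Chinese character string candidates:
$$C^* = c_1 c_2 \ldots c_n = \arg\max_C P(C \mid E) = \arg\max_C P(E \mid C)\,P(C) \approx \arg\max_C P(C),$$
by assuming
$$P(E \mid C) \approx \prod_i P(e_i \mid C) \quad \text{(i.e. } e_i, e_k \text{ independent)}$$
$$\approx \prod_i P(e_i \mid c_i) \quad \text{(i.e. } e_i, c_k \text{ independent)}$$
$$\approx 1 \quad \text{(i.e. all } c_i \text{ map to unique } e_i\text{)}$$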
GeoName –
P(C) = language model of Chinese place names
<obtain training data by processing TREC, NTCIR Chinese collections using BBN IdentiFinder: ~80K approximate unique place names>
Use P(C) to sort candidates; fewest syllables ranked earlier
<bigram model P(c2|c1)P(c3|c2).. not too good>
GeoName –
A heuristic weighting formula based on whole-string, bigram and character frequencies:
$$g(C) = a_1 \log[f(C) + a_1] + a_2 \log[f(c_i c_j) + a_2] + a_3 \log[f(c_i) + a_3]$$
- factor ignored if $f(\cdot) = 0$; $a_1 > a_2 > a_3$
- $a_1 \log[f(C) + a_1]$ => a string seen before is probably correct
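A minimal sketch of how such a weight might be computed. Several things here are assumptions rather than details from the slides: the bigram and character terms are summed over all adjacent pairs and all characters of the candidate, the weights 3, 2, 1 are placeholders that merely satisfy a1 > a2 > a3, and the frequency counts are made up for illustration.

```python
import math

def g(candidate, f_string, f_bigram, f_char, a1=3.0, a2=2.0, a3=1.0):
    """Heuristic weight of a candidate; any factor with zero frequency is skipped."""
    score = 0.0
    f = f_string.get(candidate, 0)
    if f > 0:
        score += a1 * math.log(f + a1)          # whole-string term
    for i in range(len(candidate) - 1):         # assumed: sum over adjacent bigrams
        f = f_bigram.get(candidate[i:i + 2], 0)
        if f > 0:
            score += a2 * math.log(f + a2)
    for ch in candidate:                        # assumed: sum over characters
        f = f_char.get(ch, 0)
        if f > 0:
            score += a3 * math.log(f + a3)
    return score

# Made-up counts standing in for frequencies from a place-name corpus.
f_string = {"北京": 50}
f_bigram = {"北京": 50}
f_char = {"北": 80, "京": 60, "贝": 5}

print(g("北京", f_string, f_bigram, f_char) > g("贝京", f_string, f_bigram, f_char))
```

Because the whole-string term carries the largest weight, a candidate already seen as a place name tends to outrank candidates supported only by character counts.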
Evaluation
Use frequency formula only on 162 Pinyin city names from a bilingual map
(no bilingual pair list was employed)
[Figure: GeoName Evaluation - Frequency Formula (back-transliterating 162 Pinyin geographic names). x-axis: Rank (1 to 10, >10); y-axis: Cumulative # Correct at Rank (60 to 180); data labels: 48%, 70%, 74%, 82%]
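The measure plotted here, cumulative number correct at each rank, can be sketched as follows. The function name and the tiny ranked lists are illustrative only; they are not the 162-name test set.

```python
def cumulative_correct(ranked_lists, gold, max_rank=10):
    """counts[k-1] = number of names whose correct string appears at rank <= k."""
    counts = [0] * max_rank
    for name, candidates in ranked_lists.items():
        target = gold[name]
        if target in candidates[:max_rank]:
            rank = candidates.index(target) + 1
            for k in range(rank, max_rank + 1):
                counts[k - 1] += 1
    return counts

# Made-up ranked outputs for three names, for illustration only.
ranked = {
    "beijing": ["北京", "贝京"],
    "daqiu": ["达丘", "大丘"],
    "tuolu": ["拖路"],
}
gold = {"beijing": "北京", "daqiu": "大丘", "tuolu": "驮芦"}

print(cumulative_correct(ranked, gold, max_rank=3))  # [1, 2, 2]
```

Dividing each count by the number of test names gives percentages of the kind shown as data labels in the figure.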
Examples of Correct Names ranked #1
Daqiu (大丘), Wanbi (湾碧), Gongzhuling (公主岭), ..
Examples of Failed Names
• Non-Pinyin: Qarqi (察尔齐), Yengisar (阳霞), Jorra (觉拉), Dongkar (洞嘎), ..
• mainly longer names: Tuolu (驮芦), Fenglingguan (枫岭关), Qingguandu (清官渡), Dating (大亭), Shasonggang (杉松岗), Denglonghe (灯笼河), ..
GeoName – further improvement
Hypothesis: prefer candidate strings that have been seen before as location names
confirm candidates on:
1. a bilingual list (~4K) – tag: 100
   ftp://ftpserver.ciesin.columbia.edu/pub/data/China/CITAS/gb_code/
2. Chinese monolingual place name list (~80K+4K) – tag: 010
3. web data via Google search – tag: 001
GeoName – flowchart
1. Pinyin place name input; user indicates PRC or WG system.
2. Pinyin segmentation; map to all possible GB character strings. tag 000
3. Bilingual table (4k) lookup. tag 100
4. Merge GB candidates.
5. Monolingual name list (84k) confirmation. tag 110, 010
6. WWW confirmation. tag 101, 001; tag 111, 011 if also confirmed in step 5
7. Evaluate weight g(C); rank according to: (1) tag, (2) name character length, (3) g(C).
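The tagging and final ranking steps can be sketched roughly as below. This is not the GeoName implementation: it assumes that tags are compared as binary numbers with the highest first, that shorter names precede longer ones, and that g(C) breaks remaining ties in descending order. The three lookup sets are hypothetical stand-ins for the bilingual table, the monolingual list and the web check. The scores in the usage lines are taken from the Chagugang example in this talk.

```python
def tag(candidate, bilingual, monolingual, web_hits):
    """Three-character tag: bilingual list, monolingual list, web confirmation."""
    bits = (candidate in bilingual, candidate in monolingual, candidate in web_hits)
    return "".join("1" if b else "0" for b in bits)

def rank(scored, bilingual, monolingual, web_hits):
    """Sort (candidate, g) pairs by tag, then name length, then g(C)."""
    def key(item):
        cand, score = item
        t = tag(cand, bilingual, monolingual, web_hits)
        # Assumed ordering: higher tag first, shorter name first, higher g first.
        return (-int(t, 2), len(cand), -score)
    return sorted(scored, key=key)

scored = [("查古港", 15.68423107), ("汊沽港", 1.38629436), ("诧古港", 9.24647942)]
ordered = rank(scored, bilingual=set(), monolingual=set(), web_hits={"汊沽港"})
print([cand for cand, _ in ordered])  # ['汊沽港', '查古港', '诧古港']
```

With the tag as the primary key, a web-confirmed candidate is placed ahead of unconfirmed ones even when its g(C) is much lower.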
GeoName – Evaluation
Evaluate system result using:
tag=000, rank by g(C)
tag=001, web confirmation + g(C)
tag=010, mono-list confirmation + g(C)
tag=111, bi-list + all above
[Figure: GeoName Evaluation - Various Methods (back-transliterating 162 Pinyin geographic names). x-axis: Rank (1 to 10, >10); y-axis: Cumulative # Correct at Rank (60 to 180); series: freq only (000), freq+web (001), freq+mono.list (010), all (111); data labels: 48%, 70%, 74%, 82%, 72%, 83%, 86%, 79%]
Example of back-transliteration: web & no-web
Tag = 111 (with web confirmation)
Chagugang
tag  g(C)         candidate
001  1.38629436   汊沽港
000  15.68423107  查古港
000  9.24647942   诧古港
000  9.24647942   岔古港
000  8.55333224   锸古港
000  8.55333224   槎古港
000  8.55333224   楂古港
000  8.55333224   汊古港
000  8.55333224   嚓古港
000  8.55333224   刹古港
Tag = 110 (without web confirmation)
Chagugang
tag  g(C)         candidate
000  15.68423107  查古港
000  9.24647942   诧古港
000  9.24647942   岔古港
000  8.55333224   锸古港
000  8.55333224   槎古港
000  8.55333224   楂古港
000  8.55333224   汊古港
000  8.55333224   嚓古港
000  8.55333224   刹古港
000  8.55333224   差古港
Examples:
Luliangqu (district/region)
tag  g(C)         candidate
010  40.02587171  吕梁区
000  9.24647942   吕梁瞿
000  9.24647942   吕梁衢
000  9.24647942   吕梁渠
000  9.24647942   吕梁曲
000  9.24647942   陆良瞿
000  9.24647942   陆良衢
000  9.24647942   陆良渠
000  9.24647942   陆良曲
000  9.24647942   陆良区
Xiaoyishi (city)
tag  g(C)         candidate
110  40.18588115  孝义市
000  9.24647942   孝尾市
000  9.24647942   萧尾市
000  8.55333224   箫尾市
000  8.55333224   筱尾市
000  8.55333224   骁尾市
000  8.55333224   潇尾市
000  8.55333224   崤尾市
000  8.55333224   哓尾市
000  8.55333224   效尾市
Yimaxiang (village)
tag  g(C)         candidate
000  15.68423107  义马乡
000  9.24647942   义马缃
000  9.24647942   义马巷
000  9.24647942   义马祥
000  9.24647942   义马湘
000  9.24647942   义马襄
000  9.24647942   义马香
000  9.24647942   伊玛缃
000  9.24647942   伊玛巷
000  9.24647942   伊玛祥
Mengnanzhuang (place)
tag  g(C)         candidate
000  14.95494484  蒙南庄
000  8.51719319   懵南庄
000  8.51719319   孟南庄
000  8.51719319   盟南庄
000  8.51719319   萌南庄
000  7.82404601   虻南庄
000  7.82404601   勐南庄
000  7.82404601   梦南庄
000  7.82404601   猛南庄
000  7.82404601   锰南庄
Conclusion:
• reasonable back-transliteration results for map cities
• longer names (>2 char): more errors
• non-Pinyin names: the approach does not work
Future Work:
• increase training data
• improve ranking function
• direct translation (not just confirmation) using web
• better/more realistic evaluation
If interested:
can demonstrate GeoName (needs Linux re-boot)
Try GeoName at:
http://post.cs.qc.edu/spell2gb/ (needs Chinese character display)
feedback appreciated
Thank You!