![Page 1: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/1.jpg)
Language Archiving- Document Annotation and Corpus Linguistics
Keh-Jiann Chen
Institute of Information Science
Academia Sinica
![Page 2: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/2.jpg)
The goals of NDAP are (quoted from [Hsieh 2002, “Digital Media, Informatics, and Cultural Heritage”]):
Preserving national cultural collections.
Popularizing fine cultural holdings.
Strengthening cultural heritage as well as guiding cultural development.
Popularizing knowledge and improving information sharing.
Enhancing education and learning.
Bootstrapping cultural and value-added industries.
Improving literacy, creativity and quality of life.
Promoting international cooperation and resource sharing.
![Page 3: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/3.jpg)
Space, Time and Language Coordinates for Digital Archives
[Figure: digital archives placed on three coordinates: Language, Time, and Space. Labels: Language in Time, Language in Space, Language in Text, in Speech..., Historical GIS, Language changes, Language variations.]
Digital Archives and TSL coordinates (quoted from [Hsieh 2002, “Digital Media, Informatics, and Cultural Heritage”])
![Page 4: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/4.jpg)
Language Archiving is a Collection of Linguistic Resources
Collection of a linguistic archive (such as a balanced corpus) is guided by a set of design criteria.
Design criteria define natural classes of texts in a collection.
Each criterion establishes a dimension for comparative studies.
www.sinica.edu.tw/SinicaCorpus
![Page 5: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/5.jpg)
How to make a single archive more versatile
One Corpus or Many Corpora?
Or How to make a Balanced Corpus Biased?
With Textual Markup Information (e.g. Metadata)
genre, style, mode, topic, medium etc.
word, part-of-speech, structure tags, semantic tags
Alignment for heterogeneous corpora
![Page 6: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/6.jpg)
Creating Synergy from Uniform Resource Type
Each document is marked up with textual description features: topic, style etc.
Each feature selects a subset of documents.
Sub-corpora (or new archives) can be created online according to the user's specification.
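The idea that each feature selects a subset of documents, and that a sub-corpus is the combination of such selections, can be sketched as follows. This is a minimal illustration only: the `Document` class, the `select_subcorpus` helper, and the three sample records are hypothetical and are not part of the Sinica Corpus software.

```python
from dataclasses import dataclass


@dataclass
class Document:
    text_id: str
    features: dict  # hypothetical markup, e.g. {"genre": "reportage", "mode": "written"}
    text: str


def select_subcorpus(corpus, **criteria):
    """Keep the documents whose features match every requested value."""
    return [
        doc for doc in corpus
        if all(doc.features.get(name) == value for name, value in criteria.items())
    ]


# Made-up records, for illustration only.
corpus = [
    Document("d1", {"genre": "reportage", "mode": "written", "medium": "newspaper"}, "..."),
    Document("d2", {"genre": "conversation", "mode": "spoken", "medium": "interactive speech"}, "..."),
    Document("d3", {"genre": "commentary", "mode": "written", "medium": "newspaper"}, "..."),
]

news_corpus = select_subcorpus(corpus, medium="newspaper")
written_reportage = select_subcorpus(corpus, genre="reportage", mode="written")
```

Each keyword argument narrows the selection, so adding criteria corresponds to intersecting the subsets picked out by the individual features.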
![Page 7: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/7.jpg)
Creating Synergy from Uniform Resource Type
Classical Chinese Corpora http://www.sinica.edu.tw/~tibe/2-words/old-words/index.html
Corpus of Formosan Austronesian Languages: under construction, part of the National Digital Archive Initiative
Lexical Databases of other Sino-Tibetan and Tibeto-Burmese Languages
![Page 8: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/8.jpg)
Creating Synergy from Heterogeneous Resource Type
Bi-lingual or multi-lingual corpora
Text and speech aligned corpora
Synchronized corpora collected from different areas
![Page 9: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/9.jpg)
How to create a balanced corpus?
Creation of the Sinica Corpus: a word-segmented modern Chinese corpus with POS tagging
![Page 10: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/10.jpg)
Introduction
TEI: A corpus is a body of texts put together in a principled way, typically in order to construct a sample of a given language or sublanguage.
It must be representative and balanced if it claims to faithfully represent the facts in that language or sublanguage [Sinclair 87].
![Page 11: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/11.jpg)
Introduction
Sinica balanced corpus
Texts are classified according to 5 different features: (1) Genre (2) Style (3) Mode (4) Topic (5) Medium
Word segmentation standard: Segmentation standard for Chinese language processing
http://godel.iis.sinica.edu.tw/ROCLING/juhuashu1.htm
Part-of-speech tagging: 46 syntactic categories
![Page 12: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/12.jpg)
Genre (written): reportage, commentary, advertisement, letter, announcement, fiction, prose, biography & diary, poetry, analects, manual
Genre (spoken): script, conversation, speech, meeting minutes
Style: narration, argumentation, exposition, description
Mode: written, written-to-be-read, written-to-be-spoken, spoken, spoken-to-be-written
Topic: philosophy, natural sciences, social sciences, fine arts, general/leisure, literature
Medium: newspaper, general magazine, academic journal, textbook, reference book, thesis, general book, audio/visual media, interactive speech
![Page 13: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/13.jpg)
Sinica Corpus topic distribution
philosophy 10%
natural sciences 10%
social sciences 35%
fine arts 5%
general/leisure 20%
literature 20%
![Page 14: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/14.jpg)
%% 文類 Genre= 報導 reportage
%% 文體 Style= 記敘 Description
%% 語式 Mode= written
%% 主題 Topic= 訊息 Message
%% 媒體 Medium= 報紙 Newspaper
%% 姓名 Author’s name=
%% 性別 Gender= 男女
%% 國籍 Nationality= 中華民國 Chinese
%% 母語 Mother tongue= 中文 Chinese
%% 出版單位 Publisher= 中研院週報 Academia Sinica
%% 出版地 Place= 台北市台灣 Taipei Taiwan
%% 出版日期 date=1994
%% 版次 version=
%% 標題 Title= 國史研習會:中國宗教與社會
1. 。 (PERIODCATEGORY) 由 (P) 本 (Nes) 院 (Nc) 歷史 (Na) 語言 (Na) 研究所 (Nc) 主辦 (VC) , (COMMACATEGORY)
***********************************************
2. , (COMMACATEGORY) 台灣 (Nc) 大學 (Nc) 歷史系 (Nc) 暨 (Caa) 研究所 (Nc) 與 (Caa) 清華 (Nb) 大學 (Nc) 歷史系 (Nc) 暨 (Caa) 研究所 (Nc) 協辦 (VC) 之 (DE) 「 (PARENTHESISCATEGORY) 國史(Na) 研習會 (Na) : (COLONCATEGORY)
***********************************************
3. : (COLONCATEGORY) 中國 (Nc) 宗教 (Na) 與 (Caa) 社會 (Na) 」 (PARENTHESISCATEGORY) ,(COMMACATEGORY)
***********************************************
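A file in the layout shown above could be read with two small regular expressions, one for the `%%` header lines and one for the `word (TAG)` tokens. This is only a sketch under the assumption that every file follows the sample exactly; the function names `parse_header` and `parse_tokens` are hypothetical, and the real corpus files may contain constructs (such as feature marks in square brackets) that this pattern does not handle.

```python
import re

# "%% <Chinese label> <English label>= <value>", as in the sample header lines.
HEADER = re.compile(r"^%%\s*(\S+)\s+(.+?)\s*=\s*(.*)$")
# A word followed by its tag in parentheses, with or without a space between them.
TOKEN = re.compile(r"(\S+?)\s*\(([A-Za-z]+)\)")


def parse_header(line):
    """Return (english_label, value) for a header line, or None if it is not one."""
    match = HEADER.match(line)
    if match is None:
        return None
    _chinese_label, english_label, value = match.groups()
    return english_label, value.strip()


def parse_tokens(line):
    """Return the (word, tag) pairs found in one line of tagged text."""
    return TOKEN.findall(line)


header = parse_header("%% 文類 Genre= 報導 reportage")
tokens = parse_tokens("1. 。 (PERIODCATEGORY) 由 (P) 本 (Nes) 院 (Nc)")
```

The leading sentence number (`1.`) is skipped because it is not followed by a parenthesized tag.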
![Page 15: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/15.jpg)
Design of Corpus Construction and Management System
![Page 16: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/16.jpg)
Introduction
Motivations for designing a corpus management system:
It is hard to collect, maintain, classify, and tag a large amount of text without a management system.
Automate the word segmentation and tagging processes.
Maintain the precision and consistency of data collection.
Handle out-of-vocabulary words.
![Page 17: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/17.jpg)
Database for Texts
[Figure: text database layout. Each record (record 1, record 2, …) has three fields: Text Id, features, and text. Other labels: Construction System, Tagged text.]
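One way to realise a text database with the three fields named on this slide (text id, features, text) is sketched below with Python's built-in `sqlite3` module. The table and column names, the separate `features` table, and the helper functions are all assumptions made for illustration; the slides do not describe the actual schema.

```python
import sqlite3


def build_db():
    """Create an in-memory database with a hypothetical two-table layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE texts (
            text_id     TEXT PRIMARY KEY,
            raw_text    TEXT,
            tagged_text TEXT
        );
        CREATE TABLE features (
            text_id TEXT REFERENCES texts(text_id),
            name    TEXT,
            value   TEXT
        );
        """
    )
    return conn


def add_text(conn, text_id, raw_text, tagged_text, features):
    """Store one record together with its descriptive features."""
    conn.execute("INSERT INTO texts VALUES (?, ?, ?)", (text_id, raw_text, tagged_text))
    conn.executemany(
        "INSERT INTO features VALUES (?, ?, ?)",
        [(text_id, name, value) for name, value in features.items()],
    )


def find_by_feature(conn, name, value):
    """Return the ids of all texts carrying the given feature value."""
    rows = conn.execute(
        "SELECT text_id FROM features WHERE name = ? AND value = ? ORDER BY text_id",
        (name, value),
    )
    return [row[0] for row in rows]


conn = build_db()
add_text(conn, "t1", "台大本學期舉辦減重班",
         "台大 (Nc) 本 (Nes) 學期 (Na) 舉辦 (VC) 減重班 (Na)",
         {"genre": "reportage", "medium": "newspaper"})
add_text(conn, "t2", "...", "...", {"genre": "conversation", "medium": "interactive speech"})
newspaper_ids = find_by_feature(conn, "medium", "newspaper")
```

Keeping features in their own table lets a query select texts by any feature without changing the schema when a new feature is introduced.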
![Page 18: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/18.jpg)
Construction Flow
[Figure: construction flow diagram. Components: Internet (WWW), Text Collection Module, Text Files, Unknown Word Identification Module, Word Segmentation and POS-tagging Module, Inspection System, New Word Editor, Tagged Text Editor, Domain Lexicons, Text Database (SQL). Data passed between them: text, text & new words, tagged text, revised tagged text, revised new words.]
![Page 19: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/19.jpg)
Text Collection Module
Purpose: semi-automatically collect various texts from the WWW.
Features: automatic feature extraction and document classification.
![Page 20: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/20.jpg)
![Page 21: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/21.jpg)
Unknown Word Identification Module
Identify new words before word segmentation.
Methods:
Detect the existence of unknown words.
Apply statistical rules and morphological rules to identify unknown words.
![Page 22: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/22.jpg)
![Page 23: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/23.jpg)
Word Segmentation & Tagging Module
Based on the word segmentation standard for information processing, the segmentation program segments the input text and tags the resulting words with their parts of speech.
Method: word matching based on the lexicon and newly identified words.
Segmentation process: longest matching, with heuristic rules to resolve segmentation ambiguities.
POS tagging: a bi-gram model resolves POS ambiguities.
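The longest-matching step can be illustrated with a forward maximum-matching segmenter. This is a minimal sketch, not the module's actual implementation: it omits the heuristic disambiguation rules and the POS tagging, the function name and the `max_len` parameter are hypothetical, and the tiny lexicon is built from an example sentence that appears elsewhere in these slides.

```python
def longest_match_segment(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation: at each position take the longest lexicon word.

    Characters that start no lexicon word are emitted as single-character words.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


# Toy lexicon for illustration; a real system would load general and domain lexicons.
lexicon = {"台大", "本", "學期", "舉辦", "減重班"}
segments = longest_match_segment("台大本學期舉辦減重班", lexicon)
```

Without 減重班 in the lexicon, the same sentence falls apart into single characters at the end, which is why newly identified words are added to the lexicon before segmentation.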
![Page 24: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/24.jpg)
Word Segmentation & Tagging Module (cont)
Additional features:
Incorporate a user-defined dictionary or domain dictionary to enhance word segmentation accuracy.
Domain dictionary: e.g. a medical dictionary or a dictionary of computing terminology.
Extracted unknown words: new words, such as personal names, constantly occur in text. The unknown word identification process extracts them, and they supplement the dictionary.
![Page 25: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/25.jpg)
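[Figure: word segmentation and tagging. Inputs: text, general lexicon, domain lexicon, unknown words extracted from text. Output: tagged text.]
Example input: 台大本學期舉辦減重班 (National Taiwan University is running a weight-loss class this semester)
Example output: 台大 (Nc) 本 (Nes) 學期 (Na) 舉辦 (VC) 減重班 (Na)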
![Page 26: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/26.jpg)
Inspection System
Purpose: to assure the quality of the corpus collection, automatically processed texts need to be verified by human experts. The inspection system was designed to speed up this verification.
Major functions:
Editing functions: errors in word breaks, POS tags, features, and sentence breaks can be fixed with a mouse click.
Reminder functions: the system highlights common errors, prefixes, and suffixes in the text.
Short-term memory: the system recalls the most recent modifications and fixes errors of the same type automatically.
![Page 27: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/27.jpg)
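Inspection System (cont)
Provide lexical information and examples:
Friendly user interface:
[Figure: system architecture. Labels: user (使用者), Web Server, SQL Server, corpus under construction (欲構建之語料庫), lexicon (詞典), previous versions of the corpus (舊版本之語料庫).]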
![Page 28: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/28.jpg)
![Page 29: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/29.jpg)
J 塑膠 (Na) 皮 (Na)→ 塑膠皮 (Na)
J 公文 (Na) 包 (VC)→ 公文包 (Na)
J 村 (Nc) 上 (Ncd)→ 村上 (Nb)
J 毛利 (Na) 遜 (VH)→ 毛利遜 (Nb)
J 吉姆 (Nb) 毛利遜 (Nb)→ 吉姆毛利遜 (Nb)
D 世界級 (Na)→ 世界 (Nc) 級 (Na)
D 科學方法 (Na)→ 科學 (Na) 方法 (Na)
D 三代 (Nd)→ 三 (Neu) 代 (Na)
D 交互作用 (Na)→ 交互 (VH) 作用 (Na)
D 如一 (VH)→ 如 (P) 一 (Neu)
C 改變 (VC)→ 改變 (Na)
C 傳統 (VH)→ 傳統 (Na)
C 企畫 (VC)→ 企畫 (Na)
C 自然 (D)→ 自然 (VH)
C 起來 (VA)→ 起來 (Di)
F 反射 (VJ)→ 反射 (VJ)[+nom]
F 遮雨 (VA)→ 遮雨 (VA)[+nom]
F 保持 (VJ)[+nom]→ 保持 (VJ)
F 萊特班 (Na)→ 萊特班 (Na)[+prop]
F 感動 (VHC)→ 感動 (VHC)[+nom]
![Page 30: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/30.jpg)
Corpus Management System
Advantages:
The corpus management system speeds up the construction process and reduces human effort.
It also increases the precision and consistency of word segmentation and POS tagging.
The database system supports searching, managing, retrieving, and reorganizing texts.
![Page 31: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/31.jpg)
Using Corpora
Reorganizing sub-corpora
Searching tools
![Page 32: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/32.jpg)
![Page 33: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/33.jpg)
![Page 34: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/34.jpg)
Reorganizing sub-corpora
Sub-corpora can be reorganized according to different features.
Sport corpus
Spoken corpus
Corpus of the most recent three months
News corpus
Corpus of poetry
![Page 35: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/35.jpg)
Corpus Searching Tools
[Figure: search flow. Labels: key word vectors, Key Word in Context (KWIC) search, KWIC file, filtering and sorting, statistics, collocation, display, print, or store.]
![Page 36: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/36.jpg)
Corpus Searching Tools
KWIC search
Key word vector        What is matched
[代表, N, φ, φ]        every word 代表 (daibiao) tagged with the POS noun
[φ, VA, φ, 1]          all monosyllabic intransitive verbs (VA)
[φ, φ, +fw, φ]         all foreign words
[..化, V, φ, 3]        all tri-syllabic verbs with the suffix 化 (hua, '-ize')
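The key word vectors above read as four-slot patterns of the form [word, POS, feature, length], with φ as a wildcard. A matcher for such patterns might look like the sketch below. Several points are assumptions inferred from the examples rather than stated in the slides: the POS slot is treated as a tag prefix (so `N` matches `Na`, `Nc`, and so on), `..化` is written as the glob pattern `*化`, the length is counted in characters, and `None` stands in for φ. The token list and its tags are made up for illustration.

```python
from fnmatch import fnmatchcase


def matches(token, vector):
    """Check one (word, pos, features) token against a [word, pos, feature, length] vector."""
    word, pos, features = token
    word_pattern, pos_prefix, feature, length = vector
    return (
        (word_pattern is None or fnmatchcase(word, word_pattern))
        and (pos_prefix is None or pos.startswith(pos_prefix))
        and (feature is None or feature in features)
        and (length is None or len(word) == length)
    )


def search(tokens, vector):
    """Return every token that satisfies all non-wildcard slots of the vector."""
    return [token for token in tokens if matches(token, vector)]


# Illustrative tokens only; the tags are not taken from the corpus.
tokens = [
    ("代表", "Na", set()),
    ("代表", "VC", set()),
    ("走", "VA", set()),
    ("現代化", "VH", set()),
    ("DNA", "Na", {"+fw"}),
]

noun_daibiao = search(tokens, ("代表", "N", None, None))
mono_va = search(tokens, (None, "VA", None, 1))
foreign = search(tokens, (None, None, "+fw", None))
hua_verbs = search(tokens, ("*化", "V", None, 3))
```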
![Page 37: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/37.jpg)
![Page 38: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/38.jpg)
Corpus Searching Tools
Filtering
The filtering methods include: random sampling, removing redundant samples, and removing irrelevant samples by restricting the content in the window of key words.
Displaying, printing, and storing
The resulting KWIC files can be displayed on screen, printed, or stored for future processing.
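A KWIC search followed by two of the filters named above (removing redundant samples and random sampling) can be sketched as follows. The function names, the tuple layout of a KWIC line, and the English toy sentence are hypothetical choices for this illustration, not the tool's actual interface.

```python
import random


def kwic(tokens, keyword, window=3):
    """Return (left context, key word, right context) for each occurrence of keyword."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tuple(tokens[max(0, i - window):i])
            right = tuple(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines


def remove_redundant(lines):
    """Drop KWIC lines whose context is identical to an earlier line, keeping order."""
    seen = set()
    unique = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            unique.append(line)
    return unique


def random_sample(lines, k, seed=0):
    """Pick at most k lines at random; the fixed seed makes the sample repeatable."""
    rng = random.Random(seed)
    return rng.sample(lines, min(k, len(lines)))


tokens = "the cat sat on the mat and the cat sat on the rug".split()
wide = kwic(tokens, "cat", window=2)
narrow = remove_redundant(kwic(tokens, "cat", window=1))
```

With a window of 2 the two occurrences of the key word have different left contexts; with a window of 1 they become identical and the redundancy filter keeps only one.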
![Page 39: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/39.jpg)
![Page 40: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/40.jpg)
Corpus Searching Tools
Statistics:
Statistic functions provide statistical distributions of words and categories occurring within the context window of key words.
For instance, the category distribution of the word 把 (ba):
Category           Tag     Frequency   %
preposition        P       2704        92.57
measure            Nf      211         7.22
transitive verb    VC      3           0.10
determiner         Neqb    2           0.07
noun               Na      1           0.03
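A category distribution like the one for 把 can be computed by counting the tags a word receives and dividing by the total. The sketch below assumes the corpus is available as a list of (word, tag) pairs; the function name is hypothetical, and the token list is rebuilt from the frequencies in the table rather than read from the corpus.

```python
from collections import Counter


def category_distribution(tagged_tokens, word):
    """Return (tag, frequency, percentage) rows for one word, most frequent first."""
    counts = Counter(tag for w, tag in tagged_tokens if w == word)
    total = sum(counts.values())
    return [(tag, n, round(100 * n / total, 2)) for tag, n in counts.most_common()]


# Token list reconstructed from the frequencies in the table, for illustration.
tagged_tokens = (
    [("把", "P")] * 2704
    + [("把", "Nf")] * 211
    + [("把", "VC")] * 3
    + [("把", "Neqb")] * 2
    + [("把", "Na")] * 1
)
distribution = category_distribution(tagged_tokens, "把")
```

The frequencies sum to 2921, so the preposition reading accounts for 2704 / 2921, or 92.57% of the occurrences.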
![Page 41: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/41.jpg)
![Page 42: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/42.jpg)
Corpus Searching Tools
Collocation finding
The system finds collocations of the key words by computing the mutual information [Church & Hanks 90] of the key words with the words or parts-of-speech in a user-defined window.
$$I(x, y) = \log \frac{P(x, y)}{P(x)\,P(y)}$$
I(x,y) >> 0 : x,y are strongly associated.
I(x,y) ≈ 0 : x,y are unrelated.
I(x,y) << 0 : x,y are mutually exclusive.
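The mutual information score can be estimated directly from counts. The sketch below makes several assumptions that the slide leaves open: the logarithm is taken to base 2, a co-occurrence is counted when y appears within `window` tokens after x, and all probabilities are relative frequencies over the N tokens of the corpus. The function name and the toy token list are hypothetical.

```python
import math
from collections import Counter


def mutual_information(tokens, x, y, window=5):
    """Estimate I(x, y) = log2(P(x, y) / (P(x) * P(y))) from raw counts."""
    n = len(tokens)
    counts = Counter(tokens)
    # Count occurrences of y in the `window` tokens that follow each x.
    pair_count = sum(
        tokens[i + 1:i + 1 + window].count(y)
        for i, token in enumerate(tokens)
        if token == x
    )
    if pair_count == 0:
        return float("-inf")  # x and y never co-occur in the window
    p_xy = pair_count / n
    p_x = counts[x] / n
    p_y = counts[y] / n
    return math.log2(p_xy / (p_x * p_y))


tokens = ["a", "b", "a", "b", "c", "d", "c", "d"]
score_ab = mutual_information(tokens, "a", "b", window=1)
score_ad = mutual_information(tokens, "a", "d", window=1)
```

In the toy data `a` is always followed by `b`, so the pair scores well above zero, while `a` and `d` never co-occur and receive negative infinity.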
![Page 43: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/43.jpg)
Examples
The top 16 collocations of '威脅' (threat) within a window of distance 10:
1. 飽受 2. 恫嚇 3. 綑綁 4. 構成 5. 嚴重 6. 崩坍 7. 恐怖 8. 恐嚇 9. 遭受 10. 刀槍 11. 滾滾 12. 安全 13. 尖刀 14. 健康 15. 成全 16. 備受
![Page 44: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/44.jpg)
![Page 45: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/45.jpg)
![Page 46: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/46.jpg)
Corpus Linguistics
A corpus provides ample examples of word uses and syntactic patterns. It also reflects the real uses of the language and their frequency distribution.
Comparative studies can be made within KWIC results or between sub-corpora.
Automatic knowledge extraction techniques can be applied to a corpus to reduce manual effort.
![Page 47: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/47.jpg)
Lexicography
A corpus provides ample examples of different word uses and syntactic patterns.
A corpus reflects the real uses of the language and their frequency distribution.
Collocations show idiomatic patterns, and they are the most important uses of a word.
Examples can be extracted from corpora.
Senses and syntactic functions can be ordered according to their frequencies.
CoBuild, Oxford, EDR, and the Collocation Dictionary of Noun and Measure Words are examples of using corpora for editing dictionaries.
![Page 48: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/48.jpg)
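Language Modeling
Markov language model: the probabilities are estimated from corpora.
$$P(W_1 W_2 \dots W_m) = P(W_1)\,P(W_2 \mid W_1)\,P(W_3 \mid W_1 W_2) \cdots P(W_m \mid W_1 W_2 \dots W_{m-1})$$
N-gram model:
$$P(W_1 W_2 \dots W_m) \approx P(W_1)\,P(W_2 \mid W_1)\,P(W_3 \mid W_1 W_2) \cdots P(W_m \mid W_{m-n+1}, \dots, W_{m-1})$$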
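For n = 2 the N-gram approximation reduces to a bi-gram model whose conditional probabilities are relative frequencies counted from a corpus. The sketch below is a minimal, unsmoothed illustration: the sentence-boundary markers `<s>` and `</s>`, the function names, and the two training sentences are assumptions introduced here, and unseen bi-grams simply receive probability zero.

```python
from collections import Counter

BOS, EOS = "<s>", "</s>"  # hypothetical sentence-boundary markers


def train_bigram(sentences):
    """Count contexts and bi-grams over sentences padded with boundary markers."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        padded = [BOS] + list(sentence) + [EOS]
        context_counts.update(padded[:-1])
        bigram_counts.update(zip(padded, padded[1:]))
    return context_counts, bigram_counts


def sentence_probability(sentence, context_counts, bigram_counts):
    """Multiply the maximum-likelihood estimates P(w_i | w_{i-1}) along the sentence."""
    padded = [BOS] + list(sentence) + [EOS]
    prob = 1.0
    for prev, word in zip(padded, padded[1:]):
        if context_counts[prev] == 0:
            return 0.0  # the context was never seen in training
        prob *= bigram_counts[(prev, word)] / context_counts[prev]
    return prob


training = [["I", "like", "tea"], ["I", "like", "coffee"]]
contexts, bigrams = train_bigram(training)
p_seen = sentence_probability(["I", "like", "tea"], contexts, bigrams)
p_unseen = sentence_probability(["tea", "like", "I"], contexts, bigrams)
```

A practical model would add smoothing so that unseen word sequences do not receive zero probability.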
![Page 49: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/49.jpg)
Language Modeling
Applications of language modeling:
Input methods: speech recognition, character recognition, spelling check, phonetic input, …
Data compression: Huffman coding, arithmetic coding, …
Categorization: text classification, POS tagging, sense disambiguation, word segmentation, …
![Page 50: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/50.jpg)
Machine Translation
IBM [Brown et al. 1990] used the bilingual Hansard corpus to build translation models.
Translating a French sentence F into an English sentence E is equivalent to finding the E that maximizes P(E)*P(F|E).
P(E) is estimated from a bi-gram model.
P(F|E) is estimated from an aligned bilingual corpus.
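The selection step of this statistical approach can be sketched with the noisy-channel decomposition, in which the language model scores the English sentence, P(E), and the translation model scores the French sentence given the English one, P(F|E). Everything in the example is hypothetical: the function name, the candidate list, and the probability values are made-up toy numbers, and a real system would search over a vast candidate space rather than a short list.

```python
def best_translation(french, candidates, language_model, translation_model):
    """Return the candidate E that maximizes P(E) * P(F | E)."""
    def score(english):
        p_e = language_model.get(english, 0.0)
        p_f_given_e = translation_model.get((french, english), 0.0)
        return p_e * p_f_given_e
    return max(candidates, key=score)


# Toy probabilities for illustration only.
language_model = {"the cat": 0.01, "cat the": 0.0001}
translation_model = {
    ("le chat", "the cat"): 0.4,
    ("le chat", "cat the"): 0.4,
}

choice = best_translation("le chat", ["cat the", "the cat"], language_model, translation_model)
```

Both candidates explain the French words equally well under the toy translation model, so the language model decides in favour of the fluent word order.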
![Page 51: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/51.jpg)
Conclusion
A language archive is not only a most important part of cultural heritage but also a most important resource for language research.
Computer tools make archiving more efficient and manageable.
Everyone can access the archive through the WWW.
![Page 52: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/52.jpg)
Websites: Corpora and Archives
Sinica Corpus (Academia Sinica Balanced Corpus of Modern Chinese)
www.sinica.edu.tw/SinicaCorpus
Academia Sinica Classical Chinese Corpora: Early Mandarin
www.sinica.edu.tw/Early_Mandarin
Academia Sinica Formosan Language Archive: Rukai(Mantauran)
www.ling.sinica.edu.tw/formosan
![Page 53: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/53.jpg)
Websites: Digital Museums
Chinese Language KnowledgeNets
WenGuo: Adventures in Wen-Land
http://www.sinica.edu.tw/wen
SouWenJieZi
http://www.dmpo.sinica.edu.tw/~words
![Page 54: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/54.jpg)
Academia Sinica Balanced Corpus of Mandarin Chinese (Sinica Corpus)
5 million words, segmented and tagged
Direct WWW Access: http://www.sinica.edu.tw/ftms-bin/kiwi.sh
License Information: http://rocling.iis.sinica.edu.tw/ROCLING/corpus98/sinicor_E.htm
![Page 55: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/55.jpg)
Sinica Treebank 1.0
38,725 Trees
239,532 Words
Direct WWW Access (1000 sample trees): http://godel.iis.sinica.edu.tw/CKIP/trees1000.htm
License Information: http://rocling.iis.sinica.edu.tw/ROCLING/Treebank/Treebank-E.htm
![Page 56: Language Archiving- Document Annotation and Corpus Linguistics](https://reader034.vdocument.in/reader034/viewer/2022051316/568148f0550346895db60f43/html5/thumbnails/56.jpg)
Mandarin-Across-Taiwan (MAT) Speech Database
Speech files are collected through telephone networks. The content includes spontaneous speech (short answering statements) and read speech (numbers, Mandarin syllables, words of 2 to 4 syllables, phonetically balanced sentences).
MAT-160 (160 speakers)
MAT-2000
http://rocling.iis.sinica.edu.tw/ROCLING/MAT/index_cf.htm