1
Chinese Term Extraction Based on Delimiters
Yuhang Yang, Qin Lu, Tiejun Zhao
School of Computer Science and Technology, Harbin Institute of Technology
Department of Computing, The Hong Kong Polytechnic University
May, 2008
3
Basic Concepts
Terms (terminology): lexical units carrying the most fundamental knowledge of a domain
Term extraction consists of two steps:
- Term candidate extraction (unithood)
- Terminology verification (termhood)
4
Major Problems
Term boundary identification based on term features:
- Fewer features are not enough; more features lead to more conflicts
Limitations in scope:
- Low-frequency terms
- Long compound terms
- Dependency on Chinese segmentation
5
Main Idea
Delimiter-based term candidate extraction: identify the relatively stable and domain-independent words immediately before and after terms

扫描隧道显微镜是一种基于量子隧道效应的高分辨率显微镜
(A scanning tunneling microscope is a kind of high-resolution microscope based on the quantum tunneling effect)
社会主义制度是中华人民共和国的根本制度
(The socialist system is the fundamental system of the People's Republic of China)

Potential advantages of the proposed approach:
- No strict limits on frequency or word length
- No need for full segmentation
- Relatively domain independent
6
Related Work: Statistics-based Measures

Internal measures (Schone and Jurafsky, 2001): internal association between the constituents of a candidate, such as frequency and mutual information
Contextual measures: dependency of a candidate on its context:
- Left/right entropy (Sornlertlamvanich et al., 2000)
- Left/right context dependency (Chien, 1999)
- Accessor variety criteria (Feng et al., 2004)
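The left/right entropy measure cited above can be sketched in a few lines. This is a minimal, character-level illustration (a real system would run over a tokenized corpus); the corpus string and candidate below are invented for the example:

```python
import math
from collections import Counter

def boundary_entropy(corpus: str, candidate: str, side: str = "left") -> float:
    """Entropy of the characters immediately adjacent to `candidate`.

    High entropy on both sides means the candidate occurs in many
    different contexts, a classic signal that it is a complete
    lexical unit (high unithood).
    """
    neighbors = Counter()
    start = corpus.find(candidate)
    while start != -1:
        pos = start - 1 if side == "left" else start + len(candidate)
        if 0 <= pos < len(corpus):
            neighbors[corpus[pos]] += 1
        start = corpus.find(candidate, start + 1)
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in neighbors.values())

# "ab" has three distinct left neighbors (x, z, w), so its left
# entropy is log2(3); a candidate that never occurs scores 0.
left = boundary_entropy("xaby zabc wabd", "ab", "left")
```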
7
Hybrid Approaches
The UnitRate algorithm (Chen et al., 2006): occurrence probability + marginal variety probability
The TCE_SEF&CV algorithm (Ji et al., 2007): significance estimation function + C-value measure
Limitations:
- Data sparseness for low-frequency terms and long terms
- Cascading errors from full segmentation
8
Observations
Sentences are constituted of substantives and functional words
Domain-specific terms (terms for short) are more likely to be domain substantives
The predecessors and successors of terms are more likely to be functional words or general substantives connecting terms
These predecessors and successors are markers of terms, referred to as term delimiters (or simply delimiters)
9
Delimiter Based Term Extraction
Characteristics of delimiters:
- Mainly functional words and general substantives
- Relatively stable and domain independent
- Can be extracted more easily than terms
Proposed model:
- Identify features of delimiters
- Identify terms by finding their predecessors and successors as boundary words
10
Algorithm design
TCE_DI (Term Candidate Extraction – Delimiter Identification)
Input: Corpus_extract (domain corpus), DList (delimiter list)
(1) Partition Corpus_extract into character strings by punctuation marks.
(2) Partition the character strings by delimiters to obtain term candidates.
If a string contains no delimiter, the whole string is regarded as a term candidate.

[Diagram: a character string C1 ... Cn is split by delimiters D1 and D2 into term candidates TC1, TC2, TC3]
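The two partitioning steps of TCE_DI can be sketched directly with regular expressions. This is a minimal reading of the algorithm, not the authors' implementation; the punctuation class and the sample delimiter set {是, 的} are illustrative assumptions:

```python
import re

# Illustrative punctuation class (step 1 splits on punctuation).
PUNCT = r"[，。！？；：、,.!?;:\s]+"

def tce_di(corpus: str, dlist: set[str]) -> list[str]:
    """TCE_DI sketch: (1) split the corpus into character strings at
    punctuation; (2) split each string at delimiter words. A string
    containing no delimiter becomes a term candidate as a whole."""
    candidates = []
    for chunk in re.split(PUNCT, corpus):
        if not chunk:
            continue
        if dlist:
            # Longer delimiters first, so they win over their prefixes.
            pattern = "|".join(
                re.escape(d) for d in sorted(dlist, key=len, reverse=True))
            parts = re.split(pattern, chunk)
        else:
            parts = [chunk]
        candidates.extend(p for p in parts if p)
    return candidates

# The slide's example sentence, split by the delimiters 是 and 的:
cands = tce_di("社会主义制度是中华人民共和国的根本制度。", {"是", "的"})
# -> ["社会主义制度", "中华人民共和国", "根本制度"]
```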
11
Acquisition of DList
From a given stop word list:
- Produced by experts or derived from a general corpus
- No training is needed
DList_Ext algorithm:
- Given a training corpus Corpus_D_training and a domain lexicon Lexicon_Domain
12
The DList_Ext algorithm
S1: For each term Ti in Lexicon_Domain, mark Ti in Corpus_D_training as a lexical unit
S2: Segment the remaining text
S3: Extract the predecessors and successors of all Ti as delimiter candidates
S4: Remove all Ti from the delimiter candidates
S5: Rank the delimiter candidates by frequency and keep the top N_DI (a simple threshold)
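Steps S1 and S3-S5 can be sketched as follows. This simplification uses single characters as predecessors/successors instead of segmenting the remaining text into words (S2), so it is an approximation of the algorithm rather than a faithful implementation; the sample corpus and lexicon come from the slide's earlier example:

```python
from collections import Counter

def dlist_ext(corpus: str, lexicon: set[str], n_di: int) -> list[str]:
    """DList_Ext sketch: collect the tokens immediately before and
    after each known-term occurrence (S1, S3), remove tokens that are
    themselves terms (S4), and keep the N_DI most frequent (S5).
    Uses single characters as tokens; a real run would apply word
    segmentation to the non-term text (S2)."""
    counts = Counter()
    for term in lexicon:
        start = corpus.find(term)
        while start != -1:
            if start > 0:
                counts[corpus[start - 1]] += 1        # predecessor
            end = start + len(term)
            if end < len(corpus):
                counts[corpus[end]] += 1              # successor
            start = corpus.find(term, start + 1)
    for term in lexicon:                              # S4: drop terms
        counts.pop(term, None)
    return [w for w, _ in counts.most_common(n_di)]   # S5: rank, cut at N_DI

corpus = "社会主义制度是中华人民共和国的根本制度"
lexicon = {"社会主义制度", "中华人民共和国", "根本制度"}
delims = dlist_ext(corpus, lexicon, 2)  # -> 是 and 的
```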
13
Experiments:Data Preparation
Delimiter lists:
DList_IT: extracted using Corpus_IT_Small and Lexicon_IT
DList_Legal: extracted using Corpus_Legal_Small and Lexicon_Legal
DList_SW: 494 general stop words
14
Performance Measurements
Evaluation: precision (by sampling) & rate of new term extraction (R_NTE)
Reference algorithms:
- SEF&C-value (Ji et al., 2007) for term candidate extraction
- TFIDF (Frank et al., 1999) for both term candidate extraction and terminology verification
- LA_TV (Link Analysis based Terminology Verification) for fair comparison
precision = (N_Lexicon + N_New) / N_TCList
R_NTE = N_New / N_TCList
where N_TCList is the number of extracted term candidates, N_Lexicon the number of candidates found in the domain lexicon, and N_New the number of verified new terms.
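Under one reading of the slide's measures (precision counts candidates that are either in the domain lexicon or verified as new terms; R_NTE is the share of verified new terms among all candidates), the evaluation reduces to two ratios. The function name and toy inputs are mine:

```python
def term_metrics(tc_list: list[str],
                 lexicon: set[str],
                 verified_new: set[str]) -> tuple[float, float]:
    """Assumed evaluation: precision = (N_Lexicon + N_New) / N_TCList,
    R_NTE = N_New / N_TCList, where N_TCList is the number of extracted
    candidates, N_Lexicon those found in the domain lexicon, and N_New
    the manually verified new terms."""
    n_lex = sum(1 for t in tc_list if t in lexicon)
    n_new = sum(1 for t in tc_list if t in verified_new)
    n_total = len(tc_list)
    return (n_lex + n_new) / n_total, n_new / n_total

# 4 candidates: 1 in the lexicon, 1 a verified new term, 2 noise.
precision, r_nte = term_metrics(["a", "b", "c", "d"], {"a"}, {"b"})
# -> precision 0.5, R_NTE 0.25
```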
15
Evaluation: DList_Ext algorithm: N_DI

Coverage of Delimiters on Different Corpora

                         Corpus_Legal_Large    Corpus_IT_Large
                         (11,048 sentences)    (60,508 sentences)
DList_IT (Top 100)            77.6%                 89.1%
DList_IT (Top 300)            84.6%                 92.6%
DList_IT (Top 500)            90.3%                 93.4%
DList_IT (Top 700)            92.7%                 93.9%
DList_Legal (Top 100)         95.8%                 92.6%
DList_Legal (Top 300)         97.8%                 96.2%
DList_Legal (Top 500)         98.7%                 96.8%
DList_Legal (Top 700)         99.1%                 97.1%
DList_SW                      98.1%                 98.1%
17
Evaluation: DList_Ext algorithm: N_DI
[Charts: performance of DList_IT and of DList_Legal on Corpus_IT_Large]
18
N_DI = 500
[Charts: performance of DList_IT and of DList_Legal on Corpus_Legal_Large]
20
Performance Analysis
Domain-independent and stable delimiters: extracted easily, and useful
Larger granularity of domain-specific terms: keeps many noisy strings out
Less frequency sensitivity: the method concentrates on delimiters, without regard to the frequencies of the candidates
21
Evaluation on New Term Extraction: R_NTE
[Chart: performance of different algorithms for new term extraction]
22
Error Analysis
Figure-of-speech phrases: "不难看出" (it is not difficult to see that ...), "新方法中" (in the new method)
General words: "思维状态" (mental state), "建筑" (architecture)
Long strings containing shorter terms: "访问共享资源" (access shared resources), "再次遍历" (traverse again)
23
Conclusion
A delimiter-based approach for term candidate extraction
Advantages:
- Less sensitive to term frequency
- Requires little prior domain knowledge; relatively little adaptation for new domains
- Quite significant improvements for term extraction
- Much better performance for new term extraction
Future work:
- Improving the overall term extraction algorithm
- Applying the approach to related NLP tasks such as NER
- Applying the approach to other languages