detecting erroneous sentences using automatically mined sequential patterns
Detecting Erroneous Sentences using Automatically Mined Sequential Patterns
Advisor: Hsin-Hsi Chen
Reporter: Chi-Hsin Yu
Date: 2007.12.04
Outline
  Introduction
  Related Work
  Proposed Technique
  Experimental Evaluation
  Conclusions and Future Work
Introduction
Summary
  Problem: identifying erroneous/correct sentences
  Algorithm: classification (SVM, NB)
  Approach: sequential patterns (data mining)
Applications
  Providing feedback for writers of English as a Second Language (ESL)
  Controlling the quality of parallel bilingual sentences mined from the Web
  Evaluating MT results
Introduction (cont.)
Common mistakes made by ESL learners (Yukio et al., 2001; Gui and Yang, 2003)
  spelling, verb formation, lexical collocation, tense, agreement, wrong Part-Of-Speech (POS), article usage, sentence structure (grammar structure)
Example
  “If Maggie will go to supermarket, she will buy a bag for you.”
  The pattern: “if...will...will” (would)
N-grams
  consider only continuous sequences of words
  very expensive if N > 3
Related Work
Category 1: the use of hand-crafted rules
  Heidorn, 2000; Michaud et al., 2000; Bender et al., 2004
Difficulties
  Rules are expensive to write manually
  It is difficult to produce and maintain a large number of non-conflicting rules that cover a wide range of grammatical errors
  Learners with different first-language backgrounds and skill levels make different errors
  Rules are hard to write for some grammatical errors
Related Work (cont.)
Category 2: statistical approaches
  Chodorow and Leacock, 2000; Izumi et al., 2003; Brockett et al., 2006; Nagata et al., 2006
Problems
  They focus on some pre-defined errors
  The reported results are not attractive
  Errors must be specified and tagged in the training sentences
  Parallel tagged data is needed
Proposed Technique
Classification model
  Using SVM (light SVM)
Features
  Labeled Sequential Patterns (LSP): 1 feature
  Complementary features
    Lexical Collocation (LC): 3 features
    Perplexity from Language Model (PLM): 2 features
    Syntactic Score (SC): 1 feature
    Function Word Density (FWD): 5 features
Proposed Technique —LSP (1)
A labeled sequential pattern (LSP), p, has the form <LHS, c>
  LHS is a sequence <a_1, ..., a_m>; each a_i is called an “item”
  c is a class label (here, correct or incorrect)
Sequence database D
  The collection of LSPs
Proposed Technique —LSP (2)
“Contain” relation (subsequence)
  A sequence s1 = <a_1, ..., a_m> is contained in a sequence s2 = <b_1, ..., b_n> if there exist integers i_1, ..., i_m such that 1 <= i_1 < i_2 < ... < i_m <= n and a_j = b_{i_j} for all j in {1, ..., m}.
  Example: A = <a, b, c, d, e, f, g, h> has the subsequence B = <b, d, e, g>, so A contains B.
An LSP p1 is contained by p2 if the sequence p1.LHS is contained by p2.LHS and p1.c = p2.c.
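The “contain” relation is an order-preserving subsequence test that allows gaps. A minimal sketch in Python follows; the function names `is_contained` and `lsp_contained` are illustrative and do not come from the slides.

```python
def is_contained(s1, s2):
    """Return True if s1 is a subsequence of s2 (order kept, gaps allowed)."""
    remaining = iter(s2)
    # "item in remaining" advances the iterator, so each match must occur
    # strictly after the previous one.
    return all(item in remaining for item in s1)


def lsp_contained(p1, p2):
    """An LSP is a pair (LHS, label); p1 is contained by p2 if the labels
    agree and p1's LHS is a subsequence of p2's LHS."""
    lhs1, label1 = p1
    lhs2, label2 = p2
    return label1 == label2 and is_contained(lhs1, lhs2)


print(is_contained(list("bdeg"), list("abcdefgh")))  # True
print(is_contained(list("ge"), list("abcdefgh")))    # False: order matters
```

The iterator idiom keeps the check linear in the length of the longer sequence.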
Proposed Technique —LSP (3)
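An LSP p is attached with two measures, support and confidence.
  The support of p (the generality of the pattern p)
    denoted by sup(p)
    the percentage of tuples in database D that contain the LSP p
  The confidence of p (the predictive ability of p)
    denoted by conf(p)
    computed as conf(p) = sup(p) / sup(p.LHS), where sup(p.LHS) is the percentage of tuples in D whose sequence contains p.LHS, regardless of label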
Proposed Technique —LSP (4)
Example database
  t1 = (<a, d, e, f>, E)
  t2 = (<a, f, e, f>, E)
  t3 = (<d, a, f>, C)
LSP p1 = (<a, e, f>, E)
  is contained in t1 and t2
  sup(p1) = 2/3 = 66.7%, conf(p1) = (2/3)/(2/3) = 100%
LSP p2 = (<a, f>, E)
  is contained in t1 and t2; its LHS <a, f> also occurs in t3, whose label is C
  sup(p2) = 2/3 = 66.7%, conf(p2) = (2/3)/(3/3) = 66.7%
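The example can be checked mechanically. The sketch below assumes that a tuple contains an LSP only when its label matches and its sequence contains the LHS, and that confidence is the support of the LSP divided by the support of its LHS alone. Under that reading sup(p2) comes out as 2/3, because t3 carries label C. The function names are illustrative.

```python
from fractions import Fraction


def is_contained(s1, s2):
    """Subsequence test: order kept, gaps allowed."""
    remaining = iter(s2)
    return all(item in remaining for item in s1)


def support(lhs, label, database):
    """Fraction of tuples whose label matches and whose sequence contains lhs."""
    hits = sum(1 for seq, c in database if c == label and is_contained(lhs, seq))
    return Fraction(hits, len(database))


def lhs_support(lhs, database):
    """Fraction of tuples whose sequence contains lhs, ignoring the label."""
    hits = sum(1 for seq, _ in database if is_contained(lhs, seq))
    return Fraction(hits, len(database))


def confidence(lhs, label, database):
    denominator = lhs_support(lhs, database)
    return support(lhs, label, database) / denominator if denominator else Fraction(0)


D = [(list("adef"), "E"), (list("afef"), "E"), (list("daf"), "C")]
print(support(list("aef"), "E", D), confidence(list("aef"), "E", D))  # 2/3 1
print(support(list("af"), "E", D), confidence(list("af"), "E", D))    # 2/3 2/3
```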
Proposed Technique —LSP (5) Generating Sequence Database
Applying a Part-Of-Speech (POS) tagger to tag each training sentence
  MXPOST, the Maximum Entropy Part of Speech Tagger Toolkit, for POS tags
  keeping function words and time words
  each sentence together with its label becomes a database tuple
  “In the past, John was kind to his sister” becomes “In the past, NNP was JJ to his NN”
LSP examples
  (<a, NNS>, Error), NNS: plural noun
  (<yesterday, is>, Error)
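The conversion from a tagged sentence to a database sequence can be sketched as follows. The input is assumed to be already tagged as (word, tag) pairs, so no tagger is called. The two word lists are hypothetical stand-ins, since the slides do not give the actual function-word and time-word lists. Keeping punctuation as-is is also an assumption, read off the comma in the slide's example.

```python
# Hypothetical word lists for illustration only.
FUNCTION_WORDS = {"in", "the", "was", "is", "to", "his", "a", "of", "this"}
TIME_WORDS = {"past", "yesterday", "tomorrow"}


def to_sequence(tagged_sentence):
    """Keep function words, time words and punctuation; replace every other
    word by its POS tag."""
    items = []
    for word, tag in tagged_sentence:
        lower = word.lower()
        is_punctuation = not word[0].isalnum()
        if lower in FUNCTION_WORDS or lower in TIME_WORDS or is_punctuation:
            items.append(lower)
        else:
            items.append(tag)
    return items


# Tags written by hand here, not produced by MXPOST.
tagged = [("In", "IN"), ("the", "DT"), ("past", "NN"), (",", ","),
          ("John", "NNP"), ("was", "VBD"), ("kind", "JJ"), ("to", "TO"),
          ("his", "PRP$"), ("sister", "NN")]

# A database tuple pairs the sequence with the sentence label.
database_tuple = (to_sequence(tagged), "Correct")
print(database_tuple[0])
# ['in', 'the', 'past', ',', 'NNP', 'was', 'JJ', 'to', 'his', 'NN']
```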
Proposed Technique —LSP (6)
Mining LSPs
  adapting the frequent sequence mining algorithm in (Pei et al., 2001)
  setting minimum support at 0.1% and minimum confidence at 75%
Converting LSPs to features
  the corresponding feature is set to 1 if a sentence includes an LSP
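To make the two steps concrete, here is a brute-force sketch. It is not the algorithm of (Pei et al., 2001): it simply enumerates every subsequence up to a hypothetical `max_len` and keeps those that pass both thresholds, which is only feasible on toy data. The feature step assumes one binary feature per mined pattern. The thresholds in the usage lines are chosen for a three-tuple database and are not the 0.1% and 75% used in the talk.

```python
from itertools import combinations


def is_contained(s1, s2):
    """Subsequence test: order kept, gaps allowed."""
    remaining = iter(s2)
    return all(item in remaining for item in s1)


def mine_lsps(database, min_sup, min_conf, max_len=3):
    """Brute-force stand-in for a frequent-sequence miner."""
    n = len(database)
    candidates = set()
    for seq, _ in database:
        for k in range(1, max_len + 1):
            for positions in combinations(range(len(seq)), k):
                candidates.add(tuple(seq[i] for i in positions))
    labels = sorted({label for _, label in database})
    patterns = []
    for lhs in sorted(candidates):
        # Labels of all tuples whose sequence contains this LHS.
        covering = [label for seq, label in database if is_contained(lhs, seq)]
        for label in labels:
            sup = covering.count(label) / n
            conf = covering.count(label) / len(covering)
            if sup >= min_sup and conf >= min_conf:
                patterns.append((lhs, label))
    return patterns


def lsp_features(sequence, patterns):
    """One binary feature per pattern: 1 if the sequence contains its LHS."""
    return [1 if is_contained(lhs, sequence) else 0 for lhs, _ in patterns]


D = [(list("adef"), "E"), (list("afef"), "E"), (list("daf"), "C")]
patterns = mine_lsps(D, min_sup=0.5, min_conf=0.75)
print(patterns)
print(lsp_features(list("aef"), patterns))
```

On real data the candidate set grows exponentially with sequence length, which is why a dedicated sequence-mining algorithm is used instead.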
Proposed Technique —LSP (7)
LSPs for erroneous sentences
  “<this, NNS>” (“this books is stolen.”)
  “<past, is>” (“in the past, John is kind to his sister.”)
  “<one, of, NN>” (“it is one of important working language”)
  “<although, but>” (“although he likes it, but he can’t buy it.”)
  “<only, if, I, am>” (“only if my teacher has given permission, I am allowed to enter this room.”)
LSPs for correct sentences
  “<would, VB>” (“he would buy it.”)
  “<VBD, yesterday>” (“I bought this book yesterday.”)
Proposed Technique —Other Linguistic Features (1)
Lexical Collocation (LC)
  Lexical collocation (“strong tea”/濃茶, not “powerful tea”)
  Collecting five types of collocations from a general English corpus
    verb-object, adjective-noun, verb-adverb, subject-verb, and preposition-object
  Correct LCs
    extracted as collocations of high frequency
  Erroneous LC candidates
    generated by replacing a word in a correct collocation with its confusion words, obtained from WordNet
    checked by experts to see whether a candidate is a true erroneous collocation
Proposed Technique —Other Linguistic Features (2)
Computing three LC features for each sentence
  (1) (p(co_1) + ... + p(co_m)) / n, the summed probability of the correct LCs (CLs) divided by the number of collocations, where
    m is the number of CLs
    n is the number of collocations in each sentence
    the probability p(co_i) of each CL co_i is calculated using the method of (Lu and Zhou, 2004)
  (2) the ratio of the number of unknown collocations (neither correct LCs nor erroneous LCs) to the number of collocations in each sentence
  (3) the ratio of the number of erroneous LCs to the number of collocations in each sentence
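A small sketch of the three LC features for one sentence. Features (2) and (3) follow the ratios stated above. Feature (1) is taken here as the sum of p(co_i) over the correct collocations divided by n, which is an assumption read off the definitions of m and n. The probabilities and the example collocations are made up for illustration.

```python
def lc_features(collocations, correct_prob, erroneous):
    """Return (feature 1, feature 2, feature 3) for one sentence.

    collocations: collocations found in the sentence
    correct_prob: dict mapping each known correct LC to a probability
                  (hypothetical values, not from a real model)
    erroneous:    set of known erroneous LCs
    """
    n = len(collocations)
    if n == 0:
        return (0.0, 0.0, 0.0)
    correct = [co for co in collocations if co in correct_prob]
    wrong = [co for co in collocations
             if co in erroneous and co not in correct_prob]
    unknown = n - len(correct) - len(wrong)
    # Feature (1): assumed form, sum of p(co_i) over correct LCs, divided by n.
    f1 = sum(correct_prob[co] for co in correct) / n
    return (f1, unknown / n, len(wrong) / n)


sentence_collocations = [("strong", "tea"), ("powerful", "tea"),
                         ("drink", "quickly"), ("eat", "soup")]
correct_prob = {("strong", "tea"): 0.5}
erroneous = {("powerful", "tea")}
print(lc_features(sentence_collocations, correct_prob, erroneous))
# (0.125, 0.5, 0.25)
```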
Proposed Technique —Other Linguistic Features (3)
Perplexity from Language Model (PLM)
  extracted from a trigram language model
  Using the SRILM (SRI Language Modeling) Toolkit (Stolcke, 2002)
  Calculating two values for each sentence
    lexicalized trigram perplexity
    POS trigram perplexity
  Erroneous sentences are expected to have higher perplexity
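Perplexity is the exponential of the negative average log probability per token. The sketch below uses that textbook definition with made-up trigram probabilities. It does not call SRILM or reproduce its counting conventions.

```python
import math


def perplexity(log_probs):
    """Perplexity from natural-log probabilities, one per token, each being
    log p(token | two preceding tokens) under some trigram model."""
    return math.exp(-sum(log_probs) / len(log_probs))


# Hypothetical per-token probabilities for a fluent and an odd sentence.
fluent = [math.log(0.5)] * 4
odd = [math.log(0.1)] * 4
print(perplexity(fluent) < perplexity(odd))  # True
```

The same function applies to both values in the talk: feed it word-trigram probabilities for the lexicalized perplexity and tag-trigram probabilities for the POS perplexity.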
Proposed Technique —Other Linguistic Features (4)
Syntactic Score (SC)
  using a statistical parser toolkit (Collins, 1997)
  assigning each sentence a parser’s score: the related log probability of parsing
  Assumption: erroneous sentences with undesirable sentence structures are more likely to receive lower scores
Proposed Technique —Other Linguistic Features (5)
Function Word Density (FWD)
  the ratio of function words to content words
  inspired by the work of (Corston-Oliver et al., 2001), where it was effective at distinguishing human references from machine outputs
  seven kinds of function words
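The basic ratio can be sketched as below. The function-word list is a hypothetical stand-in. The slides do not say which seven kinds of function words are used or how they yield five features, so only a single overall density is shown.

```python
# Hypothetical function-word list for illustration only.
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "is", "was",
                  "he", "it", "his", "for"}


def function_word_density(tokens):
    """Ratio of function words to content words in one sentence."""
    words = [t.lower() for t in tokens if t.isalpha()]
    function_count = sum(1 for w in words if w in FUNCTION_WORDS)
    content_count = len(words) - function_count
    # Guard against division by zero when a sentence has no content words.
    return function_count / max(content_count, 1)


tokens = "In the past , John was kind to his sister".split()
print(function_word_density(tokens))  # 1.25
```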
Experimental Evaluation (1) – Experimental setup
  Classification model: SVM
  For a non-binary feature X, its value x is normalized by z-score
  Two data sets: Japanese Corpus (JC) and Chinese Corpus (CC)
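Z-score normalization subtracts the feature's mean and divides by its standard deviation. A minimal sketch follows; using the population standard deviation is an assumption, since the slides do not say which variant was used.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Normalize one non-binary feature column to zero mean, unit variance."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumed)
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


column = [2.0, 4.0, 6.0]
print(z_scores(column))
```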
Experimental Evaluation (2)
Experimental Evaluation (3)
ALEK (Chodorow and Leacock, 2000) from Educational Testing Service (ETS)
Different cultures (Japanese/Chinese as first language)
694 parallel sentences, 1671 non-parallel sentences
Experimental Evaluation (4)
Two LDC data sets: low-ranked and high-ranked data
  14,604 low-ranked (score 1-3) MTs
  808 high-ranked (score 3-5) MTs
  Both with corresponding human reference translations
  Human references (correct), MT (erroneous)
Conclusions and Future Work
Conclusions
  This paper proposed mining LSPs as the input of classification models.
  LSPs were shown to be much more effective than the other linguistic features.
  Other features were also beneficial.
Future work
  To use LSPs to provide detailed feedback for ESL learners
  To integrate the features effectively
  To further investigate the application to MT evaluation
Thanks!!