
Detecting Erroneous Sentences using Automatically Mined Sequential Patterns

Advisor: Hsin-Hsi Chen
Reporter: Chi-Hsin Yu
Date: 2007.12.04

Outline

- Introduction
- Related Work
- Proposed Technique
- Experimental Evaluation
- Conclusions and Future Work

Introduction

Summary
- Problem: identifying erroneous/correct sentences
- Algorithm: classification (SVM, NB)
- Approach: sequential patterns (data mining)

Applications
- Providing feedback for writers of English as a Second Language (ESL)
- Controlling the quality of parallel bilingual sentences mined from the Web
- Evaluating MT results

Introduction (cont.)

Common mistakes made by ESL learners (Yukio et al., 2001; Gui and Yang, 2003)
- spelling, verb formation
- lexical collocation, tense, agreement, wrong Part-Of-Speech (POS), article usage
- sentence structure (grammar structure)

Example
- “If Maggie will go to supermarket, she will buy a bag for you.”
- The pattern: “if...will...will” (would)
- N-grams consider only continuous sequences of words and become very expensive if N > 3

Related Work

Category 1: the use of hand-crafted rules
- Heidorn, 2000; Michaud et al., 2000; Bender et al., 2004
- Difficulties
  - Writing rules manually is expensive
  - It is difficult to produce and maintain a large number of non-conflicting rules that cover a wide range of grammatical errors
  - Learners with different first-language backgrounds and skill levels make different errors
  - Some grammatical errors are hard to write rules for

Related Work (cont.)

Category 2: statistical approaches
- Chodorow and Leacock, 2000; Izumi et al., 2003; Brockett et al., 2006; Nagata et al., 2006
- Problems
  - They focus on some pre-defined errors
  - The reported results are not attractive
  - Errors need to be specified and tagged in the training sentences
  - Parallel tagged data is needed

Proposed Technique

Classification model
- Using SVM (light SVM)

Features
- Labeled Sequential Patterns (LSP) – 1 feature
- Complementary features
  - Lexical Collocation (LC) – 3 features
  - Perplexity from Language Model (PLM) – 2 features
  - Syntactic Score (SC) – 1 feature
  - Function Word Density (FWD) – 5 features

Proposed Technique —LSP (1)

A labeled sequential pattern (LSP), p, has the form <LHS, c>
- LHS is a sequence <a1, ..., am>; each ai is called an “item”
- c is a class label (correct/incorrect here)

Sequence database D
- The collection of LSPs


“Contain” relation (subsequence)
- A sequence s1 = <a1, ..., am> is contained in a sequence s2 = <b1, ..., bn> if there exist integers i1, ..., im such that 1 <= i1 < i2 < ... < im <= n and aj = b_ij for all j in {1, ..., m}.
- Example: A = <abcdefgh> has a subsequence B = <bdeg>, so A contains B.
- An LSP p1 is contained by p2 if the sequence p1.LHS is contained by p2.LHS and p1.c = p2.c.
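
The “contain” relation can be sketched in a few lines of Python. This is only an illustration of the definition; the function names `is_contained` and `lsp_contained` and the tuple layout `(LHS, label)` are assumptions made for the example, not names from the paper.

```python
def is_contained(s1, s2):
    """Return True if s1 is a (not necessarily contiguous) subsequence of s2."""
    it = iter(s2)
    # `item in it` consumes the iterator up to the first match,
    # so later items of s1 can only match later positions of s2.
    return all(item in it for item in s1)


def lsp_contained(p1, p2):
    """An LSP (LHS, label) is contained by another if the labels agree
    and its LHS is a subsequence of the other LHS."""
    (lhs1, c1), (lhs2, c2) = p1, p2
    return c1 == c2 and is_contained(lhs1, lhs2)


print(is_contained("bdeg", "abcdefgh"))  # True: A contains B
print(is_contained("ba", "abcdefgh"))    # False: order matters
```

The single pass over `s2` keeps the check linear in the length of the longer sequence.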


An LSP p has two associated measures, support and confidence.
- The support of p (the generality of the pattern p)
  - Denoted by sup(p)
  - The percentage of tuples in database D that contain the LSP p
- The confidence of p (the predictive ability of p)
  - Denoted by conf(p)
  - Computed as conf(p) = sup(p) / sup(p.LHS)


Example
- t1 = (<a, d, e, f>, E)
- t2 = (<a, f, e, f>, E)
- t3 = (<d, a, f>, C)
- One example LSP p1 = (<a, e, f>, E) is contained in t1 and t2
  - sup(p1) = 2/3 = 66.7%, conf(p1) = (2/3)/(2/3) = 100%
- LSP p2 = (<a, f>, E) is contained in t1 and t2, while its LHS <a, f> is contained in all three tuples
  - sup(p2) = 2/3 = 66.7%, conf(p2) = (2/3)/(3/3) = 66.7%
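
The example can be checked with a short Python sketch. It assumes that support counts tuples that contain the LHS and carry the same label, and that confidence divides this by the share of tuples containing the LHS under any label. The helper names (`support`, `lhs_support`, `confidence`) are made up for the illustration.

```python
def is_contained(s1, s2):
    """True if s1 is a (not necessarily contiguous) subsequence of s2."""
    it = iter(s2)
    return all(item in it for item in s1)


def support(lhs, label, database):
    """Fraction of tuples that contain the LHS and have the same label."""
    hits = sum(1 for seq, c in database if c == label and is_contained(lhs, seq))
    return hits / len(database)


def lhs_support(lhs, database):
    """Fraction of tuples that contain the LHS, regardless of label."""
    hits = sum(1 for seq, _ in database if is_contained(lhs, seq))
    return hits / len(database)


def confidence(lhs, label, database):
    """sup(p) / sup(p.LHS); defined as 0.0 when the LHS never occurs."""
    denom = lhs_support(lhs, database)
    return support(lhs, label, database) / denom if denom else 0.0


# The three tuples from the slide: E = erroneous, C = correct.
D = [
    (("a", "d", "e", "f"), "E"),
    (("a", "f", "e", "f"), "E"),
    (("d", "a", "f"), "C"),
]

for lhs in [("a", "e", "f"), ("a", "f")]:
    print(lhs, round(support(lhs, "E", D), 3), round(confidence(lhs, "E", D), 3))
```

For p2, the LHS <a, f> also occurs in the correct tuple t3, which is what pulls its confidence below 100%.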

Proposed Technique —LSP (5)

Generating Sequence Database
- Applying a Part-Of-Speech (POS) tagger to tag each training sentence
  - MXPOST (Maximum Entropy Part of Speech Tagger Toolkit) for POS tags
- Keeping function words and time words; other words are replaced by their POS tags
- Each sentence together with its label becomes a database tuple
  - “In the past, John was kind to his sister”
  - “In the past, NNP was JJ to his NN”

LSP Examples
- (<a, NNS>, Error), NNS: plural noun
- (<yesterday, is>, Error)
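
A minimal sketch of the tuple-generation step, assuming the sentence has already been POS-tagged (the tagger itself is not called here). The `KEEP_WORDS` set is a tiny hypothetical stand-in for the real function-word and time-word lists, and `to_items` is an illustrative name.

```python
# Hypothetical, deliberately tiny list of function words and time words.
KEEP_WORDS = {"in", "the", "past", "was", "to", "his"}


def to_items(tagged_sentence, keep_words=KEEP_WORDS):
    """Map (word, POS) pairs to items: kept words and punctuation stay
    as surface forms, every other word is replaced by its POS tag."""
    items = []
    for word, tag in tagged_sentence:
        is_punct = not any(ch.isalnum() for ch in word)
        if word.lower() in keep_words or is_punct:
            items.append(word)
        else:
            items.append(tag)
    return items


# Tags written by hand for the example sentence (Penn Treebank style).
tagged = [
    ("In", "IN"), ("the", "DT"), ("past", "NN"), (",", ","),
    ("John", "NNP"), ("was", "VBD"), ("kind", "JJ"),
    ("to", "TO"), ("his", "PRP$"), ("sister", "NN"),
]

database_tuple = (to_items(tagged), "C")  # sentence items plus its label
print(" ".join(database_tuple[0]))
```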


Mining LSPs
- Adapting the frequent sequence mining algorithm in (Pei et al., 2001)
- Setting minimum support at 0.1% and minimum confidence at 75%

Converting LSPs to Features
- The corresponding feature is set to 1 if a sentence includes an LSP


LSPs for erroneous sentences
- “<this, NNS>” (“this books is stolen.”)
- “<past, is>” (“in the past, John is kind to his sister.”)
- “<one, of, NN>” (“it is one of important working language”)
- “<although, but>” (“although he likes it, but he can’t buy it.”)
- “<only, if, I, am>” (“only if my teacher has given permission, I am allowed to enter this room.”)

LSPs for correct sentences
- “<would, VB>” (“he would buy it.”)
- “<VBD, yesterday>” (“I bought this book yesterday.”)
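
Turning mined LSPs into classifier features can be sketched as follows, assuming one binary feature per mined pattern. The item sequence for the sentence is written by hand here, and `lsp_features` is an illustrative name.

```python
def is_contained(s1, s2):
    """True if s1 is a (not necessarily contiguous) subsequence of s2."""
    it = iter(s2)
    return all(item in it for item in s1)


def lsp_features(items, patterns):
    """One binary feature per pattern: 1 if the sentence contains it."""
    return [1 if is_contained(p, items) else 0 for p in patterns]


# Left-hand sides of three LSPs taken from the examples above.
patterns = [("although", "but"), ("past", "is"), ("this", "NNS")]

# Hand-made item sequence for "although he likes it, but he can't buy it."
items = ["although", "he", "VBZ", "it", ",", "but", "he", "can", "not", "VB", "it", "."]

print(lsp_features(items, patterns))  # [1, 0, 0]
```

The resulting vector can be concatenated with the other linguistic features before training the classifier.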

Proposed Technique —Other Linguistic Features (1)

Lexical Collocation (LC)
- Lexical collocation (“strong tea”/濃茶, not “powerful tea”)
- Collecting five types of collocations from a general English corpus
  - verb-object, adjective-noun, verb-adverb, subject-verb, and preposition-object
- Correct LCs
  - Extracted as collocations of high frequency
- Erroneous LC candidates
  - Generated by replacing a word in a correct collocation with its confusion words, obtained from WordNet
  - Checked by experts to see whether a candidate is a true erroneous collocation


Computing three LC features for each sentence
- (1) (p(co1) + ... + p(com)) / n
  - m is the number of CLs (correct LCs)
  - n is the number of collocations in each sentence
  - The probability p(coi) of each CL coi is calculated using the method of (Lu and Zhou, 2004)
- (2) The ratio of the number of unknown collocations (neither correct LCs nor erroneous LCs) to the number of collocations in each sentence
- (3) The ratio of the number of erroneous LCs to the number of collocations in each sentence
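
A rough sketch of the three LC features for one sentence. The form of the first feature (sum of p(coi) over the correct LCs, divided by n) is an assumption inferred from the symbol definitions, and the probabilities below are invented for illustration; the real values would come from the cited collocation model.

```python
def lc_features(collocations, correct_prob, erroneous):
    """Return the three LC features for one sentence.

    collocations: collocations extracted from the sentence
    correct_prob: dict mapping each known correct LC to its probability
    erroneous:    set of known erroneous LCs
    """
    n = len(collocations)
    if n == 0:
        return 0.0, 0.0, 0.0
    # (1) summed probability of the correct LCs, divided by n (assumed form)
    f1 = sum(correct_prob[c] for c in collocations if c in correct_prob) / n
    # (2) share of collocations that are neither correct nor erroneous
    unknown = sum(1 for c in collocations
                  if c not in correct_prob and c not in erroneous)
    # (3) share of collocations that are known to be erroneous
    bad = sum(1 for c in collocations if c in erroneous)
    return f1, unknown / n, bad / n


correct_prob = {("strong", "tea"): 0.8}   # made-up probability
erroneous = {("powerful", "tea")}
sentence_lcs = [("strong", "tea"), ("powerful", "tea"), ("drink", "tea")]

print(lc_features(sentence_lcs, correct_prob, erroneous))
```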


Perplexity from Language Model (PLM)
- Extracted from a trigram language model
- Using the SRILM (SRI Language Modeling) Toolkit (Stolcke, 2002)
- Calculating two values for each sentence:
  - lexicalized trigram perplexity
  - POS trigram perplexity
- Erroneous sentences are expected to have higher perplexity
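
For reference, perplexity can be computed from per-token probabilities as the exponential of the negative mean log probability. The sketch below uses that textbook definition with made-up probabilities; it does not call SRILM, whose own normalization details may differ.

```python
import math


def perplexity(log_probs):
    """Perplexity from natural-log token probabilities:
    exp of the negative average log probability."""
    if not log_probs:
        raise ValueError("need at least one token probability")
    return math.exp(-sum(log_probs) / len(log_probs))


# Made-up trigram probabilities for a fluent and a less fluent sentence.
fluent = [math.log(p) for p in (0.25, 0.25, 0.25, 0.25)]
odd = [math.log(p) for p in (0.25, 0.01, 0.25, 0.02)]

print(perplexity(fluent) < perplexity(odd))  # True
```

The same function applies to both variants: feed it word-trigram probabilities for the lexicalized value and tag-trigram probabilities for the POS value.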


Syntactic Score (SC)
- Using a statistical parser toolkit (Collins, 1997)
- Assigning each sentence the parser’s score
  - the related log probability of parsing
- Assuming that erroneous sentences with undesirable sentence structures are more likely to receive lower scores


Function Word Density (FWD)
- The ratio of function words to content words
- Inspired by the work of (Corston-Oliver et al., 2001)
  - Effective in distinguishing between human references and machine outputs
- Seven kinds of function words
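
The basic ratio can be sketched as below. The `FUNCTION_WORDS` set is a small hypothetical list, and the sketch computes a single overall density; how the seven kinds of function words map onto the five FWD features is not modeled here.

```python
# Hypothetical, deliberately small function-word list.
FUNCTION_WORDS = {"in", "the", "was", "to", "his", "a", "of", "and"}


def function_word_density(tokens, function_words=FUNCTION_WORDS):
    """Ratio of function words to content words in one sentence."""
    n_function = sum(1 for t in tokens if t.lower() in function_words)
    n_content = len(tokens) - n_function
    if n_content == 0:
        return 0.0  # arbitrary choice for sentences with no content words
    return n_function / n_content


tokens = "In the past John was kind to his sister".split()
print(function_word_density(tokens))  # 1.25 (5 function words, 4 content words)
```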

Experimental Evaluation (1) – Experimental setup

Classification model: SVM
- For a non-binary feature X, its value x is normalized by z-score.

Two data sets: Japanese Corpus (JC) and Chinese Corpus (CC)
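
Z-score normalization of a non-binary feature can be sketched with the standard library. Using the population standard deviation (`pstdev`) is an assumption; the slides do not say which variant was used.

```python
from statistics import mean, pstdev


def z_normalize(values):
    """Map each value x to (x - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant feature carries no signal
    return [(x - mu) / sigma for x in values]


# Toy feature column, e.g. one perplexity value per sentence.
feature = [2, 4, 4, 4, 5, 5, 7, 9]
print(z_normalize(feature))
```

In practice the mean and standard deviation would be estimated on the training set and reused for the test set.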

Experimental Evaluation (2)


ALEK (Chodorow and Leacock, 2000) from Educational Testing Service (ETS)

Different cultures (Japanese/Chinese as first language)

694 parallel sentences, 1671 non-parallel sentences


Two LDC data sets, low-ranked and high-ranked
- 14,604 low-ranked (score 1-3) MTs
- 808 high-ranked (score 3-5) MTs
- Both with corresponding human reference translations
- Human references (correct), MT (erroneous)

Conclusions and Future Work

Conclusions
- This paper proposed mining LSPs as the input of classification models.
- LSPs were shown to be much more effective than the other linguistic features.
- Other features were also beneficial.

Future work
- To use LSPs to provide detailed feedback for ESL learners
- To integrate the features effectively
- To further investigate the application for MT evaluation

Thanks!!
