You’re Not From ‘Round Here, Are You? Naïve Bayes Detection of Non-native Utterance Text

Laura Mayfield Tomokiyo

Rosie Jones

Carnegie Mellon University

Overview

•Motivation
•Speech data
•Accent detection as document classification
•Classification performance
•Discriminative tokens
•Conclusions

Non-native speech recognition

The warship U.S.S. Jarrett has pulled into port in San Diego, CA after training voyage

Native recognizer (word accuracy = 26.7):

Tomorrow CPU a sister at has spilled into port and sandy and afford after a training wage

Non-native recognizer (word accuracy = 73.3):

The worst eighty U.S.S. chart has pulled into port in San Diego California after training warrior
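The word accuracy figures above compare each recognizer hypothesis against the reference sentence. A minimal sketch of one common definition follows, assuming word accuracy is 100 × (N − errors) / N, where N is the number of reference words and errors is the word-level edit distance (substitutions, deletions, insertions). The function name `word_accuracy` and the whitespace tokenization are assumptions for illustration; the slides do not state the scoring tool or text normalization used, so this sketch may not reproduce the reported 26.7 and 73.3 exactly.

```python
def word_accuracy(reference: str, hypothesis: str) -> float:
    """Word accuracy in percent: 100 * (N - S - D - I) / N.

    N is the number of reference words; S + D + I is the minimum number
    of substitutions, deletions, and insertions (word-level edit distance).
    Tokenization is plain lowercased whitespace splitting (an assumption).
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr

    errors = prev[-1]
    return 100.0 * (len(ref) - errors) / len(ref)
```

Because insertions count as errors, this measure can fall below zero for very long hypotheses.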

Motivation

•Practical
  •Can we detect non-native users with enough accuracy to switch acoustic models?
•Exploratory
  •How well does an algorithm based only on text features work?
  •What tokens are discriminative for non-native speakers?

Speech examples

Read speech

Over the next two months, public officials, Native American leaders, businesses and environmental groups will come up with plans for meeting the law’s requirements.

Spontaneous speech

I like to have anything very special in Boston, very native in Boston.

Local specialties

Speech data

                 Read speech                                           Spontaneous speech
Native language  Speaker count  Utterance count  Word count (types)   Speaker count  Utterance count  Word count (types)
Japanese         10             957              15868 (3195)         31             1685             15934 (826)
English          8              756              10237 (2073)         6              320              4117 (418)
Mandarin         ---            ---              ---                  6              374              3490 (391)

Transcripts and hypotheses

Source text:

“A safety net for salmon: environmentalists, the government, and ordinary folks team up to save the Northwest’s wondrous wild salmon”

Transcript:

A safety net for the salmons
Environment= environmentalists…

Hypothesis:

A safety net forced simon
Um environmental activists…

Classification based on transcripts:

•Usually gives a good idea of gold standard
•Finds true differences in linguistic usage

Classification based on hypotheses:

•Implicitly models acoustics
•Benefits from amplified difference between native and non-native samples

Related work

Acoustic feature based accent discrimination (e.g. Fung and Liu 1999)

Competing HMM based accent discrimination (e.g. Teixeira et al. 1996)

Classification of documents according to style (Argamon-Engelson et al. 1998), author (Mosteller and Wallace 1964)

Accent detection as document classification

Training: native speaker utterances + non-native speaker utterances → classifier

Testing: test speaker utterances → classifier → classification decision: native or non-native?
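The train-then-classify flow above can be sketched as a small multinomial naïve Bayes classifier over word unigrams and bigrams. This is an illustrative sketch only: the experiments in this talk used the Rainbow toolkit, not this code. The class `NaiveBayes`, the helper `ngram_tokens`, add-one smoothing, and the toy utterances are all assumptions made for the example.

```python
import math
from collections import Counter


def ngram_tokens(utterance):
    """Return word unigrams plus bigrams (joined with ';') for one utterance."""
    words = utterance.lower().split()
    bigrams = [f"{a};{b}" for a, b in zip(words, words[1:])]
    return words + bigrams


class NaiveBayes:
    """Multinomial naive Bayes with add-alpha smoothing (hypothetical sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, utterances, labels):
        self.doc_counts = Counter(labels)   # utterances per class (for priors)
        self.token_counts = {}              # class -> Counter of token counts
        for utterance, label in zip(utterances, labels):
            counts = self.token_counts.setdefault(label, Counter())
            counts.update(ngram_tokens(utterance))
        self.vocab = set().union(*self.token_counts.values())
        return self

    def log_scores(self, utterance):
        """Log of prior times token likelihoods, per class."""
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.token_counts.items():
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(self.doc_counts[label] / total_docs)
            for token in ngram_tokens(utterance):
                if token in self.vocab:     # ignore tokens never seen in training
                    score += math.log((counts[token] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, utterance):
        scores = self.log_scores(utterance)
        return max(scores, key=scores.get)


# Toy training data, invented for illustration (not from the corpus).
train_utterances = [
    "the salmon habitats are protected",
    "officials will come up with plans",
    "the the salmon habitat is in in danger",
    "I like to have anything very special",
]
train_labels = ["native", "native", "non-native", "non-native"]

classifier = NaiveBayes().fit(train_utterances, train_labels)
```

Calling `classifier.predict(text)` returns the class with the highest log score. To classify a speaker rather than a single utterance, one option is to sum the per-utterance log scores over all of that speaker's utterances.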

Experimental methodology

•Rainbow naïve Bayes classifier
•Both word and part-of-speech tokens were examined
•Classification based on token unigrams and bigrams
•No feature selection initially
•Stopwords were not excluded from feature set
•Data randomly split into 30% testing, 70% training data for evaluation; evaluation repeated 20 times and classification results averaged
•Utterances from the same speaker never appeared in both training and test sets
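The evaluation protocol above (random 30%/70% split, repeated 20 times, no speaker shared between training and test) can be sketched as follows. The sketch assumes the split is drawn over speakers, which is one way to guarantee the no-shared-speaker constraint; the slides do not say exactly how the split was drawn. The names `speaker_disjoint_split` and `repeated_evaluation`, and the `(speaker_id, label, text)` tuple layout, are hypothetical.

```python
import random


def speaker_disjoint_split(utterances, test_fraction=0.3, seed=None):
    """Split (speaker_id, label, text) tuples so no speaker is in both sets.

    Assumption: the split is made over speakers, so the share of test
    utterances is only approximately `test_fraction`.
    """
    rng = random.Random(seed)
    speakers = sorted({speaker for speaker, _, _ in utterances})
    rng.shuffle(speakers)
    n_test = max(1, round(test_fraction * len(speakers)))
    test_speakers = set(speakers[:n_test])
    train = [u for u in utterances if u[0] not in test_speakers]
    test = [u for u in utterances if u[0] in test_speakers]
    return train, test


def repeated_evaluation(utterances, evaluate, repeats=20, seed=0):
    """Average `evaluate(train, test)` over several random speaker splits."""
    accuracies = []
    for r in range(repeats):
        train, test = speaker_disjoint_split(utterances, seed=seed + r)
        accuracies.append(evaluate(train, test))
    return sum(accuracies) / len(accuracies)
```

Here `evaluate` is any caller-supplied function that trains on `train` and returns an accuracy on `test`. The sketch does not stratify by class, so with few speakers a split could leave one class out of the test set; a fuller version would shuffle native and non-native speakers separately.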

Classification of spontaneous speech (transcripts only)

[Figure: bar chart of classification accuracy (0-100) for Baseline, Word, POS, and POSNoun on five tasks: Native/Japanese, Native/Chinese, Japanese/Chinese, Native/Non-native, Native/Japanese/Chinese]

Classification of read speech

[Figure: bar chart of classification accuracy (0-100) for trans-word, trans-POS, hypo-word, and hypo-POS against a baseline, under conditions A-D]

A train: same texts, test: same texts
B train: disjoint texts, test: disjoint texts
C train: disjoint texts, test: same texts
D train: same texts

Feature Selection

Method             Number of features  Accuracy (%)
None               4087                47
IG-524             524                 69
SMART-524          524                 88
IG-200             200                 74
SMART-524, IG-200  200                 88
IG-70              70                  70
M&W-70             70                  87
IG-48              48                  74
SMART-48           48                  84
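The IG rows in the table select features by information gain. A minimal sketch of one standard formulation follows, assuming IG is computed per token from its presence or absence in each utterance: IG(t) = H(C) − P(t)·H(C | t present) − P(¬t)·H(C | t absent). The function names and toy data are hypothetical, and Rainbow's exact computation may differ.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels):
    """Return {token: information gain in bits}.

    docs is a list of token lists; a token counts as present in a
    document if it occurs at least once (an assumption of this sketch).
    """
    classes = sorted(set(labels))
    n_docs = len(docs)
    class_totals = Counter(labels)

    # For each token, count the documents of each class that contain it.
    present = {}
    for tokens, label in zip(docs, labels):
        for token in set(tokens):
            present.setdefault(token, Counter())[label] += 1

    h_class = entropy([class_totals[c] for c in classes])
    gains = {}
    for token, with_token in present.items():
        n_with = sum(with_token.values())
        n_without = n_docs - n_with
        h_with = entropy([with_token[c] for c in classes])
        h_without = entropy([class_totals[c] - with_token[c] for c in classes])
        gains[token] = (h_class
                        - (n_with / n_docs) * h_with
                        - (n_without / n_docs) * h_without)
    return gains


def top_k_features(docs, labels, k):
    """Keep the k tokens with the highest information gain (ties by name)."""
    gains = information_gain(docs, labels)
    return sorted(gains, key=lambda t: (-gains[t], t))[:k]
```

A setting such as IG-200 would then correspond to `top_k_features(docs, labels, 200)`, with the classifier trained only on the retained tokens.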

Discriminative sequences

Speech type  Token type  Source          Native            Non-native
Read         Word        transcriptions  NMFS              the + the
Read         Word        hypotheses      the               that
Read         POS         transcriptions  noun(pl)          noun(sing)
Read         POS         hypotheses      noun(pl)          verb(past)
Spontaneous  Word        transcriptions  Wonderland        the
Spontaneous  POS         transcriptions  TO + verb(base)   noun(sing)
Spontaneous  POSNoun     transcriptions  am                noun(sing)

Conclusions

Transcriptions of spontaneous speech can be classified with high accuracy for both 2-way and 3-way distinctions

Read speech samples, which are simple transformations of native-produced text, can be classified with high accuracy

Recognizer output is classified more accurately than transcripts

Future directions

Incorporating the classification decision in acoustic model selection

Minimizing the number of samples from the test speaker needed for classification

Applying classification to parsing grammar selection, language model construction, writer identification

Discriminative POS sequences

Native                Non-native
Noun(pl)              Noun(sing)
Determiner            Preposition
Noun(pl);preposition  Preposition;preposition
Adjective;noun(Pl)    Noun(sing);noun(sing)
Gerund;particle       Particle;preposition
Noun(s);verb(3s)      Cardinal#;cardinal#
Noun(pl);modal        Verb(past)

Discriminative word sequences

Native             Non-native
NMFS               the;the
the;NMFS           in;in
nineteen;hundreds  the
hundreds;now       in
hundreds           that
habitats;and       habitat;and

Phone-based classification

[Figure: bar chart of classification accuracy (0-100) for Words and Phones, each with two token types: Identity and POS/Phone class]

Discriminative tokens (Condition B)

                Native  Non-native
Phone identity  //      /I/
Phone class     CCC     V
