doctoral thesis proposal - columbia universityamaxwell/proposal.pdf · doctoral thesis proposal...

Doctoral Thesis Proposal

Automatic Detection and Classification of ProsodicEvents

Andrew [email protected]

Department of Computer ScienceColumbia University

December 11, 2007

Abstract

Speech prosody is a valuable carrier of information. Accents and phrase boundaries havebeen shown to contribute to syntactic disambiguation, semantic, pragmatic and paralinguis-tic interpretation, and to convey information about topicality, focus, contrast and informationstatus. This thesis will present and evaluate techniques to detect and classify these prosodicevents. The acoustic correlates of accents, phrase boundaries and phrase-final tones will alsobe examined.

Spoken language processing systems have not made widespread use of prosodic informa-tion. We hypothesize that access to this information should improve the performance of manySLP applications. To support this, we will present proof-of-concept examples integrating hy-pothesized prosodic event information into speech synthesis, story segmentation, extractivesummarization, and prosody tutoring applications.

i

Contents1 Introduction 1

2 Corpora 2

3 Automatic Pitch Accent Detection 33.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33.2 Completed Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

3.2.1 Naıve Bayes, Decision Tree and SVM based detection . . . . . . . . . . . 63.2.2 Correlation of Spectral Information . . . . . . . . . . . . . . . . . . . . . 73.2.3 Ensemble-based detection . . . . . . . . . . . . . . . . . . . . . . . . . . 8

3.3 Proposed Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

4 Automatic Phrase Boundary Detection 114.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114.2 Completed Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124.3 Proposed Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

5 Phrase-Final Tone and Pitch Accent Type Classification 145.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145.2 Completed Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165.3 Proposed Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

6 Integrated Prosodic Event Detection 186.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196.2 Preliminary Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196.3 Proposed Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

7 Applications 227.1 Applications of Non-native Prosodic Event Detection . . . . . . . . . . . . . . . . 22

7.1.1 Prosody tutoring system . . . . . . . . . . . . . . . . . . . . . . . . . . . 237.1.2 Accent identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

7.2 Speech Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247.3 Story Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

7.3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257.3.2 Completed Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267.3.3 Proposed Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

7.4 Extractive Summarization of Broadcast News . . . . . . . . . . . . . . . . . . . . 277.4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277.4.2 Completed Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287.4.3 Proposed Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

8 Contribution 29

9 Plan for completion of the Thesis 30

iii

1 IntroductionAutomatic speech recognition (ASR) performance has reached a level of accuracy that is accept-able for commercial deployment. Spoken dialog systems have become commonplace, used bysuch national corporations as Amtrak, Time Warner Cable, and almost every major airline. Thiscertainly marks a sea change in the way people interact with computers, moving us one step closerto the goals of ubiquitous and natural human-machine interaction.

Speech recognition engines transcribe the lexical content of speech; however, speech containsinformation transmitted via prosodic variations in addition to the sequence of spoken words. Thisintonational information is processed so transparently by human listeners, it is often overlookedas a vital communicative component. Some of this variation is so common that isn’t perceived asadding information as much as “sounding natural”, other variation clearly affects the communica-tive effect of the carrier speech. Consider, on one hand, the simple declarative sentence THE BOY

WALKED HOME. In neutral, natural speech it is expected that BOY would be produced with higherpitch, louder and with longer duration than THE. On the other hand, consider the differences inspeech produced when a speaker is tired, angry, hesitant, or excited. The acoustic variations thatcharacterize these speaker states vary along a number of dimensions, not all of which are fullyunderstood, but nonetheless carry rich communicative information.

Accenting, phrase boundaries and phrase-final tones are three dimensions of prosodic variationthat have been heavily studied in many languages. This group of intonational phenomena are col-lectively referred to as prosodic events. The ToBI framework [104] provides a useful taxonomy ofthese phenomena; pitch accents and associated tone labels define categories of accenting behav-ior, degrees of disjuncture or “breaks” define phrase boundaries, and phrase accents and boundarytones are used to label phrase-final pitch behavior. Due to its widespread use, and available anno-tated data, the ToBI standard defines the prosodic events that will be detected and classified in thisthesis.

A wide variety of communicative effects of accenting location and type have been identified.The example below demonstrates contrastive focus, in which lexical items in contrast with existingdiscourse elements are accented to highlight this effect to the hearer.

1. A: IS BOB’S CAR WHITE?B: CHUCK’S CAR IS WHITE.

2. A: IS CHUCK’S CAR BLACK?B: CHUCK’S CAR IS WHITE.

In exchange 1, speaker B accents the word CHUCK to indicate the contrast between BOB’S CAR

and CHUCK’S CAR. On the other hand, in exchange 2, contrast is introduced by the conceptsignified by WHITE, leading speaker B to accent this term. In addition to contrast, accent locationand type have been shown to be indicators of topicality, focus [46, 45, 42, 107], and informationstatus [41, 29, 65].

While there have been a number of hypotheses about the interaction between syntactic andprosodic phrasing (e.g., [24, 91]), the predominant opinion is there is no direct correspondencebetween the two (e.g.,[32]). However, there are instances in which prosodic phrasing can be usedto disambiguate ambiguous syntactic parses [90, 66]. In the example below, the location of thephrase boundary indicates the attachment of the prepositional phrase WITH THE TELESCOPE toeither THE MAN in the first example or SAW in the second.

1

1. I SAW THE MAN — WITH THE TELESCOPE. (I used a telescope to see the man.)

2. I SAW — THE MAN WITH THE TELESCOPE. (The man had a telescope.)

Phrase-final behavior, following the ToBI vocabulary – phrase accent and boundary tone – hasbeen shown to mark speech acts [85, 53] and convey pragmatic information such as hesitancy oruncertainty [63]. The most commonly observed instance of the impact of phrase-final behavioron communication is the case of declarative questions. The declarative statement CHUCK’S CAR

IS WHITE. is realized with falling pitch immediately prior to the phrase boundary occuring afterWHITE. On the other hand the declarative question CHUCK’S CAR IS WHITE? is realized with ris-ing pitch during the word WHITE. This is a case in which prosodic events are used to disambiguatethe illocutionary force of an utterance.

This thesis will present techniques to automatically detect and classify prosodic events, aswell as methods to provide this intonational information to spoken language understanding (SLU)systems; feature analyses of detection and classification experiments will examine the acousticcorrelates of these events. Proof-of-concept results demonstrating the value of this information inimproving downstream SLU performance will also be presented.

This proposal is structured in the following way. Chapter 2 describes the material that willbe used in prosodic event detection and classification experiments. Chapters 3 and 4 addressthe problems of detecting pitch accents and phrase boundaries, respectively. Approaches to theclassification of pitch accents and phrase-finals are presented in Chapter 5. Chapter 6 presents anumber of techniques predicting both accent and phrase boundary locations in an integrated frame-work. Chapter 7 describes a set of SLU applications that are improved with access to hypothesizedprosodic event information. Chapter 8 describes the contributions of the thesis. Chapter 9 outlinesthe plan for completing the work proposed in this document.

2 CorporaThe research described and proposed in this proposal involves the use of supervised machine learn-ing techniques, and thus requires the availability of speech material that has been manually anno-tated with prosodic events for training and evaluation. There is a substantial amount of this materialavailable from a variety of domains, genres, and speakers. Table 1 contains a comparison of theannotated material comprising each corpus.

While the experiments will report results from multiple corpora, certain corpora are more orless suited to particular tasks. For example, the IBM TTS corpus represents the largest body ofmaterial spoken by a single speaker; this will be used to demonstrate the performance of speaker-dependent modeling of prosodic events. On the other end of the spectrum, the Columbia GamesCorpus (CGC) contains a significant amount of material spoken by multiple speakers. This makesthe CGC particularly well suited for experiments involving speaker-independent modeling andexperiments on speaker adaptation. Many previous papers on prosodic event detection and clas-sification have reported results on the Boston University Radio News Corpus (BU-RNC) and theBoston Directions Corpus (BDC); evaluation on this material will allow us to draw comparisonsto between the techniques described in the thesis and those previously published. Experiments onother corpora will be included to evaluate the robustness of the proposed techniques to the variousidiosyncrasies represented by this diverse material.

2

Most of the corpora have been manually annotated using the full ToBI standard – includingmanual word boundaries. The exceptions to this are the TDT-4 and ETS material. Both of thesehave been labeled with only intonational phrase boundaries, and accent locations; no intermedi-ate phrase boundaries or tone labels were marked. The ETS corpus contains phrase accent andboundary tone labels at intonational phrase boundaries. No pitch accent type information has beenannotated for either of these corpora.

Due to Columbia’s participation in the DARPA GALE project, high quality automatic speechrecognizer (ASR) [108] transcripts are available for the TDT-4 corpus, allowing for the evaluationof completely automatic prosodic event detection and classification techniques. For similar eval-uation on other corpora, the CMU SPHINX open-source speech recognizer [18] will be used togenerate automatic transcripts; however, due to the modest amount of spoken material and openvocabularies comprising these other corpora, significantly higher word error rates are expectedthan those represented in the TDT-4 transcripts.

Corpus Name Length Length Number of Genre Annotation(mins) (words) speakers Standard

BDC-spontaneous [79] 60 11,627 4 Spontaneous Full ToBIBDC-read [79] 50 10,822 4 Non-professional read Full ToBIBU-RNC [81] 141 23,830 6 Professional read Full ToBIIBM TTS [43] 131 21,196 1 Professional read Full ToBI

Communicator [125] 67 12,183 unk. Professional read Full ToBITrains [47] 18.5 2,581 12 Spontaneous Full ToBI

Columbia Games Corpus 362b 73,837 13 Spontaneous Full ToBITDT-4 [109] 30 3,326 30a Professional read IPs,

binary accentsETS (Tentative) 168 32,316 34 Spontaneous IPs,

binary accents,phrase accents,boundary tones

a Count based on automatic speaker diarization output b 362 minutes have been labeled out of the 548 minutes in the corpus

Table 1: Brief Description of Corpora

3 Automatic Pitch Accent DetectionIn this chapter we address the task of automatic pitch accent detection. First, we summarize previ-ous work on this topic concerning both the identification of perceptual correlates to accenting andexisting automatic detection techniques. Then, we present the results of two completed studies,and describe our proposed work.

3.1 BackgroundThere is consensus that intonational prominence – accenting or bearing pitch accent – is markedby pitch movements, increased energy and prolonged duration [3, 6]. Words are said to be made

3

prominent from their surrounding content by containing acoustic excursions away from an other-wise neutral voice [63, 8] and pitch excursions, in particular, were largely taken to be the mostmarked cue to pitch accent. Clark and Yallup [26] described pitch as “the most salient determi-nant of prominence.” While Wightman and Ostendorf [128] found “[l]ittle discussion of energycues...in the linguistics literature, however, probably because energy is less important than F0 andduration in human perception of prominence.” However, recent studies has given more attentionto the correlation between speech energy and pitch accent. Silipo [103] found that in spontaneousspeech, duration and energy to be the most important acoustic parameters underlying accent withpitch playing only a minor role. Kochanski [60] expanded on this result with a lengthy analysis-by-classification of British and Irish English, finding f0 to be a weak predictor of prominence,with loudness and duration being much more discriminative of prominent and non-prominent syl-lables. In a pair of perception [106] and production [105] experiments, Sluijter and van Heuvenshowed that accent in Dutch strongly correlates with the energy within a particular frequency sub-band, that greater than 500Hz. This observation concerning “spectral emphasis” led to a number ofother studies examining relationship between “spectral balance” or “spectral tilt” and pitch accent[49, 48, 34, 17], finding similar strong correlations.

Building on these and other studies of acoustic correlates of accenting, the task of automaticallydetecting pitch accent has received a considerable amount of research attention. A wide range ofsupervised machine learning techniques have been applied to this problem. Ren et al. [95] usedan artificial neural network (ANN) to detect accented syllables with 83.6% accuracy on BU-RNCdata. Frame based Hidden Markov Models (HMMs) have been applied to the task with some suc-cess [20, 16, 27, 19]; 75% accuracy was reported using a three-stream coupled HMM to separatelymodel information streams of pitch, intensity and duration [1]. When including a syntactic modelwith an HMM acoustic model, 86% accuracy on BU-RNC data and 78.6% on was reported byRangarajan et al. [94, 93]. On a subset of the Switchboard corpus of spontaneous telephone con-versations, Gregory [39] applied Conditional Random Fields (CRFs) to detection pitch accent with76.4% accuracy. While using duration and speaking rate information, along with part of speechand collocational measures, this CRF model did not include any energy or pitch information. Sun[110] demonstrated the applicability of ensemble based methods to pitch accent detection; 87.2%syllable-level detection accuracy was demonstrated using a speaker-dependent Boosting approach.Decision trees, gaussian models, memory based systems and maximum entropy classifiers have allbeen used to detect pitch accent [128, 124, 12, 9, 10, 77]. Supervised learning techniques have theburden of requiring a large corpus of annotated training data; because of the resources necessary toprosodically annotate speech, unsupervised learning has also been used to train pitch accent detec-tors. Levow [69] and Ananthakrishnan et al. [2] applied clustering techniques (gaussian mixturemodels (GMM) and k-means) to the problem yielding 78% and 77% accuracy, respectively. Tam-burini [113, 114] constructed a prominence scoring formula based on pitch, duration, intensity andspectral balance quantities. This prominence score is then thresholded to detect accented syllables.

In addition to decisions made about the machine learning technique and feature representationused, any approach to automatic pitch accent detection makes a number of fundamental assump-tions about the task:

• On what unit should accents be detected? The acoustic excursions that lead to the percep-tion of accent are commonly aligned with a syllable. However, the semantic and pragmaticimplications of accenting lie at the word, or even phrase level. Proponents of syllable-based

4

accent detection make the claim that it is easy to translate from syllable to word prominenceand not vice versa. Word-based accent detection advocates would counter that acoustic real-izations of accent are rarely confined to a single syllable, and that mapping word to syllableprominence is not difficult; while accent is realized on the lexically stressed syllable of aword, identifying the stressed syllable within a word can be accomplished by either identi-fying canonical lexical stress from a dictionary, or by a second-pass acoustic analysis.

• How will the unit boundary be defined? Whether accents are being detected at the wordor syllable level, defining the word or syllable boundary manually or automatically has asignificant impact on the performance of the detector. Moreover, manual boundaries carrythe assumption that when used in a larger SLU system, similar manual word or syllableboundaries will be available.

• Will lexical and/or syntactic information be used? The use of lexical and/or syntactic in-formation in automatic prosodic event detection assumes the availability of automatic speechrecognition and syntactic analysis modules. While manual transcription and syntactic infor-mation can provide an upper-bound, the error introduced by these modules is likely to im-pact the accent detection performance. Given these caveats, research from the text-to-speech(TTS) community on prosodic assignment1 has demonstrated that there is a relationship be-tween a sequence of words and corresponding pitch accent locations. Chomsky and Halleposited that accent location is completely predictable from lexical and syntactic information[24]. Bolinger, however, famously responded to this claim that “accent is predictable – onlyif you’re a mind reader” [7]. More recently, Brenier [13] hypothesized that accent predictionusing shallow features such as part-of-speech has reached a performance ceiling at 76% onspontaneous speech and 85% on read speech.

The impact of these assumptions make comparing the performance of different approachesdifficult, if not impossible. Moreover, speaker variation and genre can have a significant impact ondetection performance; speaker-dependent detection of pitch accent on read speech is significantlyeasier than speaker-independent detection on spontaneous speech. When presenting our results,we will make comparisons with those published results that make similar assumptions and areevaluated on data similar to our own.

3.2 Completed WorkIn this section, we describe three studies of automatic pitch accent detection. The first describes theresults of applying common classification techniques to the task. The second describes a study ofthe correlation between pitch accent and the energy information extracted from specific frequencyregions. The third presents the results of an ensemble-based classification approach to pitch accentdetection based on the results of the energy correlation study. In each of these studies, the wordsof a particular corpus are classified as containing a pitch accent or not. For most of the corporaautomatic transcripts are not available, and manual word boundaries are used to define the data set.One exception to this is the TDT-4 corpus. The data points from this corpus are defined by wordboundaries generated by an automatic speech recognition system.

1Prosodic assignment is the task of assigning plausibly natural prosodic events to a string of input text.

5

3.2.1 Naıve Bayes, Decision Tree and SVM based detection

There are a wealth of binary classification models and training algorithms available2. We haverun experiments using three common supervised learning classification algorithms: Naive Bayes,J48 (an implementation of Quinlan’s C4.5 decision tree algorithm [92]) and SMO support vec-tor machines [89]. The Weka machine learning environment [130] includes implementations ofthese algorithms and was used to run these experiments each with ten fold cross validation. Forpitch accent detection, the classification task is to classify each word as being accented or not.Each word in each corpus is described by a vector containing acoustic features extracted from thespeech signal. In this set of experiments, only acoustic features are included in the feature vector.As described in Section 3.1, the main acoustic correlates of pitch accent are pitch, duration, andenergy. The results from these studies directly inform the features that are extracted and comprisethe feature vector. Bolinger describes pitch accent as a deviation from a baseline [8]. That is,there is a neutral baseline voice, and accented words are characterized as an excursion, above orbelow, this neutral voice. To capture this notion of deviation from a baseline, or “standing out”from surrounding material, acoustic features normalized by their surrounding context in a numberof ways are included in the feature vector.

For each word, we compute the minimum, maximum, standard deviation, root mean squaredand mean of pitch (f0) and intensity (I) values extracted using Praat’s [5] Get Pitch (ac). . . and GetIntensity... functions, respectively. We also compute the pitch features based on speaker normal-ized pitch values. The speaker normalization is performed using z-score3 normalization based onthe mean and standard deviation of all pitch values produced by a speaker. When speaker identifi-cation information is available as part of a corpus, these are used for the speaker normalization. Forthe TDT-4 corpus, an automatic speaker diarization module is used to determine speaker identity.Since pitch accent may be determined not only by pitch location, but pitch movement, we alsoincluded the above features calculated over the first order differences (∆f0, ∆I) of the raw pitchand intensity tracks as well as the speaker normalized pitch tracks.

To represent the local context we defined nine contextual windows. These windows wereconstructed using every combination of two, one or zero previous words and two, one or zerowords following a given data point. Based on the pitch values contained in these regions, weinclude in the feature vector, z-score and range normalization of the maximum and mean rawand speaker normalized f0 of the current word. Range normalization4 calculates the value of agiven point, x, relative to the range of values observed within a region. We extract three durationfeatures: the duration of the current word in seconds, the duration of the pause between the currentand previous word and the duration of the pause between the current and following word.

Cross-validation accuracy and F-Measure[121] of accented words results from Naive Bayes,J48 and SVM classification are reported in Table 2. In general, across corpora, we observe modestimprovement of J48 Decision trees over NaiveBayes models, and considerable improvement bysupport vector machines over other modeling techniques. Support vector machine accuracy rangesfrom 83.1% to 87.8%. Human agreement regarding pitch accent detection is commonly held to be

2According to the No Free Lunch Theory [131], no classification algorithm or modeling technique is a priorisuperior to any other. We, thus, run experiments using a variety of supervised machine learning algorithms.

3Z-score normalization measures how many standard deviations (σ) away from the mean (µ) a given point (x) is,and is calculated as x−µ

σ .4Range normalization of a given value, x, and a range, [min,max], is x′ = x−min

max−min .

6

between 81% and 91%, depending on recording conditions, and genre [133, 88, 40, 111].

Corpus NaiveBayes J48 Decision Tree SVMBDC-Read 79.5% / 0.767 79.2% / 0.753 84.2% / 0.813

BDC-Spontaneous 79.5% / 0.795 78.3% / 0.783 83.1% / 0.833BU Radio News Corpus 82.4% / 0.840 83.9% / 0.853 87.8% / 0.890

TDT-4 77.5% / 0.763 80.0% / 0.798 84.9% / 0.851Communicator 75.9% / 0.777 80.2% / 0.821 85.0% / 0.866

IBM TTS 71.2% / 0.700 81.4% / 0.806 84.7% / 0.841Trains 79.1% / 0.783 76.8% / 0.752 82.9% / 0.817

Table 2: Pitch accent detection Accuracy / Accented F-Measure

Sun [110] found ensemble based methods, Bagging [11] and Boosting [36], specifically, toyield high accuracy pitch accent detection results. We have run experiments using these ensemblebased methods, as well as Dagging [118], a Bagging variant, to test if similar high accuraciesare observed. The results of experiments, using the feature set described above, can be found inTable 3. Across corpora, we find Dagging and Bagging to generate higher accuracy models thanBoosting. However, neither Bagging nor Dagging is consistently superior to the other. Comparedto the standard classification algorithms reported in Table 2, Bagging and Dagging approachesperform slightly worse than SVMs but better than decision trees with one exception; on BDC-spontaneous, Bagging yields results that are 0.3% better than SVMs. This improvement is notsignificant.

Corpus AdaBoost.M1 Bagging DaggingBDC-Read 81.3% / 0.782 83.7% / 0.810 83.6% / 0.808



IBM TTS 80.0% / 0.792 84.4% / 0.842 84.3% / 0.837Trains 82.6% / 0.816 83.8% / 0.828 83.5% / 0.825

Table 3: Pitch accent detection Accuracy / Accented F-Measure – Ensemble based classifiers

3.2.2 Correlation of Spectral Information

Spectral information, specifically high frequency emphasis, has been shown to correlate with pitchaccent [105, 106, 34, 49, 48]. However, while there is consensus that high frequency emphasis isan indicator of pitch accent, there is no such agreement about the best way to measure this spectraltilt. The threshold for “high” frequency is defined inconsistently; a variety of fixed frequencyregions have been used, as have variable thresholds based on estimated f0 values.

Therefore, as described in [96], we examine the role of energy in American English pitch accentmore closely. Based on the results of Sluijter and Van Heuven, Fant and Heldner, we evaluate thepredictive capacity of spectral information – the energy components of different frequency regions– to pitch accent on BDC material. In an analysis-by-classification design, we extract energy

7

features from each of 210 frequency regions, as well as from the first and second formants. Theseregions are defined by varying the base frequency from 0 to 19 bark and the bandwidth from 1 to 20bark. The upper bound for any sub-band was 20 bark, as the Nyquist rate for each of the evaluatedcorpora is 8kHz. For each word5, and each frequency region, we extract a set of features basedonly on the energy it contained. The energy features we extract were identical to those energyfeatures used in the previous set of experiments (Section 3.2.1). The feature representation ofeach word included minimum, mean, maximum, standard deviation and rms of intensity and firstorder difference of intensity. Also, we extract the z-score and range normalization of maximumand mean intensity based on the nine contextual windows, as previously described. These ninewindows are each combination of two, one and zero previous words and two, one and zero wordsfollowing the current word. Using these extracted features, we train decision tree classifiers todetect pitch accent.

The most basic conclusion of Sluijter et al. is confirmed. The energy components of differentfrequency regions are able to predict pitch accent with accuracy varying by as much as 14.8%. Weinitially hypothesized that a single frequency region would emerge as the most predictive. How-ever, this is not confirmed. The most predictive frequency region for one speaker is not necessarilythe most predictive for another. Energy features extracted from formant regions are less predictivethan fixed band energy information. We make the claim that the region between 2 and 20 bark wasthe most predictive of pitch accent. While it does not yield the best results for any speaker, it is theonly region that is not significantly worse than the best for any. (We find later that energy featuresextracted from this region, when used in combination with pitch and duration features, were ableto predict accent no better than energy features extracted from the full frequency range of a speechsignal [98].)

One unexpected result of this research is the observation that there was relatively little withrespect to the correct predictions of even adjacent and overlapping frequency regions. That is,those data points correctly classified with respect to the presence of pitch accent vary significantlyfrom one frequency to another – even those that are spectrally close to one another. Moreover, wefind that greater than 99% of words were correctly classified by at least one of the 210 classifiers.If an oracle were available to identify a priori an energy region which contained discriminativeinformation for a given word, this set of classifiers could detect pitch accent with only negligibleerror. As 210 hypotheses are available for each data point, we combine these using a majorityvoting scheme. Using unweighted voting, a classification accuracy of 81.8% on both BDC-readand BDC-spontaneous is achieved. (Confidence weighted voting does not perform significantlydifferently.) Considering that this classifier structure does not use any pitch or duration information– strong correlates of pitch accent – this is a particularly promising result.

3.2.3 Ensemble-based detection

In this section, we present the results of a number of experiments using the findings of our study ofspectral correlates (see Section 3.2.2 and [96]) with pitch and duration information to detect pitchaccent. This research was originally reported in [98]. In this study, we include the TDT-4 corpusin our analysis. This allows us to test the approach using ASR word boundaries in addition to themanual word boundaries available with the BDC corpora.

5Experiments are also run where classification unit is the stressed syllable or syllable nucleus, however, the findingsare similar to those described at the word-level, and classification accuracy is lower.

8

In the first experiment, we extend the feature vector containing pitch and duration features withthe 210 predictions from the filtered-energy based classifiers. We then use this feature vector totrain another J48 decision tree. Note that when evaluating this classifier in a cross-validatationsetting particular attention is paid to guarantee that none of the elements of the testing set wereused in constructing the predictions included in the training set feature vector. To that end for eachtraining and testing set, an additional ten-fold cross-validation scenario is run on the training set toproduce predictions for use in the training feature vector. The testing set predictions are based onenergy-based classifiers trained on the full training set.

In a second classifier design, we make the relationship between pitch and duration informa-tion and filtered energy based classifiers explicit. For each frequency band, we build a pitch andduration based classifier that predicts whether or not the energy based prediction from the givenfrequency band will be correct. As above, particular care is taken when performing ten-fold crossvalidation on this two stage classifier to insure that no test data point is used in the generation ofany training set prediction. For each energy based prediction, this second classifier using pitchand duration features classifies this prediction as either ‘correct’ or ‘incorrect’; predictions that areclassified as ‘incorrect’ are inverted. Thus, an ‘accent’ prediction becomes ‘non-accented’ and viceversa. Since this correction is performed for each filtered energy based classifier, we obtain 210‘corrected’ pitch accent predictions. We combine these into a single final prediction using majorityvoting.

Results from these experiments can be found in Table 4. We are able to correctly classify pitchaccent with 84.0% accuracy on BDC-read, 88.3% on BDC-spontaneous, 88.5% on TDT-4. Withthe exception of BDC-read where accuracy is slightly lower than that obtained from SVMs, webelieve these to represent pitch accent detection results that are competitive to the state-of-the-art. Due to differences in evaluation material – both corpus and unit of analysis – more specificcomparisons between this and other approaches is impossible. However, these results based onword-level classification are higher than any published results on syllable-level detection (87.2%accuracy [110]), and the syllable-level detection task has a higher random baseline which shouldmake classification accuracies still higher. The fact that the accuracy on the TDT-4 corpus isnot significantly different from that obtained on the BDC spontaneous corpus indicates that thetechnique is relatively robust to the accuracy of word boundary placement. Recall that the BDCword boundaries were manually defined, while the TDT-4 boundaries are a result of ASR output.There is no obvious explanation for the failure of this technique to improve over SVM classificationon the BDC-read corpus.

BDC-read BDC-spontaneous TDT-4Pitch/Dur Corrected Voting 84.0% 88.3% 88.5%Pitch/Dur + Predictions 78.8% 77.5% 80.3%Majority Voting 81.8% 81.8% 83.7%‘Best’ Band Energy 79.1% 78.6% 79.6%No Filtering 79.2% 78.3% 80.0%

Table 4: Energy-based Pitch Accent Classification Accuracy

9

3.3 Proposed WorkThe correcting classifier technique as reported above uses J48 decision trees for the energy basedpitch accent classifiers as well as the correcting classifiers. As observed in Table 2, SVMs areable to yield higher pitch accent classification accuracy. Therefore, we propose to reproduce thecorrecting classifier experiments using SVM models instead of decision trees. To more fully ex-amine the influence of modeling technique we will also use Naıve Bayes models. Additionally,this technique will be evaluated on all corpora described in Chapter 2.

The thesis will examine the interaction between lexical and acoustic information in the predic-tion of pitch accent. Three techniques will be used to incorporate lexical and syntactic informationinto the pitch accent detection framework. Part-of-speech tags will be generated using an imple-mentation of the Brill tagger [14]. The Brill tagger produces hypothesized part of speech tagsfrom the inventory described by the Penn Treebank [76]. In addition to using the full set of 36Penn Treebank tags, tags will be collapsed in two ways: 1) function (nouns, adjectives, verbs,etc.) v. content words (determiners, auxiliary verbs, conjunctions, etc.) and 2) nouns, adverbs,adjectives, cardinal numbers, verbs and miscellaneous6.

1. Extend the feature vector Including lexical and part of speech features in the feature repre-sentation used when training a detector will allow the machine learning technique to lever-age this information into improved detection accuracy. Lexical identity and part of speechn-grams have been shown to be useful for pitch accent assignment. In addition, it has beenhypothesized that discourse given words are less likely to be accented than discourse newwords [15]. As a representation of “given-ness”, a distance measure from a given word to itsnearest previous mention will be extracted and included in the feature representation.

2. Syntactic-class-dependent modeling Words of different syntactic classes are accented withdistinct frequencies; content words, such as nouns and verbs, are significantly more likely tobe accented than function words, such as prepositions and determiners [50]. These priors,as well as production differences between accenting different word classes, can be modeleddirectly by training unique models for each syntactic class, defined as the full set of PennTreebank part of speech tags, collapsed tags or function v. content words.

3. Model Combination Part of speech and lexical information alone can predict pitch accentlocation with 76% accuracy [13]. The output of a syntactic-prosodic model will be com-bined with that of a acoustic-prosodic model to generate a final pitch accent hypothesis.Approaches, similar to this, combining independent acoustic- and syntactic-prosodic mod-els have have been examined by Rangarajan [94, 93], Chen [21] and Conkie [27].

When researching the use of textual information in the detection of pitch accent, automatic aswell as manual transcripts will be used whenever possible; this will allow us to measure the impactof ASR errors on pitch accent detection. If access to higher-quality ASR engines is not possible,the open source SPHINX speech recognition engine [18] will be used to generate transcripts ofthose corpora described in Chapter 2 which have not already been automatically transcribed

6A similar collapsing of part-of-speech tags was used by Ross and Ostendorf [100]

10

4 Automatic Phrase Boundary DetectionIn this chapter we discuss automatic approaches to phrase boundary detection. In the first sectionwe describe previous research. The second section presents results of preliminary machine learningexperiments using acoustic information to detect phrase boundaries. In the third and final sectionwe describe the direction that the thesis will take this research.

4.1 BackgroundPhrase boundaries are defined by perceived disjuncture. According to the ToBI framework, dis-juncture between words is annotated as “break indices” – a five-point scale from 0 to 4. Thegreatest degree of disjuncture occurs between full intonational phrases, and is annotated by a breakindex of 4. Intonational phrases are composed of one or more intermediate phrases and a finalpitch movement or “boundary tone” following the phrase accent of the final intermediate phrase.Intermediate phrase boundaries are perceived as less disjoint than intonational phrase boundariesand are indicated by a break index of 3. These are characterized by a single phrase accent af-fecting the region from the last pitch accent to the boundary. A break index of 0 is used to markclitic groups as in the contraction of ‘did you’ containing a medial affricate, while 1 correspondsto normal within-phrase word boundaries. Break indexes of 2 are less common, and are used toannotate boundaries where a strong sense of disjuncture are perceived, but with no correspondingtonal event – phrase accent or boundary tone – to indicate an intonational or intermediate phraseboundary.

The most salient acoustic indicators of phrase boundaries are the presence of silence, and pre-boundary lengthening. Silence is a clear indication of disjuncture; it represents a region withoutany speech information. While the presence of a silent region correlates with phrase boundarylocations [80, 75, 67], the length of a pause is useful for classifying the degree of disjuncture [129,59]. Pre-boundary lengthening describes the phenomenon by which the final syllable in a phrasehas a longer than expected duration. In English, syllables preceding phrase boundaries tend tohave a longer duration than otherwise [35]. This pre-boundary lengthening has also been observedin Chinese (Guoyu and Putonghua), Japanese [35], Russian and Swedish [123]. Wightman [129]found that pre-boundary lengthening is able to indicate disjuncture even at small boundaries thatdo not include a pause. Syllable length can be influenced by factors other than proximity to aphrase boundary, including speaking rate, vowel identity and accenting. To accurately representpre-boundary lengthening effect, these competing factors need to be accounted for.

Phrases often begin with increased intensity and pitch which slowly declines over the courseof the phrase. Declination resets – initial strengthening and pitch reset – have been observed ascorrelates of phrase boundaries. Often the first syllable, or word following a phrase boundarywill be produced with greater intensity than the word immediately preceding the boundary [35];this is referred to as “initial strengthening”. Ladd [64] noted that a speaker’s pitch range is resetfollowing a phrase boundary – “pitch reset”. ‘t Hart [112] described this phenomenon as “rapidupward jumps of the baseline [pitch]”. de Pijper [30] identified melodic discontinuity as the mostsignificant prosodic cue to phrase boundaries. Fon, however, [35] found no interaction betweenphrase boundaries and measures of pitch reset in English, though she did observe an effect of pitchreset in Chinese (Guoyu and Putonghua) and Japanese.

Numeric representations of these acoustic cues have been used by machine learning techniques

11

to automatically detect phrase boundaries. Wightman and Ostendorf [128] simultaneously detectedpitch accents and phrase boundaries using decision tree output as input to an HMM using pitchand duration based features. This combination of decision trees and HMMs was able to classifybreak indices with 67% accuracy. Based on the assumption that phrases show a relatively linearF0 declination, and that pitch resets or discontinuities to a new declination line signal a phraseboundary, F0 contour modeling has been used to identify phrases [56, 101, 62]. Braunschweiler[9, 10] showed that a speaker-dependent example based system can detect phrase boundaries with76% and 78% accuracy on American English (BU-RNC) and German speech, respectively.

While not completely determined by syntax, lexical and syntactic information have been showedto be useful cues to phrase boundary detection. Rangarajan [94, 93] combined a maximum entropysyntactic-prosodic model and acoustic-prosodic HMM using a maximum entropy objective func-tion to identify intonational phrase boundaries with 93% accuracy on BU-RNC and 90.5% on BDC(read and spontaneous subcorpora). Taking a similar model combination approach, Chen [21] rep-resented the syntactic-prosodic relationship with an ANN and the acoustic-prosodic mapping witha GMM. The two were coupled using a maximum likelihood recognizer to detect following into-national phrase boundaries correctly on 93% of words in the annotated portion of BU-RNC.

4.2 Completed WorkIn this section, we present details and results of experiments using machine learning methods toautomatically detect intonational and intermediate phrase boundaries. Similar to the 3.2, each wordin the corpora (see Chapter 2) is classified as immediately preceding a phrase boundary or not.

Word boundaries are defined by manual time-aligned transcriptions in most of the corpora. Au-tomatic transcriptions are available for the TDT-4 corpus, and are used to define the word bound-aries over which features are extracted. The feature vector used in the pitch accent detection models(see Section 3.2) contains a subset of the features extracted here. Features that represent acousticcorrelates of perceived disjuncture augment the feature representation used by the intonational andintermediate phrase boundary detection models. Because acoustic cues to phrase boundaries – pre-boundary lengthening, phrase accent, boundary tone pitch movements (for intonational phrases) –occur immediately prior to these boundaries, we extracted acoustic features from the final fourth ofeach word (in seconds) and the final 200ms. Liscombe [72] found acoustic information extractedfrom the final 200ms of an utterance to be highly discriminative of question types – yes/no vs. wh-quesions. Phrase accents and boundary tones are believed to disambiguate these question types. Ifquestion types can be distinguished based on the final 200ms of an utterance, we hypothesize thatthis region contains information useful to boundary tone spotting and therefore phrase boundarydetection. To capture declination resets that cue phrase boundaries, we included in the feature vec-tor the difference between the mean of the final N pitch and intensity samples of the current wordand the first N samples of the following word. This feature is calculated with N ∈ {1, 5, 10}.

The results of ten-fold cross-validation experiments using Naive Bayes, J48 decision trees andSVMs can be found in Tables 5 and 6. We find that high accuracy phrase boundary detection isachievable. SVM classifiers yield intonational phrase boundary detection accuracy between 86.1%and 96.9%, and intermediate phrase boundary detection accuracy up to 93.2%. These accuracyscores are competitive with previous approaches to intonational phrase boundary detection (seeSection 4.1). Recall that all intonational phrase boundaries are also intermediate phrase bound-aries. The performance reported in Table 6 largely reflects the success of detection intonational

12

phrase boundaries. Intermediate phrase boundaries that are not also intonational phrase bound-aries (break index 3) are particularly difficult to distinguish from within-phrase word boundaries.While the f-measure from detecting full intonational phrase boundaries is approximately 0.7, whenexamining only the intermediate phrase boundaries marked with break index 3 the f-measure dropsto approximately 0.3.

When interpreting phrase boundary performance it is important to recall that this task has ahigh baseline; on average each intonational phrase contains 5.30 words, each intermediate phrase3.96. Because of this class bias in the classification accuracy, measuring the detection performancevia F-measure [121] provides a better description of the performance than accuracy. Despite theclassification accuracy being high, the F-measure of the phrase boundary classes remains fairlymodest. With the exception of the Communicator corpus the best intonational phrase F-measureachieved on each corpus falls between 0.641 and 0.775, intermediate phrase F-measure rangesfrom 0.640 and 0.787. Previous results do not report results in terms of f-measure, making directcomparisons impossible. Regardless, the modest F-measure indicates that there remains substantialroom for improvement.

The length of following silence is the most useful feature for phrase boundary detection acrossall corpora. In addition to this, values concerning minimum pitch and change in energy (∆I) arevery discriminative. The usefulness of minimum pitch features is likely due to the fact that mostphrase boundaries are realized with low (L- or L%) phrase-final tones. Because of this phrase-final behavior, the lowest pitch produced by a speaker is often coincident with his or her phraseboundaries. A low contextually normalized ∆I value also indicates disjuncture. A negative meanchange in energy corresponds to an energy contour that starts higher than it ends. Thus a negativemean change relative to surrounding context indicates that a word ends with lower energy thanthose words before or after it. This roughly corresponds to the relative depth of the energy valleyat the end of the word.




IBM TTS 65.2% / 0.451 82.1% / 0.597 86.1% / 0.641Trains 81.6% / 0.645 89.9% / 0.747 88.4% / 0.696

Table 5: Intonational Phrase boundary detection Accuracy / Boundary F-Measure

4.3 Proposed WorkThe thesis will address phrase boundary detection in two main directions. First, we will examinethe relationship between lexical content and phrase boundary locations. This will be similar tothe directions proposed in Section 3.3. Representations of lexical and syntactic information willbe included in feature vectors for supervised learning. Model combination will be used to com-bine predictions produced by lexical/syntactic and acoustic phrase boundary prediction modelsto generate a final hypothesis. Second, phone hypotheses from an automatic speech recognition

13



Communicator 78.3% / 0.639 91.0% / 0.806 93.2% / 0.842IBM TTS 64.7% / 0.483 80.4% / 0.612 83.8% / 0.640

Trains 80.6% / 0.666 88.9% / 0.757 88.5% / 0.737

Table 6: Intermediate Phrase boundary detection Accuracy / Boundary F-Measure

module will be used to extract pre-boundary lengthening information. To investigate the use oflexical information in prosodic event detection, we will be generating ASR transcripts for as muchannotated speech as possible. This approach requires careful normalization of syllable, or vowellengths. Vowels have different mean durations due their phonemic identity alone; for example,I, as in “bit”, is typically produced with shorter duration than O, as in “dog”. Speaking rate andspeaker idiosyncrasies as well as lexical and prosodic stress also impact the duration of a syllable.Teasing apart these influences from pre-boundary lengthening phenomena remains a challengingnormalization task.

5 Phrase-Final Tone and Pitch Accent Type ClassificationChapters 3 and 4 present approaches to detecting prosodic event locations. In this chapter westudy the task of automatically distinguishing pitch accent types, and categories of phrase-finalpitch movements – phrase accents and boundary tones. Following a brief introduction describingprevious literature, we present results of preliminary classification studies, and describe proposedwork.

5.1 BackgroundThe ToBI standard is, in part, based on Pierrehumbert’s theory [85] that intonation can be describedas a series of high and low tones. Following this theory, boundary tones (H%, L%) and phraseaccents (H-, L-) are annotated as either high (H) or low (L). In addition to low (L*) and high (H*)pitch accents, the ToBI standard defines complex pitch accents characterized by pitch movementsfrom low to high; L*+H is a “scooped accent”, a low accented syllable followed by a sharp riseand L+H*, “sharp rise”, is a high accented syllable preceded by a sharp pitch rise. H+!H* marks astep down to an high accent from higher pitched unaccented speech. High tones describing pitchand phrase accents can also be produced in a compressed pitch range, these “downstepped” hightones are marked by using the ‘!’ diacritic.

It has been posited that the use of different pitch accent types, phrase accents and boundarytones carry distinct communicative implications. In Chapter 1, we present the example in whichphrase-final pitch movement is the only distinguishing feature between the declarative statement,CHUCK’S CAR IS WHITE, and the declarative question with identical orthography, CHUCK’S CAR

IS WHITE? Pierrehumbert and Steele [87] found the use of L+H* pitch accent to indicate certaintyand L*+H to cue incredulity. Ward and Hirschberg [127], however, found L*+H to signal uncer-

14

tainty. The L-H% phrase final behavior is known as “continuation rise” and has been associatedwith a sense that there is “more to come” [8]. Phrase-final falling pitch movements, on the otherhand, have been hypothesized to indicate certainty and finality [8]. Pierrehumbert and Hirschberg[86] proposed a compositional theory of the discourse implications of prosodic tone sequences.For example, the H* accent signals a hearer to add the accented item to his or her mutual beliefs7.Low toned accents (L*) may be used to mark an item as salient when it is believed to already bepart of the hearer’s mutual beliefs. Regarding phrase accents and boundary tones, Pierrehumbertand Hirschberg hypothesize that H- phrase accents indicate that the current phrase is joined withthe following phrase to form a larger communicative unit, while L- phrase accents indicate greaterseparation between the two. High boundary tones (H%) indicate a “forward reference”, that thecurrent phrase will be “completed” by a subsequent phrase while low boundary tones (L%) carryno such directionality.

Pitch accent types are identified by pitch targets or distinctly shaped pitch movements; phraseaccents and boundary tones are, by definition, pitch movements immediately preceding phraseboundaries. Thus, pitch accent type and phrase-final classification approaches rely heavily on f0features. Ostendorf and Ross [82] describe a hierarchy of Gaussian models of pitch and energyto represent dependencies between the segment, syllable and phrases. This modeling approachachieved 85% accurate pitch accent and phrase final classification on a single-speaker subset of theBU-RNC. Braunschweiler [9, 10] reported 60% and 65% pitch accent type classification accuracyon American English (BU-RNC) and German speech, respectively, using a speaker-dependentexample-based model. Phrase-final classification accuracies of 68% and 71% on American Englishand German data were also reported. Levow [68] showed that context – both phrase-based andstatic – is helpful for classification of both pitch accent type. Using an SVM-framework with alinear kernel, she reported 81.6% pitch accent type classification accuracy. Fach [33] used HMMswith f0 based features to classify German pitch accents as either “rise” or “fall”, reporting 81%accuracy. Ishi et al. [57] used a CART approach to correctly classify 76% of phrase-finals inJapanese speech. These experiments confirmed that degree of pitch change within the phrase finalsyllable and pitch reset across the phrase boundary are useful indicators of phrase-final categories.

There is consensus that words are accented in a variety of ways, and that there are differen-tiable pitch movements prior to phrase boundaries. However, there is some debate about whetheror not pitch accent types, phrase accents and boundary tones ought to be represented categorically.The hypotheses of discourse implications posited by Pierrehumbert and Hirschberg [86] supportthe categorical representation of prosodic event types used by ToBI; if pitch accent types carry dis-tinct, non-continuous, implications, then their perception is necessarily categorical. Taylor [115],however, criticized the position that prosodic events should be represented categorically. His re-jection of a categorical representation of intonation is based on the lack of empirical evidence ofcategorical boundary between high and low tones. While not drawing any argument that “a typicalH* accent is different from a different L* accent”, he applies the metaphor of “hot” v. “cold”. De-spite being able to label certain temperatures as “hot” and others as “cold” there exist intermediatephenomena that are impossible to categorize either way, such is true, Taylor claims, of H* and L*accents. However, to reconcile this with the findings of Pierrehumbert and Hirchberg (and others)a mapping between this proposed intonational continuum and a continuum of interpretation – dis-

7A discourse participant’s “mutual beliefs” are those beliefs both shared by speaker and hearer and believed to beshared.

15

course meaning, paralinguistic or otherwise – is required. Contributing to this discussion from anempirical point of view, Ishi et al. [57] found a high-rate of confusion between categories and astrong linear correlation between continuous acoustic features and fine grained (11-class) humanjudgment of the degree of rise. These two findings suggest that a continuous representation ofphrase-final rise would be more a appropriate representation of human perception than a categori-cal one. On the other hand, Demenko [31] clustered boundary tones into four clusters along pitchdimensions – slope, range and bend. The lowest variance four boundary tone clusters correspondedthe four standard ToBI phrase accent/boundary tone combinations. This finding supports the useof categorical labels of phrase final behavior – regardless of whether or not the perception of thisbehavior is continuous, evidence of categorical production of phrase finals has been observed.

5.2 Completed WorkClassification of pitch accent types and phrase-final tones is somewhat different from accent andphrase boundary detection. In the detection task, the goal is to identify an event that stands outfrom its surroundings. This suggests the use of context normalization techniques to capture theseacoustic excursions numerically. When classifying prosodic events, the shape of the pitch move-ment within the realization of an accent or phrase final is more important than its relationship tosurrounding acoustic material.

We have applied three classification algorithms to the task of classifying pitch accent types andphrase-final tones: Naıve Bayes, J48 decision trees, and SVMs. For these classification tasks, it isassumed that the presence of these events has already been determined. That is, for pitch accentclassification, only those words bearing pitch accent are classified, those without pitch accent areexcluded from the data set. Similarly, only those words preceding phrase boundaries are used in thephrase-final classification experiments. The feature vector used in these experiments is based onthat used in phrase boundary detection experiments (see Section 4.2). This feature representationis augmented with some basic features to represent the shape of pitch and energy contours; weincluded the location of the maximum and minimum pitch and energy value. This location wasmeasured both in seconds – for example, the pitch maximum occurs 0.3 seconds after the start ofthe current word – and relative to the duration of the word – for example, the energy minimumoccurs 65% through the word.

In each of these experiments, we have ignored the downstep diacritic (!), collapsing down-stepped and non-downstepped classes of pitch and phrase accents8. Downstepping, or catathesis,is defined as a compression of the speakers pitch range, a quality that distinct from the contourshape differences between other pitch accent types. This collapsing of tones, makes phrase ac-cent classification a binary classification task, distinguishing between high (H-, !H-) and low (L-)accents. Intonational phrase-final classification is a four-way classification task. The four phraseaccent and boundary tones combinations are L-L%, L-H%, H-L% and H-H%. H-L% and !H-L%have been collapsed into a single class. The results of ten-fold cross-validation can be found inTable 8. Pitch accent classification, on the other hand, is a five-way classification task. The pitchaccent types are H* (collapsed with !H*), L+H* (L+!H*), L*, L*+H (L*+!H), and H+!H*9. Theresults of these classification experiments are reported in Tables 7, 8 and 9.

8The impact of this and other collapsing of ToBI tones is a question that will be addressed in the thesis.9H+!H*, while using the ‘!’ diacritic, does not have a non-downstepped variant

16

The performance achieved by the pitch accent type experiments is quite poor. None of the threeclassification algorithms are able to reach accuracies that are significantly higher than the majorityclass baseline. One probable cause of this poor classification accuracy is the definition of the unit ofanalysis. By classifying pitch accent type using word-level acoustic features, a significant amountof noise is introduced. The acoustic features aggregate information not only from the realizationof the pitch accent, but also the rest of the word. This would be troublesome enough if each wordwere the same length, but words of different lengths will include different amounts of noise.

The phrase final classification experiments reveal a significant genre bias. SVM classificationof non-professional, spontaneously produced, BDC material demonstrates a significant reductionof error from the majority class baseline: 42.4% on intonational phrase-final classification and38.4% on phrase accent classification. On the IBM TTS material, read by a single professionalspeaker, the corresponding error reduction is 9.3% when classifying intonational phrase-finals, and0% – no improvement over baseline – when classifying phrase accents. The IBM TTS materialshows the most severe impact of genre on the classification of phrase finals. SVM classification ofread material other IBM TTS shows greater improvement over baseline, though significantly lessthan that demonstrated by BDC-spontaneous.

Corpus Majority Class (H*) NaiveBayes J48 Decision Tree SVMBaseline

BDC-Read 78.2% 5.5% 74.2% 81.0%BDC-Spontaneous 84.6% 6.1% 79.8% 84.9%

BU Radio News Corpus 71.3% 34.1% 62.6% 71.9%Communicator 77.6% 45.2% 75.5% 81.6%

IBM TTS 51.2% 20.6% 44.0% 54.9%Trains 80.4% 52.1% 76.0% 81.7%

Table 7: Pitch Accent Type classification Accuracy

Corpus Majority Class NaiveBayes J48 Decision Tree SVMBaseline

BDC-Read 51.4% (L-L%) 38.6% 68.2% 74.1%BDC-Spontaneous 34.1% (L-H%) 34.9% 51.2% 62.2%

BU Radio News Corpus 63.5% (L-L%) 63.9% 81.1% 85.9%Communicator 58.0% (L-L%) 73.5% 81.3% 84.3%

IBM TTS 57.2% (L-L%) 34.8% 54.5% 61.0%Trains 60.1% (L-L%) 50.8% 65.6% 66.5%

Table 8: Intonational phrase-final classification Accuracy

5.3 Proposed WorkMany previous classification experiments collapse the ToBI inventory of pitch accent types intoa smaller set. The thesis will evaluate the impact of collapsing pitch accent types into high (H*,L+H*) and low (L*, L*+H) categories as well as collapsing downstepped accent types with corre-sponding non-downstepped versions. Additionally, in the applications presented in Chapter 7, we

17

Corpus Majority Class (L-) NaiveBayes J48 Decision Tree SVMBaseline

BDC-Read 77.2% 46.9% 79.3% 86.6%BDC-Spontaneous 58.6% 49.4% 69.5% 74.5%

BU Radio News Corpus 75.3% 76.0% 83.3% 87.4%Communicator 66.1% 79.5% 84.6% 86.7%

IBM TTS 81.1% 58.9% 76.1% 81.1%Trains 88.5% 82.9% 86.8% 91.4%

Table 9: Phrase accent classification Accuracy

will address the effect that these decisions have on the usefulness of prosodic event type hypotheseson downstream SLU module performance.

The thesis will also investigate two approaches to improving prosodic event type classification.

1. Representing pitch contour shape Pitch accents fall on the stressed syllable of lexicalitems. Extracting features from the word level to classify pitch accent type is likely to haveintroduced noise into the classification experiments described in Section 5.2. In order toisolate the pitch accent excursion, the stressed syllable of accented words will be identified;acoustic features will be extracted from only this syllable for classification. Moreover, thecurrent acoustic features do not explicitly represent pitch or energy contour shapes. TILTcoefficients [115] represent the rise, fall and skew of pitch contours, and can also be cal-culated over energy contours without modification. These will be included in the featurerepresentation of each accented syllable. Additionally, piecewise linear fits of pitch and en-ergy contours can be used to provide a stylized representation of each. The slope and lengthof fit lines will be used to define a numerical representation of the contour shape.

Example-based learning such as Walter Daelman’s TIMBL algorithm [28] should be partic-ularly suited to this task. Due to the skewed class distribution of pitch accent types – on thespontaneous portion of the BDC corpus 84.6% of pitch accents are H* – comparing instanceswith training exemplars may prove more successful than the modeling techniques that wereexperimented with in Section 5.2.

2. Modeling speaker differences. Realizations of prosodic event types are likely to revealsignificant differences across speakers. In a preliminary study, we will test this hypothesisby running speaker-dependent classification experiments. If speaker-dependent modelingshows a significant improvement over the speaker-independent results described above, in-dividual differences represent a significant source of error. Assuming this hypothesis isconfirmed, we intend to apply model adaptation and selection techniques with speaker-clustering in hopes of closing the gap between speaker-dependent and speaker-independentperformance.

6 Integrated Prosodic Event DetectionIn this chapter, we examine the relationship between pitch accent and phrase boundary locations.We motivate this line of inquiry, provide results of a preliminary study, and outline future work.

18

6.1 BackgroundThe prevailing theories of intonation described by Bolinger [8], Ladd [63], and Pierrehumbert [85]all imply a relationship between accenting and phrasing. Words bearing pitch accent are said to“stand out” from their surroundings. These surroundings are implicitly defined by the containingprosodic – intermediate or intonational – phrase. Phrases are defined as acoustically coherent with-out significant disjuncture. Any disjuncture – silence, preboundary lengthening, and/or declinationline reset – that marks a new prosodic phrase implies a new acoustic context.

We intend to examine the relationship between pitch accent and phrase boundary location.The most common way that the tasks of pitch accent and phrase boundary detection have beencombined is by detecting the two simultaneously, using a single model. This defines a four-wayclassification task; each unit is labeled as 1) accented and preceding a phrase boundary, 2) deac-cented and preceding a phrase boundary, 3) accented and within an intonational phrase, or 4)deaccented and within an intonational phrase. Rangarajan [93, 94] found improved detection ofboundary tones but worse pitch accent detection when performing this type of simultaneous de-tection as opposed to modeling the two phenomena separately. Levow [70] used a factorial CRFto simultaneously classify pitch accent, boundary tone and phrase accents with high accuracy. Onfour speakers from BU-RNC phrase boundaries were detected with 91.1% accuracy, pitch accentswith 86.2%. Wightman and Ostendorf [128] defined the problem not as simultaneous pitch accentand phrase boundary detection, but rather as pitch accent and boundary tone spotting. As bound-ary tones only occur at intonational phrase boundaries these are equivalent tasks. They reported84.5% pitch accent detection accuracy and 94% boundary tone spotting accuracy by using decisiontree classification hypothesis as input to an HMM. Ostendorf and Ross [82] explicitly modeled thehierarchy from acoustic frames, to phones, to syllables, to tone sequences, to phrase boundariesto simultaneously recognize pitch accent, phrase accent and boundary tones10. This model, unlikeother approaches to simultaneous detection, uses estimated phrase boundary locations to normalizepitch features when decoding the tone – pitch accent, phrase accent or boundary tone – sequence.

In a study that directly examined the relationship between pitch accent and phrase boundarylocation, Wang and Hirschberg [126] demonstrated that pitch accent location (either oracular andhypothesized from text) can be used to improve phrase boundary assignment from text. They foundthat phrase boundaries are more likely to follow words bearing pitch accent than those that do not.This suggests that hypothesized pitch accent locations should be able to be used to improve phraseboundary detection.

6.2 Preliminary StudyIn this section, we describe a preliminary study investigating the relationship between pitch accentand phrase boundary detection. In this study we train classifiers to simultaneously detect pitch ac-cent and phrase boundaries. This is structured as a four-way classification, as described in Section6.1, the cross-product of binary pitch accent and phrase boundary classes. We apply three machinelearning techniques to this classification task: Naıve Bayes, J48 Decision Trees, and SVMs. Thefeature vector used for phrase boundary detection (Section 4.2) is used in these experiments. Theresults of ten-fold cross-validation experiments are reported in Tables 10, 11 and 12.

10Note: By detecting phrase accents and boundary tones, this model implicitly detects phrase boundaries.

19

On all corpora with the exception of IBM TTS, training an SVM to detect pitch accent si-multaneously with phrase boundary location increases the accuracy and f-measure of pitch accentdetection the independent approaches presented in Section 3.2. On some corpora, these increasesin pitch accent detection performance are balanced by performance reductions in phrase boundarydetection. However, there are corpora in which accuracy and f-measure of phrase boundary detec-tion also increases above the results reported in Section 4.2, namely the BDC-read, IBM TTS andTrains material.

These results suggest that detecting phrase boundaries and pitch accents with a single modelrather than introducing more confusability, increases performance. These results suggest thatpitch accented words preceding phrase boundaries are produced differently than phrase-medialaccents. Similarly, intonational phrase boundaries following words not bearing a pitch accent maybe marked differently than those following prominent words. By allowing the machine learningtechnique to model these within class differences, the overall performance is increased. There is,however, a mitigating factor to this claim. The feature vector used in this classification task is theidentical as that used in phrase boundary detection experiments described in Section 4.2. This con-tains a superset of the features used in the pitch accent experiments described in Section 3.2. Theinclusion of additional features may also have contributed to the observed increase pitch accentdetection performance. This possibility will be evaluated in the thesis.

Corpus NaiveBayes J48 Decision Tree SVMBDC-Read 71.3% 75.3% 80.4%

BDC-Spontaneous 64.0% 75.7% 76.5%BU Radio News Corpus 69.0% 74.8% 80.9%

TDT-4 58.0% 72.1% 76.9%Communicator 68.0% 77.5% 83.4%

IBM TTS 53.1% 66.2% 73.5%Trains 63.9% 71.5% 74.2%

Table 10: Simultaneous Event Detection: 4-class Accuracy




IBM TTS 67.9% / 0.445 81.0% / 0.590 85.4% / 0.666Trains 80.2% / 0.630 89.7% / 0.746 88.4% / 0.702

Table 11: Simultaneous Event Detection: Intonational Phrase Boundary Accuracy and F-Measure

6.3 Proposed WorkThe literature suggests that pitch accent and prosodic event detection are not mutually exclusivetasks; pitch accent location information may improve phrase boundary detection performance and

20




IBM TTS 71.5% / 0.693 79.8% / 0.787 83.9% / 0.822Trains 80.0% / 0.795 79.0% / 0.775 83.1% / 0.820

Table 12: Simultaneous Event Detection: Pitch Accent Accuracy and F-Measure

vice versa. We will investigate a number of techniques evaluate this hypothesis and to model thisinteraction. As in Section 6.2, the performance of these approaches will be evaluated by bothoverall accuracy – a data point is correct if both accent and phrase boundary are correctly detected– and by accent and phrase boundary detection accuracy and f-measure, evaluated separately.

1. Pilot Experiments We will first evaluate techniques to use human annotated, ground-truth,pitch accent information to improve phrase boundary detection, and phrase boundary loca-tions to improve pitch accent detection. An N-gram of pitch accents preceding the currentboundary will be included in the feature vector to predict phrase boundaries; conversely aphrase boundary N-gram will be included in the pitch accent prediction vector. Furthermore,phrase boundaries will be used to define a context region over which to normalize acousticfeatures. The impact of this phrase-based context normalization will be compared to thestatic context windows used in the experiments described Section 3.2.

2. Iterative Detection We hypothesize that pitch accent location can improve phrase boundarydetection and vice versa. Assuming the pilot experiments confirm this hypothesis, we intendto evaluate to what degree hypothesized prosodic event locations provide similar improve-ments. The mutual improvement of the detection of phrase boundary and accent detectionimplies a circular relationship. Improved phrase boundary hypotheses can be generated us-ing pitch accent hypotheses. Yet, these pitch accent hypotheses are more accurate whenusing phrase boundary hypotheses. We will run three iteration experiments to study thisrelationship.

(a) We will train an initial pitch accent detection model without any phrase boundary in-formation. Next, use hypotheses from this initial model to training a phrase boundarydetection model. Until performance converges or another stopping criterion is met, wewill retrain the pitch accent model using phrase boundary hypotheses, and vice versa.

(b) We will reproduce the previous iteration technique using an initial phrase boundarydetection model instead of an accent detection model.

(c) We will seed the iteration with both pitch accent and phrase boundary detection modelsthat do not hypothesized event features. At each step, we will retrain both models usinghypotheses produced in previous step.

3. Classifier Fusion Classifier fusion techniques operate on the hypotheses produced by a setof classifiers to generate a final hypothesis. Majority voting classifiers are a common ex-ample of classifier fusion, as is the correcting classifier design described in Section 3.2. We

21

Figure 1: Coupled HMM diagram for integrated prosodic event detection

will model the relationship between pitch accent and phrase boundary location by using theoutput of independent prediction models as input to an coupled hidden Markov model. Thestructure of this model appears in Figure 1.

7 ApplicationsIn this chapter, we describe a set of planned and completed applications of prosodic event detectionand classification to spoken language processing tasks. These include the analysis of non-nativeintonation, speech synthesis, story segmentation and extractive speech summarization.

7.1 Applications of Non-native Prosodic Event DetectionDespite the broad communicative contributions of prosodic information, native-like intonation isoften the last aspect of language taught to language learners, if at all. Non-native speakers oftenretain an accent – their native language influencing production of the second language (L2). In ad-dition to the phonemic production differences that contribute to foreign accent, van Els et al. [122]confirmed that suprasegmental information plays a significant role. It has also been shown that,for example, native Mandarin speakers of English speakers produce intonation errors that cannegatively impact communication [120]. Moreover, non-native accents can provide significant dif-ficulty to spoken language processing and understanding systems, which, in general, are trainedon native speech. This is an intended design to satisfy the needs of the majority of users – nativespeakers. However, speech produced with a foreign accent can differ significantly from the nativetraining data used by these systems, leading to degraded performance.

We will explore the differences in placement and production of prosodic events in English bynative speakers of Standard American English (SAE) and native speakers of Mandarin Chinese.In order to study non-native intonation, annotated non-native speech will be required. We intendto collect and annotate at least thirty minutes of speech produced by native speakers of MandarinChinese. In order to compare patterns of prosodic events that are produced by L2 speakers, we

22

will record Mandarin Chinese speakers each reading the same lexical content. By controlling thelexical material, we will be able to observe the phrasing and accenting decisions while controllingfor lexical factors. One subcorpus of BU-RNC comprises speech of six speakers reading identicaltranscripts. This material was collected in order to study speaker variability in the production of“acceptable” prosodic patterns. In the proposed data collection, Mandarin speakers will be askedto read these transcripts. When annotated, this corresponding non-native corpus will allow us tocompare the prosodic patterns used by native and Mandarin speakers of English with the goal ofidentifying those variations that occur within native speech and those that indicate some degree ofnon-nativeness.

We intend to investigate whether native Mandarin speakers produce prosodic events in thesame way native Standard American English (SAE) speakers do. Chen et al. [23] found that nativeMandarin speakers produced stressed words with shorter duration, and that accents produced byfemale native Mandarin speakers contained a greater pitch rise than those produced by femalenative SAE speakers. We will evaluate the difference in productions of both pitch accents andphrase boundaries by detecting prosodic events in non-native speech using models trained on nativespeech. If the performance of these models is significantly different on accented and native speech,the two communities mark prosodic events differently. Similar to Chen et al., the thesis will presentdescriptive analyses of any observed production differences.

These investigations will lead to two applications of the analysis of non-native intonation.

7.1.1 Prosody tutoring system

Language learners are rarely instructed in native intonation in second language classes. A prosodytutoring system would fill this pedagogical gap left by this lack of classroom instruction. Anunderlying engine that is able to distinguish native from non-native productions and placements ofprosodic events is necessary for such a system. Based on our analyses and modeling of non-nativeintonation, we intend to develop a prototype of such an system.

Tutorial system users will be asked to read a prompt, and their intonation will be assessed asnative or non-native. The system described in the thesis will use prompts from the BU-RNC forwhich native and non-native productions will be available. The main task of the tutoring systemwill be to identify non-native prosodic event placement and productions. To accomplish this theposteriors of native and non-native pitch accent and phrase boundary placement and productionmodels will be compared. When a word fits a non-native model better than a corresponding nativemodel, it will be identified as an “error”. To provide diagnostic information to the tutorial systemuser, we will present exemplar-based feedback. When an error is encountered, the user will bepresented with a number of native productions of the target utterance, and directed to the region inwhich a non-native event was detected. Users will also be able to replay their own production tomore easily observe contrast with the native examples.

7.1.2 Accent identification

Accent identification [22, 116, 37] is of particular interest to the automatic speech recognition(ASR) community. Recognizer performance is best when the input speech is as similar as possi-ble to the material used when training the recognizer. One technique used to improve recognitionaccuracy of accented speech is to train distinct recognition models for each accent [37, 73, 55].

23

However, this requires the preprocessing step of accent identification in order to select the appro-priate recognition model.

The problem of identifying the accent of speech is related to the tutoring scenario. In thetutorial setting, individual words are assessed as native (American speakers of English) or non-native (Mandarin Chinese speakers of English) based on their prosody. Extending this to accentclassification, full utterances will be classified as being spoken by a native SAE speaker or a nativeMandarin speaker. In a related task, Tepperman et al. [117] used hypothesized prosodic eventsto assign “pronunciation assessment” scores to speech rating the “nativeness” of a production.Using the mean of posteriors produced by an HMM prosodic event detector near state-of-the-art pronunciation assessment was achieved; the correlation with human judges was 0.331. Thereexists a significant amount of work on foreign-accent classification using phonetic, phonotactic,and lexical models. In this application, however, we will only be evaluating the use of prosodicevent detection and our analysis of non-native intonation in accent identification.

There is a substantial amount of intonational variation even within a community of nativespeakers. Therefore, critical component of this task is distinguishing this variations from the dif-ferences between native and foreign accented speech. The best technique to combine the wordassessments into an utterance assessment is an open question to be addressed by this research. Is itsufficient for a single word to be produced “non-natively” to consider an utterance as non-native?Is the production and placement of every prosodic event equally important? A prosodic eventsequence will be produced using both native and non-native detection and classification models.The task is then to decode this decision sequence and associated confidence scores into a singledecision: native vs. non-native. The preferred decoding technique remains an empirical questionto be addressed by this research.

7.2 Speech SynthesisWe are currently investigating a modification to the IBM text-to-speech (TTS) synthesizer thatwill allow a user to specify a desired accenting pattern along with a transcript for synthesis. TheIBM speech synthesis engine is a unit-selection concatenative synthesizer [43]. When synthesiz-ing a given utterance, the system, based on acoustic and lexical criteria, concatenates tri-phones(thirds of a phone) extracted from a phonetically transcribed corpus. The system is designed suchthat the search parameters – weights given to acoustic and lexical criteria – used when selectinga unit for synthesis can be easily modified. Moreover, the selection corpus can be annotated withadditional information. Additional corresponding search weights can be set in order to guide theunit-selection search based on these additional annotations. To provide control over accenting be-havior, we employ this functionality. First, the corrected energy-based voting pitch accent detector(see Section 3.2) is used to hypothesize pitch accent labels on the unit selection corpus. A subset ofthe IBM text-to-speech corpus has been annotated with ToBI labels, and is used as training mate-rial for this pitch accent detector. When searching for the lexically stressed syllable of an accentedword, those syllables annotated as accented in the corpus are given increased weight and are morelikely to be selected; when synthesizing an phone that should not bear a pitch accent, tri-phonesfrom accented syllables are less likely to be selected.

However, the detection module hypothesizes which words bear pitch accents, while the syn-thesis engine needs to select tri-phones from accented syllables. Thus, the detection hypothesesrequires some modification before annotating the unit selection corpus. This requirement comes

24

from the fact that not every syllable within an accented word realizes the pitch accent. If, say, theword YESTERDAY is accented, the majority of the acoustic realization of the accent falls withinthe YES syllable. If there is no refinement of the annotation, a search for an accented instance ofa DAY syllable may select the deaccented DAY syllable from an accented YESTERDAY instance.Therefore, when an accent is detected within a word, only the tri-phones of the lexically stressedsyllable are annotated with the pitch accent label for the purposes of directing the unit-selectionsearch.

We intend to run two perception studies to demonstrate that annotation of unit selection corporawith prosodic event information can be used to control the accenting of synthesized speech.

1. Synthesizing accented words This study will evaluate if the speech synthesized by the mod-ified TTS engine successfully realizes the requested accenting pattern. To test this, humanToBI labelers will listen to synthesized utterances and mark the accented words. Of par-ticular interest will be the ability to produce unconventional accenting behavior. Prosodicassignment techniques have shown fair accuracy in producing natural sounding intonationon conventional utterances. However, based on discourse context, certain words are moreor less likely to be accented. For example, contrasted words should be accented, and givenwords are more likely to be deaccented. The IBM synthesizer does not have the ability tosynthesize unconventional accenting; we believe that prosodic event detection by the pro-posed modification may be able to contribute to.

2. Synthesizing “appropriate” utterances This study will investigate the ability of the syn-thesizer to synthesize appropriate sounding utterances given a small amount of discoursecontext. Information status and contrastive focus can both influence the appropriate accent-ing pattern of an utterance. Consider the utterance JOHN SAID IT WILL RAIN TOMORROW.Following the sentence WHO SAID IT WILL RAIN TOMORROW? the appropriate accentingis JOHN SAID IT WILL RAIN TOMORROW, while if the previous sentence were DID JOHN

SAY IT WILL BE SUNNY TOMORROW? then JOHN SAID IT WILL RAIN TOMORROW wouldbe more appropriate. In this study, we will synthesize twelve pairs of utterances where twocontext sentences indicate one of two appropriate accent patterns of a common lexical se-quence. These utterances will be synthesized using the standard IBM TTS system with nocontrol over accenting and deaccenting, and using a modified system with the accentingcontrol behavior described above. Subjects will be presented with the context in which thetarget utterance is to appear in text, and asked to rate which synthesized utterance is moreappropriate given the context.

7.3 Story Segmentation7.3.1 Background

Broadcast news (BN) shows are often comprised of many unrelated stories. However, downstreamNLP tasks often expect semantically homogenous input. Story segmentation is the process ofdividing a contiguous document into such semantically homogenous sections. Most approaches tostory segmentation, including those operating on spoken material, have focused on lexical analysis.Of those that consider acoustic properties of the speech, fewer still use prosodic event information.Some approaches, however, have explored the use of intonational phrasing as a cue to story and

25

topic boundary locations. In general, these approaches use intonational phrase boundaries or otheracoustic segmentations to define candidate story or topic boundaries. Passonneau and Litman[83], for example, used prosodic phrase boundaries to define the candidate boundaries for topicsegmentation of spontaneous narrative monologues. Shriberg, et al. [102] used silent regions withduration over 650ms to define candidate boundaries for segmentation of broadcast news – whilethis is different from using intonational phrase boundaries, recall that the most significant acousticcorrelate of intonational phrases is a following pause (see Section 4.2). A number of approacheshave used automatic sentence unit detection to define the set of possible story boundaries [119, 97].

7.3.2 Completed Work

In this section, we describe a study which evaluates the use of intonational phrase boundaries todefine the set of data points for story segmentation of BN. We examine story segmentation perfor-mance using different input segmentations, including hypothesized intonational phrases, to definecandidate boundaries and units of analysis [99]. We examine 1) hypothesized sentences (at threeconfidence score thresholds: 0.5, 0.3, 0.1), 2) pause-based chunking (at two thresholds: 250ms,500ms), 3) hypothesized intonational phrases (hypothesized using J48 decision trees, see Section4.2), and 4) no segmentation, in which each word boundary is a candidate. These experimentsare run on the full TDT-4 corpus [109] as part of the DARPA GALE program. ASR transcripts[108], speaker diarization hypotheses [132] and hypothesized sentence boundaries [74] are avail-able from Columbia’s SRI NIGHTENGALE collaborators. The corpus also includes manuallylabeled, time-aligned story boundaries which are used in training the segmentation models.

The intonational phrase boundary detection model is trained on English speech. However, thismodel is also applied to material spoken in Mandarin and Arabic. Obviously, no claims are beingmade about the reliability of this model, or the consistency of boundary behavior across thesethree languages. Rather, we propose that this is a viable segmentation strategy which may capturea more linguistically meaningful unit than pause-based segmentation.

The segmentation models are binary classifiers (J48 decision trees trained using weka [130])that determine whether a given candidate boundary is also a story boundary or not. Distinct mod-els are trained for each show (e.g. CNN “Headline News”) in order to capture the idiosyncrasiesof each broadcast. The input segmentations define both the candidate boundary points and theregions over which features are extracted. The feature vector contains acoustic features repre-senting pitch, intensity, and speaking rate, lexical features including cue words and LCSeg [38]and TextTiling [44] coefficients, and structural features capturing position within the show andinformation about the hypothesized speaker’s participation in the show. A complete description ofthese features can be found in [97].

Story segmentation results using different input segmentations are shown in Table 13. All re-sults are based on ten-fold cross-validation experiments. We evaluate these using the WindowDiffmeasure [84], an extension of Beeferman’s Pk [4]. WindowDiff is increased for each false alarmand miss in a hypothesized segmentation such that near-errors, where a hypothesized boundary isclose to a target boundary, incur a lesser penalty than more egregious errors. Thus, lower Win-dowDiff scores represent better segmentations. The appropriate window size for applying bothWindowDiff and Pk is approximately one half the length of the average segment, which in theTDT-4 corpus is 215.9 words per story. We thus use a window size of 100.

The story boundary detection models produce a story-boundary/non-story-boundary prediction

26

Word SU thresh SU thresh SU thresh 250ms 500ms Hyp. IPs0.5 0.3 0.1 pause pause

English 0.300 0.357 0.300 0.308 0.298 0.344 0.340Arabic 0.308 0.361 0.318 0.304 0.312 0.419 0.333Mandarin 0.320 0.278 0.258 0.253 0.248 0.295 0.266

Table 13: Story Segmentation Results - (WindowDiff; k=100)

for each input segment. As each input segmentation defines a different data set, we need to insurethat these evaluations are comparable. To do this, we align the every set of input segment-basedpredictions to the word level. This allows us to apply the WindowDiff evaluation technique equiv-alently regardless of input segmentation.

While hypothesized intonational phrases do not define the best input segmentation, they yieldbetter results than hypothesized sentences across all languages. Whether or not story segmentationwould improve along with improvements to intonational phrase detection accuracy remains anopen question. However, this result is notable considering the modest f-measure of detecting phraseboundaries on the TDT-4 corpus (0.641, cf. Section 4.2), and furthermore, the performance of thismodel is untested on Mandarin and Arabic speech.

7.3.3 Proposed Work

We intend to further examine the role hypothesized prosodic events can play in story segmenta-tion. Pitch accent location and type hypotheses will be included into the segmentation model.Often utterances at the start of discourse segments contain a higher rate of accented words thanutterance towards the end of a discourse segment [52]. Therefore, it is expected that segments,however defined, towards the start of a story be accented at a greater rate than those at the end.The effect of different pitch accent type collapsing strategies will also be evaluated. Accenting alexical item may be used to indicate topicality or focus [46]. The lexical identity of accented wordsmay be useful in identifying the topic shift associated with story boundaries. The presence of apreviously unseen word bearing an accent, may be more indicative of topic shift than an unseendeaccented word. We will also investigate the impact of boundary tone and phrase accent classi-fication hypotheses on story segmentation performance. As mentioned in Section 5.1, L-L% andH-H% are believed to indicate a greater degree of finality that L-H% and H-L% [8]. We expectstory boundaries to coincide with prosodic indications of completeness more frequently than thoseof incompleteness.

7.4 Extractive Summarization of Broadcast News7.4.1 Background

Extractive summaries are constructed by selecting portions of a source document for inclusion ina summary [54, 134, 25]. This is analogous to making a collage, where a source is cut up andonly the most relevant sections are pasted together to form a summary. The task is often framedas one of classifying segments as being either included in or excluded from the summary. Whenperforming extractive summarization of text documents, it is common to segment or “cut up” theinput material at either sentence or syntactic constituent boundaries. However, when summarizing

27

speech documents the extraction of sentences and syntactic constituents may not be ideal for anumber of reasons. Speech disfluencies and grammar abnormalities can make identification ofsentence or syntactic unit boundaries difficult even for humans. Moreover, the error rates of state-of-the-art automatic sentence boundary detection and syntactic parsers on speech material remainsignificantly higher than those that operate on text [74]. Because of these concerns, extractivespeech summarization approaches which extract units other than the sentence have been developed.Hori et al. [54] constructed a summary a single word at a time. Kolluru et al. [61] extracted phrasesusing a filtering process by which at a number of stages within the summarization process low-confidence words were removed and significant segments were identified.

7.4.2 Completed Work

In this section, we describe an experiment examining the impact of selecting different units for ex-tractive speech summarization of broadcast news (BN). We hypothesize that 1) intonational phraseboundaries define intonationally salient units better than other segmentation approaches and 2)extraction of these units will lead to improved summarization performance. We use four segmen-tation strategies to define the extractive units for this summarization study: 1) intonational phraseboundaries automatically hypothesized using the J48 decision tree algorithm and the feature vectordescribed in Section 4.2 trained on the annotated TDT-4 material 2) 500ms pause-based acousticchunking, 3) 250ms pause-based acoustic chunking and 4) hypothesized sentence boundaries. Thisresearch was conducted as part of the DARPA GALE project. We therefore were provided withhypothesized sentence boundaries by our collaborators at ICSI [74].

The material used for these experiments is a subset of the TDT-4 corpus [109]. This subsetconsists of 12 CNN “Headline News” broadcasts containing 419 BN stories. A single humanlabeler summarized these stories with a length requirement of less than 30% of the original storylength. The labeler was asked to use material directly from the original story whenever possible.

To generate training data, we annotate the words of ASR transcripts for inclusion or exclusionin the summary. This annotation is performed by automatically aligning each human summary tothe ASR transcript of the show it summarizes. If more than 50% of the words in a given segmentare aligned to the human summary, the segment is labeled for inclusion.

We train summarizers that extract segments based on the four segmentation units using binaryBayesian Network classifiers [58]. We extract features for each segment and inclusion/exclusionlabels based on the previously described alignment procedure. To avoid the impact of ASR tran-scription errors, we use only acoustic and structural features for this classification. Maskey andHirschberg [78] showed that extractive speech summarization is possible without a lexical infor-mation beyond hypothesized word boundaries. Acoustic features are aggregated over candidatesegments. We extract minimum, maximum, standard deviation and mean of f0, ∆ f0, rms intensity(I) and ∆ I, identical features are extracted from speaker normalized (z-score normalization) f0 andI streams. We also calculate the z-score (x−µ

σ) of the maximum and minimum within the segment

over these four acoustic information streams. The feature vector also contained three pitch resetfeatures: the difference of the average of the last N pitch points (calculated at a 10ms fram) inthe current unit and the first N in the following unit, where N∈{1,5,10}. Finally, we include theaverage word length within the segment. Using hypothesized story boundaries [96] and speakerturn boundaries provided by ISCI [132] we extract a set of structural features. These include thelength, absolute and relative start time, and relative position in the speaker turn and story.

28

The results of ten-fold cross validation experiments are presented in Table 14. We evaluatethe experiments in two ways, using F-measure and three ROUGE11 variants. The extraction ofintonational phrases yields the best summarizer whether evaluated using F-measure or ROUGE.The summaries based on intonational phrase extraction yielded an 8.2% F-measure improvementover sentence-based summarizes. There were 2.75 intonational phrases in each sentence, howeverthis improvement is not simply due to the ability to select smaller units. While the pause-basedsummarizer operates on smaller units (1.6 times as many) than the sentence-based summarizer, thesentence-based results are slightly better. When evaluating using ROUGE (ROUGE-1, ROUGE-2,or ROUGE-L) the advantage of intonational phrase extraction is even more pronounced. Intona-tional phrase-based summarization ROUGE scores are considerably higher than those obtained bythe next best summarizer extracting 250ms pause-based segments; improvements of 13.5% usingROUGE-1, 8%, using ROUGE-2 and 14.4% using ROUGE-L are observed. These results confirmthat hypothesized intonational phrases are a viable, even, excellent candidate for the extractivesummarization of broadcast news.

Segmentation Precision Recall F-Measure ROUGE-1 ROUGE-2 ROUGE-L250ms Pause 0.333 0.622 0.432 0.437 0.103 0.415500ms Pause 0.255 0.756 0.381 0.440 0.128 0.412

Sentence 0.362 0.540 0.434 0.394 0.096 0.377Intonational Phrase 0.428 0.650 0.516 0.572 0.183 0.559

Table 14: Summarization Results

7.4.3 Proposed Work

To further investigate the use of prosodic event hypotheses in extractive speech summarization, weintend to evaluate the contribution of pitch accent location and type hypotheses, as well as phrasefinal classifications, to summarization performance. We will extend the feature vector used whentraining the summarizer using posteriors produced by prosodic event detection and classificationmodels. The impact of collapsing pitch accent types and phrase-final tones will be also investi-gated. In some cases the information contained in these predictions may be available to the modelvia the continuous acoustic features that are already included in the feature vector. An importantaspect of these experiments will be the relative performance contributions of raw acoustic featuresand categorical prosodic event predictions.

8 ContributionProsody is a valuable component of the transmission of information via speech. While the prosodicphenomena of accenting and phrasing carry a fraction of the breadth of information conveyed by

11ROUGE [71] is a measure commonly used to evaluate automatic summarization. A number of ROUGE variantsexist, however, all are calculated as the ratio of a word overlap measure to the maximum possible overlap. Forexample, ROUGE-N measures overlap as the count of N-grams in the evaluation (human) summaries that appear inthe automatic summary, while ROUGE-L measures the length of the longest overlapping subsequence.

29

intonation, they represent some of the most well-understood and most thoroughly researched as-pects of the communicative effect of prosody. In this thesis we will describe novel techniques forthe detection and classification of these prosodic events. These techniques will be evaluated on adiverse range of genres and speakers to measure their generalizability. In addition to this technicalcontribution, the thesis will measure the discriminative power of acoustic and lexical features toprosodic event detection and classification. These feature analyses will contribute to the under-standing of the correlates of prosodic events. The thesis will advocate the integration of prosodicevent detection and classification techniques into spoken language processing (SLP) systems bypresenting examples of improved performance due to the availability of automatically extractedprosodic event information. We believe these three contributions – detection and classificationtechniques, feature analysis, and integration with SLP systems – will advance both the under-standing and technical application of a significant element of intonational information – prosodicevents.

9 Plan for completion of the ThesisThis section presents a detailed timeline for completing the thesis (Table 15). Following this sched-ule, the thesis defense will take place in mid February 2009.

Date Area TaskJan. 2008 Corpus Preparation Begin annotating ≥ 30 minutes of TDT-4 BN speech

Corpus Preparation Generate ASR transcripts of all corporaCorpus Preparation Collect non-native dataIntegrated Detection Pilot experimentsSpeech Synthesis Run perception experiments

Feb. 2008 Corpus Preparation Begin annnotating L2 dataPitch Accent Detection Evaluate correcting classifier on all corporaAccent and Phrase Boundary Detection Lexical Experiments

Mar. 2008 Non-native Intonation Descriptive analysis of native v. non-native intonation.Non-native Intonation Detection of intonation errors research

Apr. 2008 Non-native Intonation Accent identification experimentsMay 2008 Non-native Intonation Development of Tutorial SystemJun. 2008 Phrase Boundary Detection Experiments modeling pre-boundary lengthening

Phrase Boundary Detection Feature AnalysesJul. 2008 Extractive Summarization Inclusion of prosodic event hypotheses

Story Segmentation Inclusion of prosodic event hypothesesAug. 2008 Integrated Detection Iterative Detection experiments

Integrated Detection CHMM Classifier Fusion experimentsSep. 2008 Event Type Classification Syllable-level classification

Event Type Classification Evaluation of collapsing type classesOct. 2008 Event Type Classification Speaker clustering experiments

Event Type Classification Model adaptation experimentsNov. 2008 Write Thesismid Jan. 2009 Prepare Defensemid Feb. 2009 Thesis defense

Table 15: Plan for completion of the proposed thesis

30

References[1] S. Ananthakrishnan and S. Narayanan. An automatic prosody recognizer using a coupled

multi-stream acoustic model and a syntactic-prosodic language model. In ICASSP, 2005.

[2] S. Ananthakrishnan and S. Narayanan. Combining acoustic, lexical, and syntactic evidencefor automatic unsupervised prosody labeling. In ICSLP, 2006.

[3] M. Beckman. Stress and non-Stress. Foris Publications, Dordrect, Holland, 1986.

[4] D. Beeferman, A. Berger, and J. Lafferty. Statistical models for text segmentation. MachineLearning, 31(1-3):177–210, 1999.

[5] P. Boersma. Praat, a system for doing phonetics by computer. Glot International, 5(9-10):341–345, 2001.

[6] D. Bolinger. A theory of pitch acent in english. Word, 14:109–149, 1958.

[7] D. Bolinger. Accent is predictable (if you’re a mind-reader). Language, 48, 1972.

[8] D. Bolinger. Intonation and Its Parts: Melody in Spoken English. Stanford University Press,1985.

[9] N. Braunschweiler. Automatic Detection of Prosodic Cues. PhD thesis, University of Kon-stanz, Germany, 2005.

[10] N. Braunschweiler. The prosodizer - automatic prosodic annotations of speech synthesisdatabases. In Speech Prosody, 2006.

[11] L. Breiman. Bagging predictors. Machine Learning, 1996.

[12] J. Brenier, D. Cer, and D. Jurafsky. The detection of emphatic words using acoustic andlexical features. In Eurospeech, 2005.

[13] J. Brenier, A. Nenkova, A. Kothari, L. Whitton, D. Beaver, and D. Jurafsky. The (non)utilityof linguistic features for predicting prominence in spontaneous speech. In IEEE/ACL 2006Workshop on Spoken Language Technology, 2006.

[14] E. Brill. A simple rule-based part-of-speech tagger. In Proceedings of ANLP-92, 3rd Con-ference on Applied Natural Language Processing, pages 152–155, Trento, IT, 1992.

[15] G. Brown. Prosodic structure and the given/new distinction. In A. Cutler and D. Ladd,editors, Prosody: Models and Measurements, pages 67–77. Springer Verlag, Berlin, 1983.

[16] N. Campbell. Multi-Level Timing in Speech. PhD thesis, Sussex University, 1992.

[17] N. Campbell. Loudness, spectral tilt and perceived prominence in dialogues. In ICPhS,1995.

[18] Carnegie Mellon University. Cmu sphinx-4. http://cmusphinx.sourceforge.net.

31

[19] L. Chaolei, L. Jia, and X. Shanhong. English sentence stress detection system based onhmm framework. Applied Mathematics and Computation, 185:759–768, 2007.

[20] F. Chen and M. Withgott. The use of emphasis to automatically summarize a spoken dis-course. In ICASSP, 1992.

[21] K. Chen, M. Hasegawa-Johnson, and A. Cohen. An automatic prosody labeling systemusing ann-based syntactic-prosodic model and gmm-based acoustic-prosodic model. InICASSP, 2004.

[22] T. Chen, C. Huang, E. Chang, and J. Wang. Automatic accent identification using gaussianmixture models. In ASRU, pages 343–346, 2001.

[23] Y. Chen, M. Robb, H. Gilbert, and J. Lerman. A study of sentence stress production inmandarin speakers of american english. Journal of the Acoustical Society of America, 2001.

[24] N. Chomsky and M. Halle. The Sound Pattern of English. Harper and Row, 1968.

[25] H. Christensen, B. Kolluru, Y. Gotoh, and S. Renals. From text summarization to style-specific summarization for broadcast news. In ECIR, 2004.

[26] J. Clark and C. Yallup. Introduction to Phonology and Phonetics. Blackwell, 1990.

[27] A. Conkie, G. Riccardi, and R. Rose. Prosody recognition from speech utterances usingacoustic and linguistic based models of prosodic events. In Eurospeech, 1999.

[28] W. Daelmans, J. Zavrel, K. van der Sloot, and A. van den Bosch. Tibml: Tilburg memorybased learner, version 5.0. Technical report, Universiteit van Tilburg, 2003.

[29] D. Dahan, M. Tanenhaus, and C. Chambers. Accent and reference resolution in spoken-language comprehension. Journal of Memory and Language, 47:292–314, 2002.

[30] J. R. de Pijper and A. Sanderman. On the perceptual strength of prosodic boundaries and itsrelation to suprasegmental cues. Journal of the Acoustical Society of America, 1994.

[31] G. Demenko, S. Grocholewski, A. Wagner, and M. Szymanski. Prosody annotation forcorpus based speech synthesis. In Australian International Conference on Speech Scienceand Technology, 2006.

[32] M. Fach. A comparison between syntactic and prosodic phrasing. In Eurospeech, 1999.

[33] M. Fach and W. Wokurek. Pitch accent classification of fundamental frequency contours byhidden markov models. In Eurospeech, 1995.

[34] G. Fant, A. Kruckenberg, and J. Liljencrants. Acoustic-phonetic analysis of prominencein swedish. In A. Botinis, editor, Intonation, Analysis, Modelling and Technology, pages55–86. Kluwer, 2000.

[35] J. Fon. A Cross-Linguistic Study on Syntactic and Discourse Boundary Cues in SpontaneousSpeech. PhD thesis, Ohio State University, 2002.

32

[36] Y. Freund and R. Schapire. Experiments with a new boosting algorithm. In ICML, 1996.

[37] P. Fung. Fast accent identification and accented speech recognition. In ICASSP, pages221–224, 1999.

[38] M. Galley, K. McKeown, E. Fosler-Lussier, and H. Jing. Discourse segmentation of multi-party conversation. In 41st Annual Meeting of ACL, pages 562–569, July 2003.

[39] M. Gregory and Y. Altun. Using conditional random fields to predict pitch accents in con-versational speech. In ACL, 2004.

[40] M. Grice, M. Reyelt, R. Benzmuller, J. Mayer, and A. Batliner. Consistency in transcriptionand labeling of german intonation with gtobi. In S.-A. Jun, editor, Prosodic Typology.Oxford University Press, 1996.

[41] M. Grice and M. Savino. Can pitch accent type convey information status in yes-no ques-tions. In Concept to Speech Generation Systems, 1997.

[42] J. Gundel. On different kinds of focus. In Focus: Linguistic, Cognitive and ComputationalPerspectives. Cambridge University Press, 1999.

[43] W. Hamza, E. Eide, R. Bakis, M. Picheney, and J. Pitrelli. The ibm expressive speechsynthesis system. In Interspeech, pages 2577–2580, 2004.

[44] M. A. Hearst. Texttiling: Segmenting text into multi-paragraph subtopic passages. Compu-tational Linguistics, 23(1):33–64, 1997.

[45] N. Hedberg. The prosody of contrastive topic and focus in spoken english. In Workshop oninformation structure in context, 2003.

[46] N. Hedberg and J. Sosa. The prosody of topic and focus in spontaneous english dialogue.In Topic and Focus: Cross-Linguistic Perspectives on Meaning and Intonation. Springer,2007.

[47] P. Heeman and J. Allen. The trains 93 dialogues. Technical Report TN94-2, The Universityof Rochester, 1995.

[48] M. Heldner. Spectral emphasis as an additional source of information in accent detection.In Prosody 2001: ISCA Tutorial and Research Workshop on Prosody in Speech Recognitionand Understanding, pages 57–60, 2001.

[49] M. Heldner, E. Stragert, and T. Deschamps. Focus detection using overall intensity and highfrequency emphasis. In ICPhS, 1999.

[50] J. Hirschberg. Pitch accent in context: Predicting intonational prominence from text. Artifi-cial Intelligence, 63(1-2):305–340, 1993.

[51] J. Hirschberg, A. Gravano, A. Nenkova, E. Sneed, and G. Ward. Intonational overload: Usesof hte downstepped (h* !h* l- l%) in read and spontaneous speech. In Proceedings of theNinth Conference on Laboratory Phonology, 2004.

33

[52] J. Hirschberg and C. Nakatani. Acoustic indicators of topic segmentation. In Proc. of ICSLP,volume 4, pages 1255–1258, 1998.

[53] J. Hirschberg and G. Ward. The interpretation of the high-rise question contour in english.Journal of Pragmatics, 1994.

[54] C. Hori, S. Furui, R. Malkin, H. Yu, and A. Waibel. Automatic speech summarizationapplied to english broadcast news. In ICASSP, 2002.

[55] C. Huang, T. Chen, and E. Chang. Accent issue in large vocabulary continuous speechrecognition. International Journal of Speech Technology, 7(2/3):141–153, 2004.

[56] D. Huber. A statistical approach to the segmentation and broad classification of continuousspeech into phrase-sized information units. In ICASSP, 1989.

[57] C. T. Ishi, P. Mokhtari, and N. Campbell. Perceptually-related acoustic-prosodic features ofphrase finals in spontaneous speech. In Eurospeech, 2003.

[58] F. Jensen. Bayesian Networks and Decision Graphs. Springer-Verlag, 2001.

[59] N. Kaiki and Y. Sagisaka. Pause characteristics and local phrase-dependency structure injapanese. In ICSLP, 1992.

[60] G. Kochanski, E. Grabe, J. Coleman, and B. Rosner. Loudness predicts prominence: fun-damental frequency lends little. Journal of the Acoustic Society of America, 118(2):1038–1054, August 2005.

[61] B. Kolluru, Y. Gotoh, and H. Christensen. Multistage compaction approach to broadcastnews summarization. In Interspeech, 2005.

[62] A. Komatsu, E. Ohira, and A. Ichikawa. Prosodical sentence structure inference for naturalconversational speech understanding. In Eurospeech, 1989.

[63] R. Ladd. The structure of intonational meaning. Indiana University Press, 1980.

[64] R. Ladd. Declination ‘reset’ and the hierarchical organization of utterances. Journal of theAcoustical Society of America, 1988.

[65] R. Ladd. Intonational Phonology. Cambridge University Press, 1996.

[66] I. Lehiste. Phonetic disambiguation of syntactic ambiguity. Journal of the Acoustical Societyof America, 1973.

[67] I. Lehiste, J. Olive, and L. Streeter. Role of duration in disambiguating syntactically am-biguous sentences. Journal of the Acoustical Society of America, 1976.

[68] G.-A. Levow. Context in multi-lingual tone and pitch accent recognition. In Interspeech,2005.

34

[69] G.-A. Levow. Unsupervised and semi-supervised learning of tone and pitch accent. InHLT-NAACL, pages 224–231, Morristown, NJ, USA, 2006. Association for ComputationalLinguistics.

[70] G.-A. Levow. Automatic prosodic labeling with conditional random fields and rich acousticfeatures. In IJCNLP, 2008.

[71] C.-Y. Lin. Rouge: A package for automatic evaluation of summaries. In ACL Workshop ofText Summarization, 2004.

[72] J. Liscombe, J. Venditti, and J. Hirschberg. Detecting question turns in spoken tutorialdialogues. In Interspeech, 2006.

[73] Y. Liu and P. Fung. Acoustic and phonetic confusions in accented speech recognition. InInterspeech, 2005.

[74] Y. Liu, E. Shriberg, A. Stolcke, D. Hillard, M. Ostendorf, and M. P. Harper. Enrichingspeech recognition with automatic detection of sentence boundaries and disfluencies. IEEETransactions on Audio, Speech & Language Processing, 14(5):1526–1540, 2006.

[75] N. Macdonald. Duration as a syntactic boundary cue in ambiguous sentences. In ICASSP,1976.

[76] M. Marcus, M. Marcinkiewicz, and B. Santorini. Building a large annotated corpus ofEnglish: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.

[77] E. Marsi, M. Reynaert, A. van den Bosch, W. Daelmans, and V. Hoste. Learning to predictpitch accents and prosodic boundaries in dutch. In ACL, 2003.

[78] S. Maskey and J. Hirschberg. Summarizing speech without text using hidden markov mod-els. In HLT-NAACL, 2006.

[79] C. Nakatani, J. Hirschberg, and B. Grosz. Discourse structure in spoken language: Stud-ies on speech corpora. In AAAI Spring Symposium on Empirical Methods in DiscourseInterpretation and Generation, 1995.

[80] M. O’Malley, D. Kloker, and B. Dara-Abrams. Recovering parentheses from spoken alge-braic expressions. IEEE Transactions on Audio and Electroacoustics, 21(3):217–220, 1973.

[81] M. Ostendorf, P. Price, and S. Shattuck-Hufnagel. The boston university radio news corpus.Technical Report ECS-95-001, Boston University, March 1995.

[82] M. Ostendorf and K. Ross. Multi-level recognition of intonation labels. In Y. Sagisaka,N. Campbell, and N. Higuchi, editors, Computing Prosody: Computational Models for Pro-cessing Spontaneous Speech. Springer, 1997.

[83] R. J. Passonneau and D. J. Litman. Discourse segmentation by human and automated means.Computational Liunguistics, 23(1):103–109, 1997.

35

[84] L. Pevzner and M. Hearst. A critique and improvement of an evaluation metric for textsegmentation. Computational Linguistics, 28(1):19–36, 2002.

[85] J. Pierrehumbert. The phonology and phonetics of English intonation. PhD thesis, MIT,1980.

[86] J. Pierrehumbert. Phonological and phonetic representation. Journal of Phonetics, 18:375–394, 1990.

[87] J. Pierrehumbert and S. Steele. How many rise-fall-rise countours? ISPhS, 1987.

[88] J. Pitrelli, M. Beckman, and J. Hirschberg. Evaluation of prosodic transcription labelingreliability in the tobi framework. In ICSLP, 1994.

[89] J. Platt. Machines using sequential minimal optimization. In B. Schoelkopf and C. Burges,editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1998.

[90] P. Price, M. Ostendorf, S. Shattuck-Hufnagel, and C. Fong. The use of prosody in syntacticdisambiguation. Journal of the Acoustical Society of America, 1991.

[91] P. Price, M. Ostendorf, and C. Wightman. Prosody and parsing. In DARPA Speech NaturalLanguage Workshop, 1989.

[92] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.

[93] V. Rangarajan, S. Narayanan, and S. Bangalore. Acoustic-syntactic maximum entropymodel for automatic prosody labeling. In IEEE-ACL Conference on Spoken Language Tech-nology, 2006.

[94] V. Rangarajan, S. Narayanan, and S. Bangalore. Exploiting acoustic and syntactic featuresfor prosody labeling in a maximum entropy framework. In HLT-NAACL, 2007.

[95] Y. Ren, S.-S. Kim, M. Hasegawa-Johnson, and J. Cole. Speaker-independent automaticdetection of pitch accent. In Speech Prosody, 2004.

[96] A. Rosenberg and J. Hirschberg. On the correlation between energy and pitch accent in readenglish speech. In Interspeech, 2006.

[97] A. Rosenberg and J. Hirschberg. Story segmentation of broadcast news in english, mandarinand arabic. In HLT-NAACL, 2006.

[98] A. Rosenberg and J. Hirschberg. Detecting pitch accent using pitch-corrected energy-basedpredictors. In Interspeech, 2007.

[99] A. Rosenberg and J. Hirschberg. Varying input segmentation for story boundary detectionin english, arabic and mandarin broadcast news. In Interspeech, 2007.

[100] K. Ross and M. Ostendorf. Prediction of abstract prosodic labels for speech synthesis.Computer Speech & Language, 10(3):155–185, 1996.

36

[101] H. Shimodaira and M. Kimura. Accent phrase segmentation using pitch pattern clustering.In ICASSP, 1992.

[102] E. Shriberg, A. Stolcke, D. Hakkani-Tur, and G. Tur. Prosody based automatic segmentationof speech into sentences and topics. Speech Communication, 32(1-2):127–154, 2000.

[103] R. Silipo and S. Greenberg. Prosodic stress revisited: Reassessing the role of fundamentalfrequency. In NIST Speecn Transcription Workshop, 2000.

[104] K. Silverman, M. Beckman, J. Pitrelli, M. Ostendorf, C. Wightman, P. Price, J. Pierrehum-bert, and J. Hirschberg. Tobi: A standard for labeling english prosody. In Proc. of the 1992International Conference on Spoken Language Processing, volume 2, pages 12–16, 1992.

[105] A. M. C. Sluijter and V. J. van Heuven. Spectral balance as an acoustic correlate of linguisticstress. Journal of the Acoustical Society of America, 100(4):2471–2485, 1996.

[106] A. M. C. Sluijter, V. J. van Heuven, and J. J. A. Pacilly. Spectral balance as a cue in theperception of linguistic stress. Journal of the Acoustical Society of America, 101(1):503–513, 1997.

[107] M. Steedman. Structure and intonation. Language, 1991.

[108] A. Stolcke, B. Chen, H. Franco, V. R. R. Gadde, M. Graciarena, M.-Y. Hwang, K. Kirchhoff,A. Mandal, N. Morgan, X. Lei, T. Ng, M. Ostendorf, K. Sonmez, A. Venkataraman, D. Ver-gyri, W. Wang, J. Zheng, and Q. Zhu. Recent innovations in speech-to-text transcriptionat sri-icsi-uw. IEEE Transactions on Audio, Speech & Language Processing, 14(5):1729–1744, 2006.

[109] S. Strassel and M. Glenn. Creating the annotated tdt-4 y2003 evaluation corpus.http://www.nist.gov/speech/tests/tdt/tdt2003/papers/ldc.ppt, 2003.

[110] X. Sun. Pitch accent predicting using ensemble machine learning. In ICSLP, 2002.

[111] A. Syrdal and J. McGorg. Inter-transcriber reliability of tobi prosodic labeling. In ICSLP,2000.

[112] J. ‘t Hart, R. Colier, and A. Cohen. A Perceptual Study of Intonation. Cambridge UniversityPress, 1990.

[113] F. Tamburini. Prosodic prominence detection in speech. In Proc. 7th International Sympo-sium on Signal Processing and its Applications = ISSPA2003, pages 385–388, 2003.

[114] F. Tamburini. Automatic prominence identification and prosodic typology. In Proc. Inter-Speech 2005, pages 1813–1816, 2005.

[115] P. Taylor. Analysis and synthesis of intonation using the tilt model. Journal of the AcousticalSociety of America, 2000.

[116] C. Teixeira, I. Trancoso, and A. Serralheiro. Accent identification. In ICSLP, 1996.

37

[117] J. Tepperman, A. Kazemzadeh, and S. Narayanan. A text-free approach to assessing nonna-tive intonation. In Interspeech, 2007.

[118] K. Ting and I. Witten. Stacking bagged and dagged models. ICML, 1997.

[119] G. Tur, D. Hakkani-Tur, A. Stolcke, and E. Shriberg. Integrating prosodic and lexical cuesfor automatic topic segmentation. Computational Linguistics, 27:31–57, 2001.

[120] A. Tyler. Discourse structure and the perception of incoherence in international teachingassistants’ spoken discourse. TESOL Quarterly, 26(4):713–729, 1992.

[121] C. J. Van Rijsbergen. Information Retrieval, 2nd edition. Dept. of Computer Science,University of Glasgow, 1979.

[122] T. VanEls and K. de Bot. The role of intonation in foreign accent. Modern LanguageJournal, 71:147–155, 1987.

[123] J. Vassiere. Language-independent prosodic features. In A. Cutler and R. Ladd, editors,Prosody: Models and Measurements, pages 53–66. Springer-Verlag, 1983.

[124] A. Waibel. Prosody and Speech Recognition. Morgan Kaufmann Publishers, San Mateo,CA, 1988.

[125] M. Walker, J. Aberdeen, J. Boland, E. Bratt, J. Garofolo, L. Hirschman, A. Le, S. Lee,S. Narayanan, K. Papineni, B. Pellom, J. Polifroni, A. Potamianos, P. Prabhu, A. Rudnicky,G. Sanders, S. Seneff, D. Stallard, and S. Whittaker. Darpa communicator dialog travelplanning systems: The june 2000 data collection. In Eurospeech, 2001.

[126] M. Q. Wang and J. Hirschberg. Automatic classification of intonational phrase boundaries.Computer Speech and Language, 6(2):175–196, 1992.

[127] G. Ward and J. Hirschberg. Implicating uncertainty: The pragmatics of fall-rise intonation.Language, 61:747–776, 1985.

[128] C. Wightman and M. Ostendorf. Automatic labeling of prosodic patterns. IEEE Transac-tions on Speech and Audio Processig, 2(4), October 1994.

[129] C. Wightman, M. Ostendorf, and S. Shattuck-Hufnagel. Segmental durations in the vicinityof prosodic phrase boundaries. Journal of the Acoustic Society of America, 1992.

[130] I. Witten, E. Frank, L. Trigg, M. Hall, G. Holmes, and S. Cunningham. Weka: Practical ma-chine learning tools and techniques with java implementation. In ICONIP/ANZIIS/ANNESInternational Workshop: Emerging Knowledge Engineering and Connectionist-Based In-formation Systems, pages 192–196, 1999.

[131] D. Wolpert. The supervised learning no-free-lunch theorems. In World Conference on SoftComputing, 2001.

[132] C. Wooters, J. Fung, B. Peskin, and X. Anguera. Towards robust speaker segmentation: Theicsi-sri fall 2004 diarization system. In RT-04F Workshop, November 2004.

38

[133] T. Yoon, S. Chavarria, J. Cole, and M. Hasegawa-Johnson. Intertranscriber reliability ofprosodic labeling on telephone conversation using tobi. In ICSLP, 2004.

[134] K. Zechner. Automatic generation of concise summaries of spoken dialogues in unrestricteddomains. In Research and Development in Information Retrieval, 2001.

39

doctoral thesis proposal - columbia universityamaxwell/proposal.pdf · doctoral thesis proposal...

Documents