
Functional Data Analysis as a tool for Phonetic Detail analysis

Michele Gubian

April 19, 2010

1 Introduction

The so-called Phonetic Detail (PD) refers to any audible detail in the speech signal that is consistently used to support or facilitate conveying meaning, where both detail and meaning are to be understood in the broadest sense [1]. PD can be embedded in the use of pitch, duration, voice quality, energy, etc., also in combination, and the time span of its implementation can vary from very short (subphonemic, less than 10 ms) up to a whole sentence and beyond. Conveying meaning with PD is by no means restricted to words or phonemes; PD also signals emotional states and regulates interaction in a conversation (e.g. signalling turn taking). This large diversity in phenomena, time scales, etc. makes it difficult to find a single analysis tool that covers everything.

Here we focus our attention on a recently proposed suite of computational methods collectively known as Functional Data Analysis (FDA) [2, 3, 4]. FDA provides a mathematical framework for performing statistics on datasets whose elements are entire curves (or, more generally, multidimensional trajectories) whose lengths can differ across the dataset. This means that the unfolding in time of one or more (automatically extracted) features (e.g. pitch, formants) that describe some form of PD can be quantitatively analysed across a whole set of samples. In this way, first, dynamic details that are hard to identify and describe quantitatively (e.g. because other, irrelevant variation is superimposed) have a chance to be captured automatically. Second, posited hypotheses on the consistent use of a specific form of PD can be checked for statistical significance.

The present document offers an overview of FDA by means of a case study on prosody. Part of this work has been used in [5]. Another case study on FDA can be found in [6]. First, FDA is briefly introduced below; then the case study offers a playground in which to see FDA tools in action.

2 Functional Data Analysis

In empirical science, data is often collected in the form of sampled functions, usually time series (prices of goods, temperatures, etc.). The process of making inferences from these datasets involves questions like “Is the trend of temperature throughout the year in town A different from that in town B?”. To answer this kind of question, global statistical indexes are usually first extracted from the time series (mean, variance, peaks, etc.), and then multivariate statistical techniques are applied to those indexes (ANOVA, linear regression, etc.). However, sometimes patterns are not easily revealed by simple (scalar) statistical indexes because they reside in the dynamics of the signal over time. Functional Data Analysis (FDA) [2, 3, 4] is a suite of computational techniques that extend classic statistical methods to the function domain, offering the possibility of making quantitative inferences from sets of whole stretches of signal without the intermediate step of extracting statistical indexes, a step that always entails information loss and in practice makes inference on the dynamic traits of signals problematic.

Applying FDA to a set of sampled time series involves two main steps. One is data preparation, which consists in transforming the sampled signals into a functional form, usually by employing basis functions like B-splines and standard least squares fitting, often including a regularization term. In this process, all functions are normalized to the same time interval, in order to be able to compare them across time. In cases where a set of landmarks can be reliably identified in all functions (e.g. a series of peaks with a clear physical interpretation), those landmarks can be used to produce a time-registered version of the whole set of functions, bringing all corresponding landmarks to coincide in (normalized) time. The second part is data analysis. Many techniques from multivariate statistics have been extended to functions, including functional Principal Component Analysis (fPCA) and different versions of functional linear modelling. These will be illustrated directly on the data in the following sections.

3 Case Study Motivation

In Neapolitan Italian, as in other Romance languages, the modality opposition between Question and Statement can be expressed by intonational means alone. The same syntactic structure, with the same lexical content and displaying the same sequence of segments, can be uttered with two different intonational contours which lead to two different pragmatic meanings. A great body of research in the Autosegmental-Metrical framework for the study of the phonology of intonation has shown that contours can be successfully modelled as a sequence of discrete, local events. Perception experiments in this framework are usually based on the manipulation through resynthesis of individual points in the F0 contour. Recent works, though, explore the hypothesis that dynamic properties of F0 contours (e.g. global shapes) can be perceptually relevant too. In this perspective, resynthesis should not be performed on individual points, but rather on longer stretches of signal [7, 8, 9].

4 Material

Two male native speakers of Neapolitan Italian (‘AS’ and ‘SC’) were recorded in a sound proof booth by means of a Roland Edirol UA 25EX sound card connected to a laptop and through a Sennheiser E 835 microphone. Three carrier sentences were used, having the same syllable count and lexical stress positions and containing two accents, also in the same relative positions (‘Milēna lo vuole amāro (?)’ = Milena drinks it (i.e. her coffee) black; ‘Valēria viene alle nōve (?)’ = Valeria arrives at 9; ‘Amēlia dorme da nōnna (?)’ = Amelia sleeps at grandma’s). Each sentence was pronounced five times in Q and five times in S modality by each speaker. Three out of the 2 × 3 × 5 × 2 = 60 utterances were discarded, thus leaving 57 utterances. The beginning and the end of each utterance were marked (i.e. silence was removed), and within this analysis interval the beginning and the end of the two accented vowels were manually marked by the second author of this work, a native speaker of Neapolitan Italian. The duration of each sentence is around 1 second irrespective of the speaker/sentence. F0 was extracted from each utterance using the Praat autocorrelation-based pitch extractor with default parameter settings, which yields F0 values every 10 ms. No particular attention was paid to producing smooth F0 curves in Praat, since smoothing is taken care of in the standard FDA procedure.

5 Data preparation

In order to eliminate uninteresting but heavy variation in the signal due to speaker identity, F0 was first converted into semitones (st), then the time average was subtracted from each sampled pitch curve. Thus what the y axis shows in each plot is the following:

y(t) = F0(t)[st] − F̄0[st]    (1)

where t represents time and x̄ denotes the time average of x across the utterance. Figure 1 shows the raw data before any FDA processing. At first sight it is not clear which of the two main factors, i.e. Q/S modality and speaker identity, is playing the major role, and how. Note also that the labelling information about the accented vowel onsets/offsets is not used yet.
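For concreteness, here is a minimal Python sketch of the normalization in eq. (1). The original work used the FDA suite for R/MATLAB; the 100 Hz reference below is an arbitrary assumption, and it cancels out once the time average is removed.

    import numpy as np

    def f0_to_centered_semitones(f0_hz, ref_hz=100.0):
        """Convert an F0 track (Hz) to semitones and subtract its time average.

        ref_hz is an arbitrary reference; eq. (1) does not depend on it because
        the time average is subtracted afterwards.
        """
        st = 12.0 * np.log2(np.asarray(f0_hz, dtype=float) / ref_hz)
        return st - st.mean()   # y(t) = F0(t)[st] - time average

    # toy usage: a 1 s utterance sampled every 10 ms
    t = np.arange(0.0, 1.0, 0.01)
    f0_hz = 120.0 + 30.0 * np.sin(2 * np.pi * t)    # fake pitch contour in Hz
    y = f0_to_centered_semitones(f0_hz)
    print(round(y.mean(), 6))                       # ~0 by construction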

The first FDA step is to interpolate each raw sampled curve in order to obtain a functional representation of it, i.e. to pass from an ordered sequence

((t1, y1), (t2, y2), … , (tN, yN))

where N is the number of F0 samples for that utterance, to an explicit function y(t), from which in principle the value of y can be obtained at any instant of time, including points in between the original samples. This operation has several purposes. First, it allows the application of FDA tools, which accept functions rather than sampled curves as input. Second, instead of a mere interpolation, what is usually carried out is a smoothing that, if done in a principled way, can reach a good compromise between retaining the smallest details in the signal (overfitting) and a coarse representation that throws away useful information (underfitting). Fig. 2 shows an example of how the raw data was smoothed. Smoothing was performed using the results of an experiment by Xu and Sun [10]; details can be found in the Appendix. Alternatively, one could apply strict mathematical criteria, like generalized cross-validation ([2], Par. 5.4.3), other sources of prior knowledge, or simply eye inspection of many plots like the one in Fig. 2.
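As an illustration of what passing to a functional form means in practice, the following sketch uses SciPy’s penalized smoothing spline, which minimizes a criterion of the same form as eq. (5) in the Appendix (SSE plus λ times a roughness penalty). This is only a stand-in for the B-spline machinery of the FDA suite, and the λ value is purely illustrative.

    import numpy as np
    from scipy.interpolate import make_smoothing_spline

    # y: centered semitone values sampled every 10 ms (fake data here)
    t = np.arange(0.0, 1.0, 0.01)
    rng = np.random.default_rng(0)
    y = 2.0 * np.sin(2 * np.pi * t) + 0.3 * rng.normal(size=t.size)

    # Penalized spline fit: minimizes SSE + lam * integral of the squared
    # second derivative, the same trade-off as eq. (5) in the Appendix.
    f = make_smoothing_spline(t, y, lam=1e-4)

    # The functional representation can be evaluated anywhere, also in between
    # the original samples, and differentiated to obtain pitch-change speed.
    t_fine = np.linspace(0.0, 1.0, 500)
    y_smooth = f(t_fine)
    speed = f.derivative()(t_fine)                  # semitones per second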

Any FDA processing requires that all curves (functions) be defined on the same time interval. This means that, in the absence of any landmarks provided by the annotation, we would simply linearly normalize durations to a common time interval. The total duration of the 57 utterances was compactly distributed, with mean value 1.03 s, standard deviation 0.09 s, and no significant effect of any of the considered factors (modality, speaker, sentence). We therefore decided to ignore duration in the rest of the analysis. If the original durations of the curves were deemed to have an effect, they could be reintroduced later on, e.g. by computing correlations with fPCA coefficients.

The alignment we performed was not just a linear normalization: it made use of the so-called landmark registration tool provided in the FDA suite. This is a nonlinear time warping procedure that applies distortions to the time axis of each curve in such a way as to align all landmarks in time. In our case, four landmarks were available from the manual annotation, i.e. the onset and offset of each of the two accented vowels in the sentence. Those onset/offset points in time were made to coincide in (normalized) time as much as possible while keeping the curve distortion reasonable. The procedure is automatic and will be illustrated in the tutorial (or see [2], Chap. 7). The reason for doing this is that FDA statistics is based on the assumption that a given time instant means the same thing across the set of curves. Of course this cannot be strictly true in many practical cases. Landmark registration makes it possible to refer the curves to points that are deemed to have the same meaning across the set of curves. Landmarks will then be perfectly aligned (in the ideal case), while points close to them are dragged towards them in a smooth way. Landmarks can be determined manually (as in our case) or automatically, but in any case they come from a priori knowledge of the problem at hand (see Discussion). Figs. 3 and 4 show the effect of landmark registration. Note that the landmarks are very well aligned except for a few outliers in the offset of the second accented vowel (4th landmark). Indeed, some landmarks were placed at or even beyond the end of the utterance (which is an illegal position for the landmark registration procedure), so these points probably deserve particular attention. I decided not to force the alignment process further in order to avoid unnecessary distortion.
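The FDA suite implements landmark registration as a smooth monotone time warping; the following is only a crude piecewise-linear stand-in, meant to convey the idea of dragging each curve’s landmarks onto a common set of target positions (all names and values here are hypothetical).

    import numpy as np

    def landmark_register(t_norm, y, landmarks, targets):
        """Piecewise-linear landmark registration: a crude stand-in for the
        smooth monotone time warping used by the FDA suite.

        t_norm    normalized time axis of one curve, in [0, 1]
        y         curve values sampled on t_norm
        landmarks landmark positions of this curve (e.g. vowel on/offsets)
        targets   common positions the landmarks should be moved to
        """
        knots_to = np.concatenate(([0.0], targets, [1.0]))
        knots_from = np.concatenate(([0.0], landmarks, [1.0]))
        # map registered time back to this curve's own time axis, then resample
        warped_t = np.interp(t_norm, knots_to, knots_from)
        return np.interp(warped_t, t_norm, y)

    # toy usage: drag the four landmarks of one curve onto common targets
    t = np.linspace(0.0, 1.0, 101)
    y = np.sin(2 * np.pi * t)
    y_reg = landmark_register(t, y, landmarks=[0.20, 0.35, 0.70, 0.85],
                              targets=[0.25, 0.40, 0.65, 0.80])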

6 Functional Principal Component Analysis (fPCA)

Classic PCA is a way to extract and display the main modes of variation of a set of multidimensional data. Starting from a dataset expressed in its original set of coordinates, a new set of coordinates is found such that, when the data points are expressed (projected) on it, the first projection accounts for the largest part of the variance in the dataset, the second for the second largest part, and so on. Fig. 5 shows an example of PCA applied to a fictitious dataset containing people’s age and salary. The original coordinates are the ‘age’ and ‘salary’ axes in which every data point is expressed. The new set of coordinates (PC1, PC2) captures the “natural” modes of variation of the dataset. In the picture it is visible that the data varies mainly along the PC1 direction, which is an expression of the correlation between age and salary. PC2 instead could be interpreted as relative wealth irrespective of age. Every point in the dataset can now be re-expressed in terms of the PC coordinate system. For example, the point indicated by an arrow gets a negative score on the PC1 axis, meaning that the person is relatively young, and a positive score on the PC2 axis, meaning that s/he is relatively wealthy considering age. Note that PCA finds orthogonal coordinates such that the variation across the dataset can be decomposed into independent descriptive scores, i.e. the projections of the points onto the PC axes. In our example, this means that while a high value of age is more likely to be associated with a high value of salary (i.e. they are positively correlated), a high (positive) value of PC1 does not bring any expectation about the value of PC2 (i.e. they are uncorrelated). While in classic PCA the principal components are vectors of the same dimension as the data vectors, in fPCA the principal components become functions defined on the same time interval as the functional dataset. It is not possible to visualize orthogonality in this case; however, the same concepts exposed in the example above still apply.
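The age/salary example of Fig. 5 can be reproduced on fictitious data with a few lines of Python; the standardization step is an assumption, since the scaling of the toy example is not specified in the text.

    import numpy as np

    # Fictitious age/salary data with correlated variation, as in Fig. 5.
    rng = np.random.default_rng(1)
    age = rng.uniform(20, 65, 200)
    salary = 1000.0 + 40.0 * age + rng.normal(0.0, 300.0, 200)
    X = np.column_stack([age, salary])

    # Standardize, since age and salary live on very different scales
    # (the scaling choice is an assumption of this toy example).
    Z = (X - X.mean(axis=0)) / X.std(axis=0)

    # PCA via the SVD of the centered (standardized) data matrix.
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = s**2 / np.sum(s**2)        # proportion of variance per PC
    scores = Z @ Vt.T                      # (PC1, PC2) coordinates per person

    print(explained)                       # PC1 dominates: age-salary correlation
    print(scores[0])                       # scores of the first person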

fPCA was applied to the 57 landmark-registered pitch contours described above. Fig. 6 shows the first two principal components, which together explain around 78% of the variance in the dataset. For each component, the solid line shows the average signal, i.e. each point is the average of all 57 pitch values at that (normalized) time, while the ‘+’ and ‘−’ curves in each panel represent the effect of adding/subtracting a multiple of one of the principal component functions to/from the average curve, the latter being the same in all panels. In each panel, the four landmarks are also indicated by vertical dashed lines. To draw a parallel with Fig. 5, the average curve corresponds to the origin of the (PC1, PC2) coordinate system, while the ‘+’ curve in the PC1 panel corresponds to a point lying somewhere on the positive side of the PC1 axis, i.e. scoring zero on PC2.
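In practice, fPCA on curves that have been registered and sampled on a common grid can be approximated by ordinary PCA on the matrix of sampled values. The sketch below illustrates this simplified, discretized view (the FDA suite instead works on the basis-function representation), with fake curves standing in for the 57 registered pitch contours.

    import numpy as np

    # Fake registered curves on a common normalized time grid, one row per
    # utterance, standing in for the 57 registered pitch contours.
    rng = np.random.default_rng(2)
    t = np.linspace(0.0, 1.0, 200)
    Y = np.array([np.sin(2 * np.pi * (t + rng.normal(0.0, 0.03))) +
                  rng.normal(0.0, 0.1, t.size) for _ in range(57)])

    mu = Y.mean(axis=0)                    # average curve mu(t)
    U, s, Vt = np.linalg.svd(Y - mu, full_matrices=False)

    pc_curves = Vt[:2]                     # PC1(t) and PC2(t) on the grid
    scores = (Y - mu) @ pc_curves.T        # one (s1, s2) pair per utterance
    explained = (s**2 / np.sum(s**2))[:2]  # variance share of PC1 and PC2

    # '+' and '-' prototypes as in Fig. 6: the mean plus/minus a multiple of a
    # PC curve (two standard deviations of the PC1 scores is a common display
    # choice, not necessarily the one used for the figure).
    c = 2.0 * scores[:, 0].std()
    plus_pc1, minus_pc1 = mu + c * pc_curves[0], mu - c * pc_curves[0]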

PC1 basically expresses the tendency to be ahead or behind in the region of the first accented vowel, and at the same time shows that a late peak in the first accented vowel region is associated with a high peak in the second accented vowel region. PC2 mostly expresses a difference in excursion. Note also that in all cases the PCs stick to the mean in the middle region. This does not mean that there is no variation in that region, but that the variation is not systematic enough to be captured. How does this fPCA picture relate to the factors we are interested in? Fig. 7 shows boxplots describing how the PC scores are distributed with respect to both Q/S modality and speaker identity. It is interesting to see how both speakers seem to use the two systematic variation modes expressed by PC1 and PC2 in a different yet consistent way. Let us start with PC1. Fig. 7(a) shows that PC1 coefficients are higher for questions than for statements for both speakers, but for speaker SC they are generally higher than for speaker AS. PC1 coefficients express how similar a curve is to the ‘+’ or ‘−’ prototypes in the first panel of Fig. 6. Positive values correspond to curves whose peak corresponding to the first accented vowel is realized later, and vice versa for negative values. This means that both speakers tend to realize questions with a later first peak than statements, but those shifts are relative to a characteristic of the speaker, since speaker SC tends to realize those peaks generally later than speaker AS. A similar story holds for PC2 (Fig. 7(b)). Questions consistently present a larger pitch excursion than statements, but this time AS generally uses more excursion than SC does. Fig. 8 shows another way of displaying the distribution of PC scores across the dataset with respect to Q/S modality and speaker identity. Fig. 8(a) shows how easy it is to separate questions from statements just by extracting the first two principal components, even without considering the identity of the speaker. Fig. 8(b) shows how the two speakers express Q/S variation in a different way: SC is more consistent, while AS has more intrinsic variability. The latter corresponds to the wider span of the boxplots for AS in Fig. 7.

7 Functional linear models

In this section we show how to apply one of the FDA extensions of linear models to the functional domain. The model is the following:

y(t) = µ(t) + βS(t) · xS + βQ(t) · xQ + ε(t) (2)

where µ(t) represents the global mean signal, in the same sense as in the previous section, and corresponds to the intercept in classic linear models. xS and xQ are the only predictors and work as indicator variables, e.g. xS = 1 if a sentence is a statement, xS = 0 otherwise. βS(t) and βQ(t) are the extensions of the slope coefficients in classic linear models and are functions of time. The unexplained variation is expressed by ε(t). In words, eq. (2) says that we will try to predict the global shape y(t) of a pitch contour on the basis of the information about its Q/S modality. The model (2) will then predict the curve y(t) = µ(t) + βS(t) if the sentence is a statement, and y(t) = µ(t) + βQ(t) if it is a question (ε(t) cannot be predicted, just like in classic linear modelling). This kind of model may look redundant, in that a single binary variable, say xmod = 0 for statements and 1 for questions, would have been enough. However, this model structure can be extended to categories with more than two values, e.g. parts of speech.
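A pointwise version of model (2) can be sketched as an ordinary regression carried out independently at every point of the registered time grid. Since xS + xQ = 1 for every utterance, the design [1, xS, xQ] is rank deficient, and the sketch below imposes the usual sum-to-zero constraint βS(t) + βQ(t) = 0; this constraint is an assumption about how the redundancy is resolved, and the data are again fake placeholders.

    import numpy as np

    # Fake registered curves and Q/S labels standing in for the real data.
    rng = np.random.default_rng(3)
    t = np.linspace(0.0, 1.0, 200)
    is_statement = np.repeat([True, False], [29, 28])      # 57 utterances
    Y = np.array([np.sin(2 * np.pi * t)
                  + (0.5 if s else -0.5) * np.cos(2 * np.pi * t)
                  + rng.normal(0.0, 0.2, t.size) for s in is_statement])

    # With the sum-to-zero constraint betaS(t) + betaQ(t) = 0, the model
    # reduces to an intercept plus one contrast column (xS - xQ).
    contrast = np.where(is_statement, 1.0, -1.0)
    X = np.column_stack([np.ones(len(Y)), contrast])
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)           # one fit per time point
    mu_t, betaS_t = coef                                   # betaQ(t) = -betaS(t)

    # Pointwise explained variance R^2(t), as in Fig. 9(b).
    resid = Y - X @ coef
    R2_t = 1.0 - resid.var(axis=0) / Y.var(axis=0)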

Fig. 9 shows the results. This linear model basically confirms what fPCA already showed. In Fig. 9(a) the Question curve has a later first peak and uses a larger excursion, while the Statement curve does the opposite. Fig. 9(b) displays the percentage of explained variance, which in FDA becomes a function of time. The areas around the two accented vowels are the ones in which the model is able to explain up to 60–80% of the sample variance. Fig. 9(c) shows pointwise 95% confidence intervals for the three model functions, µ(t), βS(t) and βQ(t). Note that the confidence intervals for the β’s refer to the difference from the mean, not from each other. In a contrast setting, where confidence intervals are computed for the difference βS(t) − βQ(t), those intervals would show higher significance (i.e. lie farther away from y = 0).

As a test, I also applied the same model structure to the case where one wants to predict the speaker rather than the Q/S modality, i.e. the model:

y(t) = µ(t) + βAS(t) · xAS + βSC(t) · xSC + ε(t) (3)

Fig. 10 shows the results. One can see that AS uses a larger excursion than SC (cf. Fig. 7(b)), but there are no evident shifts of the first peak, since these cancel out across modalities (cf. Fig. 7(a)).

8 Discussion

8.1 A priori knowledge

Let us focus on fPCA. The first issue concerns the a priori knowledge that was used and how it influenced the results. The input to the fPCA tool was a set of sampled F0 curves that had previously been (i) smoothed and (ii) landmark-registered. There is some arbitrariness in deciding how much smoothing to apply, and this is one of the steps where a priori knowledge can play a role. In short, one has to decide how much detail is necessary and how much is to be considered noise. I used a method based on experimental measurements (see Appendix), but I do not think that eye inspection and experienced judgement would have led to anything substantially different, since introducing more detail (i.e. smoothing less) would just introduce more ripples that would cancel out in the final PC curves (and likewise in the linear models).

Landmark registration is almost automatic (there is a free parameter inside, but in a case like ours I just let the alignment proceed to a reasonable level). However, the choice of the landmarks is imposed by a priori knowledge. Here the onsets/offsets of the two accented vowels were assumed to be the points in time that matter. Using different landmarks, or no landmarks at all, would produce different results. I have run fPCA on unregistered data: the PC curves look similar to those in Fig. 6, but the high degree of separability in the (PC1, PC2) plane shown in Fig. 8 is severely reduced. The explanation is that the first peak still has a variable position in the set of unregistered curves, but those variations are due both to Q/S modality and to random speech rate variations. On the other hand, introducing more landmarks (e.g. the onsets/offsets of all vowels) would probably make structure (ripples) appear in the middle part (see Fig. 2 for an example of those ripples), since the three analysed sentences contain the same number of syllables (8) and lexical stress in the same positions. That part is now flat simply because the ripples there are not synchronous across the set of curves and therefore cancel out. If there are no systematic differences in that interval, then the mean signal would become rippled, but the PCs would not depart from it in a way that is correlated with Q/S modality. This latter experiment would require additional alignment data. An automatic alignment could be performed with forced alignment (e.g. with HTK or Spraak).

8.2 Power of fPCA

Thus it appears that prior information is indeed playing a key role. Having said that, let us now look at what fPCA was able to show and try to compare it with a more traditional analysis, assuming the same prior information is used, i.e. the position of the onset/offset of the two accented vowels. A simple experiment could have taken the height and time position of the first F0 peak relative to, say, the first vowel onset. This would probably have revealed something interesting close to the result provided by PC1 (you could try it if you want). However, looking at PC1 in Fig. 6 we see that a long range dependency was also revealed, i.e. when the first peak is late the second is higher, and vice versa, and this is more difficult to spot. Moreover, PC2 showed a tendency to use a larger or smaller excursion, which is also not so easily measurable by hand. More generally, if you do not expect those kinds of variation to be there, then you would probably not think of measuring them, while fPCA reveals these patterns automatically. Finally, the results in Fig. 6 are not only of immediate interpretation but include a quantitative counterpart, illustrated in Figs. 7 and 8.

8.3 Reconstruction and re-synthesis with fPCA

A possible application of the fPCA work presented here that goes beyond analysis is re-synthesis for perception studies. Fig. 12 shows how the PC functions and the corresponding scores can be used to reconstruct the original signal. Recall that (f)PCA provides a new set of coordinates that can be used to represent the original data. In our case the “coordinate system” is made of functions of time. The origin is the average signal µ(t), shown in black in Fig. 11 as well as in most of the figures (e.g. the solid curves in Fig. 6). Since PC1 accounts for 64% of the sample variance (Fig. 6), you may start reconstructing the signal by adding to µ(t) the curve PC1(t) (red curve in Fig. 11) multiplied by the appropriate score s1, i.e. the one read off the x axis of Fig. 8. The analogy with Fig. 5 would be to reconstruct (approximate) the point indicated by the arrow with its projection on the PC1 axis. To get a more refined reconstruction, you repeat the procedure with PC2. You can decide to stop adding components wherever it is convenient. In a formula, reconstruction is expressed as follows:

y(t) ≈ µ(t) + Σ_{k=1}^{K} s_k · PC_k(t)    (4)

where K is the number of PCs you decide to use.

Reconstruction is not the only thing you can do using the PCs. Looking at Fig. 8(a), you can imagine, for example, “moving” a point from the Question to the Statement cluster and listening to the resulting effect. You could proceed as follows. Choose a signal to re-synthesize. Starting from its original position in the (PC1, PC2) plane, choose the desired destination position (e.g. from the Question to the Statement cluster). Reconstruct an F0 contour using eq. (4), reverse the landmark registration, recover the original signal length and linearly re-expand the contour to its original time interval, reconvert to unnormalized Hz, generate a set of samples from the functional representation and finally use the Praat F0 synthesizer to apply the new pitch contour to the original signal. All these operations are automatic and require only simple scripting. This idea is developed in [5].
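Eq. (4) itself is a one-liner once the mean curve, the PC curves and the scores are available; the names below refer to the fPCA sketch given earlier, and the score shift is purely illustrative.

    import numpy as np

    def reconstruct(mu, pc_curves, scores):
        """Eq. (4): y(t) ~ mu(t) + sum_k s_k * PC_k(t), on a common time grid.

        mu         average curve
        pc_curves  array of shape (K, n_times) with rows PC_1(t) ... PC_K(t)
        scores     the K scores s_1 ... s_K of the curve to reconstruct
        """
        return mu + np.asarray(scores) @ np.asarray(pc_curves)

    # Moving an utterance in the (PC1, PC2) plane before re-synthesis
    # (mu, pc_curves and scores as in the earlier fPCA sketch; the shift
    # applied to the scores is purely illustrative):
    # new_scores = scores[0] + np.array([-1.5, 0.5])
    # y_new = reconstruct(mu, pc_curves, new_scores)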

Appendices

A Data smoothing

This appendix describes in detail the principle I have used in the smoothing of the dataset analyzed in this work. The example provided here comes from a different dataset. In that case, pitch was computed once every 5 ms (instead of 10 ms) and the duration of the whole utterance was generally shorter than those analyzed in this work, since it was the realization of only one syllable. These differences do not impair the illustration of the adopted smoothing principle in any substantial way. The final value of the smoothing parameter found below, λ = 10⁸, is very different from the one used for the present dataset (λ = 10⁻⁵), but this is mainly due to the signal representation (time in ms vs. s). These kinds of details will be addressed in the upcoming tutorial.


A.1 Using maximum pitch speed empirical models as a smoothing reference

Data interpolation follows the basic guidelines of [2], though with some adaptation. B-splines are used as a function basis and a penalized squared error minimization problem of the following form is solved:

min{SSE + λ · PEN} (5)

where SSE is the sum of squared errors of the fitting curve with respect to the original time samples, PEN is a measure of curve roughness, and λ > 0 is a coefficient that weights the relative importance of the two (see [2] for details). Following de Boor’s theorem (cf. [2], Ch. 5), a spline knot is placed on each time sample. Note that each curve in general has a different duration, and thus a different number of 5 ms samples; it follows that each curve is interpolated separately with its own spline basis. The roughness penalty coefficient λ was not chosen using mean squared error-related empirical methods, like generalized cross-validation ([2], Par. 5.4.3), because the data appeared noisy, with high-frequency content that did not reflect the underlying physical phenomenon (vocal fold vibration). Instead, some results from [10] were used as guidelines to empirically determine how much the pitch curves should be smoothed. In that paper, empirically obtained linear equations relate the pitch excursion of a voluntary gesture to its corresponding maximum achievable pitch change speed. More precisely, given an observed voluntary pitch gesture (a rise or a fall), elicited in such a way that the subject uses the maximum controllable speed of pitch change, two kinds of linear equations can be derived. One has pitch excursion as predictor and average rate of pitch change (i.e. the pitch excursion divided by the time required to achieve it) as dependent variable. The other also has pitch excursion as predictor, but the peak instantaneous pitch change rate as dependent variable. The equations for a rising pitch contour are reported here (cf. [10], Tables VI and VII, line ‘Mean’, column ‘Rise speed’):

ave. speed = 10.8 + 5.6 · excursion (6)

Max. speed = 12.4 + 10.5 · excursion (7)

where both speeds are in semitones/s and excursion is in semitones. It follows that the maximum average or instantaneous speed at which a speaker can voluntarily change his/her pitch depends on the pitch excursion, larger excursions being reachable at a faster speed. This result is exemplified in Fig. 13 for eq. (6).
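As a quick numerical illustration, the following lines evaluate eqs. (6) and (7) for the three excursions shown in Fig. 13.

    # Worked example of eqs. (6) and (7): predicted maximum average and peak
    # pitch-change speeds for the three rising excursions shown in Fig. 13.
    for excursion in (4, 7, 12):                    # semitones
        ave_speed = 10.8 + 5.6 * excursion          # eq. (6), st/s
        peak_speed = 12.4 + 10.5 * excursion        # eq. (7), st/s
        print(excursion, round(ave_speed, 1), round(peak_speed, 1))
    # 4 st  -> ave 33.2 st/s, peak  54.4 st/s
    # 7 st  -> ave 50.0 st/s, peak  85.9 st/s
    # 12 st -> ave 78.0 st/s, peak 138.4 st/s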

We made use of this result to choose the value of the roughness penalty coefficient λ by visually inspecting a subset of pitch contours and their first derivatives in time, i.e. their instantaneous speed. We decided to smooth away a curve feature if it could not plausibly be deemed the effect of a voluntary gesture. Of course, involuntary and fast dynamic features could be present in the signal as well, but we preferred to smooth them out, since they are not easily told apart from measurement errors and, more importantly, keeping a high level of detail in the curve interpolation would blur the overall curve trends of interest. Fig. 14 shows an example of the effect of changing λ in eq. (5) in terms of the fit of the original samples (left panels) and of the plausibility of the corresponding pitch speed contour (both panels). Here, the glitch in the original data occurring at around 150 ms causes the interpolated contours and their respective speed contours to be implausible for λ = 10², 10⁴ and 10⁶. For example, for λ = 10⁶ we see that the glitch is interpolated as a rise followed by a short plateau and then by a second and final rise. The first rise gesture up to the plateau would then be a rise of circa 1 st performed at an average speed of around 25 st/s (judging the left panel by eye) and at a peak speed of around 40 st/s (looking at the right panel). However, eqs. (6) and (7) predict that the average and peak pitch change speeds for a 1 st rising gesture cannot exceed 16 and 25 st/s, respectively. On the other hand, the value λ = 10⁸ yields a more uniform rising contour of around 6 semitones, performed at an average and peak pitch change speed of around 20 and 30 st/s respectively, well below the predicted limits. Since smoothing more (λ > 10⁸) would further increase the fitting error (greater SSE in eq. (5)), in this case we take 10⁸ as the optimal roughness penalty coefficient. After repeating this analysis for several cases, we decided to adopt λ = 10⁸ for all curves.
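The plausibility check described above was carried out by eye; a rough automated counterpart could look like the sketch below, which flags a candidate λ whenever some rising stretch of the smoothed contour would require a speed beyond the limits of eqs. (6) and (7). This is only an approximation of the procedure in the text, and the λ values to try depend on the units chosen for time.

    import numpy as np
    from scipy.interpolate import make_smoothing_spline

    def has_implausible_rise(t, y, lam):
        """Rough automated version of the visual check: fit a smoothing spline
        with penalty lam, cut the contour into rising gestures (between sign
        changes of the speed) and flag any rise whose average or peak speed
        exceeds the limits of eqs. (6) and (7). t in seconds, y in semitones."""
        f = make_smoothing_spline(t, y, lam=lam)
        tt = np.linspace(t[0], t[-1], 1000)
        yy, vv = f(tt), f.derivative()(tt)           # contour (st), speed (st/s)
        idx = np.flatnonzero(np.diff(np.sign(vv)) != 0)
        bounds = np.concatenate(([0], idx, [len(tt) - 1]))
        for a, b in zip(bounds[:-1], bounds[1:]):
            exc = yy[b] - yy[a]
            if exc <= 0:                             # keep only rising gestures
                continue
            ave = exc / (tt[b] - tt[a])
            peak = vv[a:b + 1].max()
            if ave > 10.8 + 5.6 * exc or peak > 12.4 + 10.5 * exc:
                return True
        return False

    # Candidate penalties to compare (values are illustrative and depend on
    # the time units; t and y would come from one sampled utterance):
    # for lam in (1e-6, 1e-4, 1e-2, 1.0):
    #     print(lam, has_implausible_rise(t, y, lam))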

References

[1] R. Carlson and S. Hawkins, “When is phonetic detail a detail?”, Proc. ICPhS XVI, pp. 211–214, 2007.

[2] J. O. Ramsay and B. W. Silverman, Functional Data Analysis, 2nd Ed. Springer, 2005.

[3] ——, Applied Functional Data Analysis: Methods and Case Studies. Springer, 2002.

[4] J. O. Ramsay, G. Hooker, and S. Graves, Functional Data Analysis with R and MATLAB. Springer, 2009.

[5] M. Gubian, F. Cangemi, and L. Boves, “Automatic and data driven pitch contour manipulation with functional data analysis,” submitted to Speech Prosody 2010, Chicago, Illinois, 2010.

[6] M. Gubian, F. Torreira, H. Strik, and L. Boves, “Functional data analysis as a tool for analyzing pronunciation variation: a case study on the French word c’était,” Proceedings of INTERSPEECH 2009, Brighton, UK, 6–10 September 2009, pp. 2199–2202.

[7] M. D’Imperio and D. House, “Perception of questions and statements in Neapolitan Italian,” in G. Kokkinakis, N. Fakotakis and E. Dermatas (eds.), Proceedings of Eurospeech ’97, vol. 1, pp. 251–254, 1997.

[8] M. D’Imperio, “The role of perception in defining tonal targets and their alignment,” PhD thesis, The Ohio State University, 2000.

[9] M. D’Imperio and F. Cangemi, “The interplay between tonal alignment and rise shape in the perception of two Neapolitan rising accents,” selected papers from the PaPI conference (Las Palmas, June 2009), to appear.

[10] Y. Xu and X. Sun, “Maximum speed of pitch change and how it may relate to speech,” J. Acoust. Soc. Am., vol. 111, no. 3, pp. 1399–1413, March 2002.


(a) Q/S modality

(b) Speaker

Figure 1: Raw data, y axis as in (1). Colours are used to highlight (a) the Q/S modality distinction or (b) the speaker identity.


Figure 2: An example of how the raw data was smoothed. Circles represent the original samples, while the continuous line is the result of the smoothing procedure. See the Appendix for details.


(a) Linear time normalization

(b) Landmark registration

Figure 3: (a) Linear time normalization of the whole dataset and subsequent (b) landmark registration. Coloured dots show the position of the four landmarks on each curve.


Figure 4: Boxplots showing the dispersion of landmark positions across the set of curves before and after the application of landmark registration.

Figure 5: An example of classic PCA. A fictitious dataset collects people’s age and salary. The first two PCs are shown, as well as the projection of the point indicated by the arrow onto the PC axes.


Figure 6: Functional PCA applied to the dataset of 57 landmark-registered pitch contours. The solid line shows the average signal, while the ‘+’ and ‘−’ curves represent the effect of adding/subtracting a multiple of the principal component function to/from the average curve. In each panel, the four landmarks are also indicated by vertical dashed lines.


(a) PC1 (b) PC2

Figure 7: Boxplots showing the distribution of (a) PC1 and (b) PC2 scores with respect to Q/S modality and speaker identity (speakers ‘AS’ and ‘SC’).

(a) Q/S modality (b) Q/S modality and speaker

Figure 8: Scatterplots showing the distributions of PC scores with respect to Q/S modality and speaker identity.


(a) Prediction (b) R2(t) (% explained var.)

(c) µ(t) (Intercept) and βS,Q(t) conf. intervals

Figure 9: Results from model (2). In (a) the black curve is the mean, the other two are the predicted pitch contours for the Q/S modalities. In (b) the percentage of explained variance R2(t), which in FDA becomes a function of time. In (c) 95% pointwise confidence intervals for the functions µ(t) and βS,Q(t) (the farther from y = 0, the better).


(a) Prediction (b) R2(t) (% explained var.)

(c) µ(t) (Intercept) and βAS,SC(t) conf. intervals

Figure 10: Results from model (3). In (a) the black curve is the mean, the other two are the predicted pitch contours for speakers AS and SC. In (b) the percentage of explained variance R2(t), which in FDA becomes a function of time. In (c) 95% pointwise confidence intervals for the functions µ(t) and βAS,SC(t).


Figure 11: PCA components.

Figure 12: Reconstruction of one of the curves (black) by using just the mean signal (red), or the first (green), or the first two (blue) principal components. Here ‘original curve’ means the curve after smoothing and landmark registration.


Figure 13: Three rising pitch excursions of 4, 7 and 12 semitones are reached at their respective maximum average speed according to eq. (6). A larger voluntary pitch excursion can be reached at a higher speed [10].


(a) pitch, λ = 10² (b) pitch speed, λ = 10²

(c) pitch, λ = 10⁴ (d) pitch speed, λ = 10⁴

(e) pitch, λ = 10⁶ (f) pitch speed, λ = 10⁶

(g) pitch, λ = 10⁸ (h) pitch speed, λ = 10⁸

Figure 14: The effect of the roughness penalty λ in eq. (5) on pitch speed. Left panels report pitch curves; the y axis is in semitones after removing the time average. Right panels show the corresponding pitch change (speed) curves, expressed in semitones/s.
