
Vocal communication of emotion: A review of research paradigms

Klaus R. Scherer *

Department of Psychology, University of Geneva, 40, Boulevard du Pont d'Arve, CH-1205 Geneva, Switzerland

Abstract

The current state of research on emotion effects on voice and speech is reviewed and issues for future research efforts are discussed. In particular, it is suggested to use the Brunswikian lens model as a base for research on the vocal communication of emotion. This approach allows one to model the complete process, including encoding (expression), transmission, and decoding (impression) of vocal emotion communication. Special emphasis is placed on the conceptualization and operationalization of the major elements of the model (i.e., the speaker's emotional state, the listener's attribution, and the mediating acoustic cues). In addition, the advantages and disadvantages of research paradigms for the induction or observation of emotional expression in voice and speech and the experimental manipulation of vocal cues are discussed, using pertinent examples drawn from past and present research.
© 2002 Elsevier Science B.V. All rights reserved.

Zusammenfassung

This article gives a comprehensive overview of the state of research on how a speaker's emotions influence voice and manner of speaking. In general, it is proposed that research on the vocal communication of emotion be oriented toward Brunswik's lens model. This approach makes it possible to model the entire communication process, from encoding (expression), through transmission, to decoding (impression). Particular attention is devoted to the problems of conceptualizing and operationalizing the central elements of the model (e.g., the emotional state of the speaker, the inference processes of the listener, and the underlying vocal cues). Drawing on selected examples of empirical studies, the advantages and disadvantages of various research paradigms for inducing and observing emotional vocal expression, as well as for experimentally manipulating vocal cues, are discussed.
© 2002 Elsevier Science B.V. All rights reserved.

Résumé

The current state of research on the effect of a speaker's emotions on voice and speech is described and promising approaches for future work are identified. In particular, Brunswik's perceptual ("lens") model is proposed as a paradigm for research on the vocal communication of emotion. This model permits the modeling of the complete process, from encoding (expression) through transmission to decoding (impression). The conceptualization and operationalization of the central elements of the model (the speaker's emotional state, the listener's inference of that state, and the auditory cues) are discussed in detail. In addition, drawing on examples of research in the field, the advantages and disadvantages of different methods for the induction and observation of emotional expression in voice and speech, and for the experimental manipulation of different vocal cues, are reviewed.
© 2002 Elsevier Science B.V. All rights reserved.

*Tel.: +41-22-705-9211/9215; fax: +41-22-705-9219. E-mail address: [email protected] (K.R. Scherer).

0167-6393/02/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved. PII: S0167-6393(02)00084-5

Speech Communication 40 (2003) 227–256. www.elsevier.com/locate/specom


Keywords: Vocal communication; Expression of emotion; Speaker moods and attitudes; Speech technology; Theories of emotion; Evaluation of emotion effects on voice and speech; Acoustic markers of emotion; Emotion induction; Emotion simulation; Stress effects on voice; Perception/decoding

1. Introduction: Modeling the vocal communication of emotion

The importance of emotional expression in speech communication and its powerful impact on the listener has been recognized throughout history. Systematic treatises on the topic, together with concrete suggestions for the strategic use of emotionally expressive speech, can be found in early Greek and Roman manuals on rhetoric (e.g., by Aristotle, Cicero, Quintilian), informing all later treatments of rhetoric in Western philosophy (Kennedy, 1972). Renewed interest in the expression of emotion in face and voice was sparked in the 19th century by the emergence of modern evolutionary biology, due to the contributions by Spencer, Bell, and particularly Darwin (1872, 1998). The empirical investigation of the effect of emotion on the voice started at the beginning of the 20th century, with psychiatrists trying to diagnose emotional disturbances through the newly developed methods of electroacoustic analysis (e.g., Isserlin, 1925; Scripture, 1921; Skinner, 1935).

The invention and rapid dissemination of the telephone and the radio also led to increasing scientific concern with the communication of speaker attributes and states via vocal cues in speech (Allport and Cantril, 1934; Herzog, 1933; Pear, 1931). However, systematic research programs started in the 1960s, when psychiatrists renewed their interest in diagnosing affective states via vocal expression (Alpert et al., 1963; Moses, 1954; Ostwald, 1964; Hargreaves et al., 1965; Starkweather, 1956), non-verbal communication researchers explored the capacity of different bodily channels to carry signals of emotion (Feldman and Rimé, 1991; Harper et al., 1978; Knapp, 1972; Scherer, 1982b), emotion psychologists charted the expression of emotion in different modalities (Tomkins, 1962; Ekman, 1972, 1992; Izard, 1971, 1977), linguists and particularly phoneticians discovered the importance of pragmatic information in speech (Mahl and Schulze, 1964; Trager, 1958; Pittenger et al., 1960; Caffi and Janney, 1994), and engineers and phoneticians specializing in acoustic signal processing started to make use of ever more sophisticated technology to study the effects of emotion on the voice (Lieberman and Michaels, 1962; Williams and Stevens, 1969, 1972). In recent years, speech scientists and engineers, who had tended to disregard pragmatic and paralinguistic aspects of speech in their effort to develop models of speech communication for speech technology applications, have started to devote more attention to speaker attitudes and emotions––often in the interest of increasing the acceptability of speech technology for human users. The conference which motivated the current special issue of this journal (ISCA Workshop on Voice and Emotion, Newcastle, Northern Ireland, 2000) and a number of recent publications (Amir and Ron, 1998; Bachorowski, 1999; Bachorowski and Owren, 1995; Banse and Scherer, 1996; Cowie and Douglas-Cowie, 1996; Erickson et al., 1998; Iida et al., 1998; Kienast et al., 1999; Klasmeyer, 1999; Morris et al., 1999; Murray and Arnott, 1993; Mozziconacci, 1998; Pereira and Watson, 1998; Picard, 1997; Rank and Pirker, 1998; Sobin and Alpert, 1999) testify to the lively research activity that has sprung up in this domain. This paper attempts to review some of the central issues in empirical research on the vocal communication of emotion and to chart some of the promising approaches for interdisciplinary research in this area.

I have repeatedly suggested (Scherer, 1978, 1982a) basing theory and research in this area on a modified version of Brunswik's functional lens model of perception (Brunswik, 1956; Gifford, 1994; Hammond and Stewart, 2001). Since the detailed argument can be found elsewhere (see Kappas et al., 1991; Scherer et al., in press), I will only briefly outline the model (shown in the upper part of Fig. 1, which represents the conceptual level). The process begins with the encoding, or expression, of emotional speaker states by certain voice and speech characteristics amenable to objective measurement in the signal. Concretely, the assumption is that the emotional arousal of the speaker is accompanied by physiological changes that will affect respiration, phonation, and articulation in such a way as to produce emotion-specific patterns of acoustic parameters (see (Scherer, 1986) for a detailed description). Using Brunswik's terminology, one can call the degree to which such characteristics actually correlate with the underlying speaker state ecological validity. As these acoustic changes can serve as cues to speaker affect for an observer, they are called distal cues (distal in the sense of remote or distant from the observer). They are transmitted, as part of the speech signal, to the ears of the listener and perceived via the auditory perceptual system. In the model, these perceived cues are called proximal cues (proximal in the sense of close to the observer).

There is some uncertainty among Brunswikians as to exactly how to define and operationalize proximal cues in different perceptual domains (see (Hammond and Stewart, 2001) for the wide variety of uses and definitions of the model). While Brunswik apparently saw the proximal stimulus as close to but still outside of the organism (Hammond and Stewart, 2001), the classic example given for the distal–proximal relationship in visual perception––the juxtaposition between object size (distal) and retinal size (proximal)––suggests that the proximal cue, the pattern of light on the retina, is already inside the organism. Similarly, in auditory perception, the fundamental frequency of a speech wave constitutes the distal characteristic that gives rise to the pattern of vibration along the basilar membrane and, in turn, the pattern of excitation along the inner hair cells, the consequent excitation of the auditory neurons, and, finally, its representation in the auditory cortex. Any phase in this input, transduction, and coding process could be considered a proximal representation of the distal stimulus. I believe that it makes sense to extend this term to the neural representation of the stimulus information as coded by the respective neural structures. The reason is twofold: (1) it is difficult to measure the raw input (e.g., vibration of the basilar membrane), and thus one could not systematically study this aspect of the model in relation to others; (2) the immediate input into the inference process, which Brunswik called cue utilization, is arguably the neural representation in the respective sensory cortex. Thus, the proximal cue for fundamental frequency would be perceived pitch. While fraught with many problems, we do have access to measuring at least the conscious part of this representation via self-report (see Bänziger and Scherer, 2001).

Fig. 1. A Brunswikian lens model of the vocal communication of emotion.

One of the most important advantages of the model is that it highlights the fact that objectively measured distal characteristics are not necessarily equivalent to the proximal cues they produce in the observer. While the proximal cues are based on (or mimic) distal characteristics, the latter may be modified or distorted by (1) the transmission channel (e.g., distance, noise) and (2) the structural characteristics of the perceptual organ and the transduction and coding process (e.g., selective enhancement of certain frequency bands). These issues are discussed in somewhat greater detail below.

The decoding process consists of the inference of speaker attitudes and emotions based on internalized representations of emotional speech modifications, the proximal cues. The fit of the model can be ascertained by operationalizing and measuring each of its elements (see the operational level in Fig. 1), based on the definition of a finite number of cues. If the attribution obtained through listener judgments corresponds (with better than chance recognition accuracy) to the criterion for speaker state (e.g., intensity of a certain emotion), the model describes a functionally valid communication process. However, the model is also extremely useful in cases in which attributions and criteria do not match, since it permits determination of the missing or faulty link in the chain. Thus, it is possible that the respective emotional state does not produce reliable externalizations in the form of specific distal cues in the voice. Alternatively, valid distal cues might be degraded or modified during transmission and perception in such a fashion that they no longer carry the essential information when they are proximally represented in the listener. Finally, it is possible that the proximal cues reliably map the valid distal cues but that the inference mechanism, i.e., the cognitive representation of the underlying relationships, is flawed in the respective listener (e.g., due to lack of sufficient exposure or inaccurate stereotypes). To my knowledge, there is no other paradigm that allows one to examine the process of vocal communication in as comprehensive and systematic a fashion. This is why I keep arguing for the utility of basing research in this area explicitly on a Brunswikian lens model.
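To make these links concrete, here is a minimal numerical sketch of how the model's elements can be operationalized as correlations. All variable names, coefficients, and noise levels are invented for illustration; they are not taken from any study cited here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # hypothetical number of speaker samples

# Simulated chain: speaker state -> distal cue (e.g., measured mean F0)
# -> proximal cue (e.g., rated pitch) -> listener attribution.
state = rng.normal(size=n)
distal = 0.8 * state + rng.normal(scale=0.6, size=n)
proximal = 0.9 * distal + rng.normal(scale=0.4, size=n)
attribution = 0.7 * proximal + rng.normal(scale=0.5, size=n)

def r(x, y):
    """Pearson correlation between two 1-D arrays."""
    return np.corrcoef(x, y)[0, 1]

print("ecological validity (state ~ distal):     ", round(r(state, distal), 2))
print("proximal fidelity   (distal ~ proximal):  ", round(r(distal, proximal), 2))
print("cue utilization     (proximal ~ attrib.): ", round(r(proximal, attribution), 2))
print("functional validity (state ~ attribution):", round(r(state, attribution), 2))
```

A broken link anywhere in the chain (weak externalization, degraded transmission, or a flawed inference structure) shows up as a low correlation at the corresponding stage, which is precisely the diagnostic use of the model described above.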

Few empirical studies have sought to model a specific communication process by using the complete lens model, mostly due to the investment of time and money required (but see Gifford, 1994; Juslin, 2000). In an early study on the vocal communication of speaker personality, trying to determine which personality traits are reliably indexed by vocal cues and correctly inferred by listeners, I obtained natural speech samples in simulated jury discussions with adults (German and American men) for whom detailed personality assessments (self- and peer-ratings) had been obtained (see Scherer, 1978, 1982a). Voice quality was measured via expert (phonetician) ratings (distal cues), and personality inferences (attributions) were measured via lay listeners' ratings (based on listening to content-masked speech samples). In addition, a different group of listeners was asked to rate the voice quality with the help of a rating scale with natural-language labels for vocal characteristics (proximal cues). Path analyses were performed to test the complete lens model. This technique consists of a systematic series of regression analyses to test the causal assumptions in a model containing mediating variables (see Bryman and Cramer, 1990, pp. 246–251). The results for one of the personality traits studied are illustrated in Fig. 2. The double arrows correspond to the theoretically specified causal paths. Simple and dashed arrows correspond to non-predicted direct and indirect effects that explain additional variance. The graph shows why extroversion was correctly recognized from the voice: (1) extroversion is indexed by objectively defined vocal cues (ecologically valid in the Brunswikian sense), (2) these cues are not too drastically modified in the transmission process, and (3) the listeners' inference structure in decoding mirrors the encoding structure. These conditions were not met in the case of emotional stability. While there were strong inference structures, shared by most listeners (attributing emotional stability to speakers with resonant, warm, low-pitched voices), these do not correspond to an equivalent encoding structure (i.e., there was no relationship between voice frequency and habitual emotional stability in the sample of speakers studied; see (Scherer, 1978) for the data and further discussion).

Fig. 2. Two-dimensional path analysis model for the inference of extroversion from the voice (reproduced from Scherer, 1978). The dashed line represents the direct path, the double line the postulated indirect paths, and the single lines the indirect paths compatible with the model. Coefficients shown are standardized coefficients except for rCD1 and rCD2, which are Pearson rs (the direction shown is theoretically postulated). R²s are based on all predictors from which paths lead to the variable. *p < 0.05; **p < 0.01.
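The regression logic of such a path analysis can be sketched as a chain of standardized regressions. The sketch below is self-contained and uses simulated data mirroring the extroversion example; all numbers are invented, and a real analysis would include several cues per block and significance tests for each path.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 150  # hypothetical number of speakers

# Simulated trait -> distal cue -> proximal cue -> attribution chain.
trait = rng.normal(size=n)
distal = 0.7 * trait + rng.normal(scale=0.7, size=n)
proximal = 0.8 * distal + rng.normal(scale=0.5, size=n)
attribution = 0.6 * proximal + rng.normal(scale=0.6, size=n)

def standardized_beta(x, y):
    """Standardized weight of a single-predictor regression (equals Pearson r)."""
    xz = (x - x.mean()) / x.std()
    yz = (y - y.mean()) / y.std()
    beta, *_ = np.linalg.lstsq(np.column_stack([np.ones(len(xz)), xz]), yz, rcond=None)
    return beta[1]

# One regression per postulated causal path; the indirect (mediated) effect
# is the product of the path coefficients.
b_encode = standardized_beta(trait, distal)
b_transmit = standardized_beta(distal, proximal)
b_decode = standardized_beta(proximal, attribution)
print("indirect effect:", round(b_encode * b_transmit * b_decode, 2))
```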

Clearly, a similar approach could be used with emotions rather than personality traits as speaker characteristics. Unfortunately, so far no complete lens model has been tested in this domain. Yet this type of model is useful as a heuristic device to design experimental work in this area, even if only parts of the model are investigated. The curved arrows in Fig. 1 indicate some of the major issues that can be identified with the help of the model. These issues, and the evidence available to date, are presented below.

2. A review of the literature

2.1. Encoding studies

The basis of any functionally valid communication of emotion via vocal expression is that different types of emotion are actually characterized by unique patterns or configurations of acoustic cues. In the context of a Brunswikian lens model, this means that the identifiable emotional states of the sender are in fact externalized by a specific set of distal cues. Without such distinguishable acoustic patterns for different emotions, the nature of the underlying speaker state could not be communicated reliably. Not surprisingly, then, a relatively large number of empirical encoding studies have been conducted over the last six decades, attempting to determine whether the elicitation of emotional speaker states produces corresponding acoustic changes. These studies can be classified into three major categories: natural vocal expression, induced emotional expression, and simulated emotional expression.

2.1.1. Natural vocal expression

Work in this area has made use of material recorded during naturally occurring emotional states of various sorts, such as dangerous flight situations for pilots, journalists reporting emotion-eliciting events, affectively loaded therapy sessions, or talk and game shows on TV (Johannes et al., 2000; Cowie and Douglas-Cowie, 1996; Duncan et al., 1983; Eldred and Price, 1958; Hargreaves et al., 1965; Frolov et al., 1999; Huttar, 1968; Kuroda et al., 1979; Niwa, 1971; Roessler and Lester, 1976, 1979; Simonov and Frolov, 1973; Sulc, 1977; Utsuki and Okamura, 1976; Williams and Stevens, 1969, 1972; Zuberbier, 1957; Zwirner, 1930). The use of naturally occurring voice changes in emotionally charged situations seems the ideal research paradigm, since it has very high ecological validity. However, there are some serious methodological problems. Voice samples obtained in natural situations, often only for a single speaker or a very small number of speakers, are generally very brief and not infrequently suffer from bad recording quality. In addition, there are problems in determining the precise nature of the underlying emotion and the effect of regulation (see below).

2.1.2. Induced emotions

Another way to study the vocal effects of emotion is to experimentally induce specific emotional states in groups of speakers and to record speech samples. A direct way of inducing affective arousal and studying its effects on the voice is the use of psychoactive drugs. Thus, Helfrich et al. (1984) studied the effects of antidepressive drugs on several vocal parameters (compared to placebo) over a period of several hours. Most induction studies have used indirect paradigms that include stress induction via difficult tasks to be completed under time pressure, the presentation of emotion-inducing films or slides, or imagery methods (Alpert et al., 1963; Bachorowski and Owren, 1995; Bonner, 1943; Havrdova and Moravek, 1979; Hicks, 1979; Karlsson et al., 1998; Markel et al., 1973; Plaikner, 1970; Roessler and Lester, 1979; Scherer, 1977, 1979; Scherer et al., 1985; Skinner, 1935; Tolkmitt and Scherer, 1986). While this approach, generally favoured by experimental psychologists because of the degree of control it affords, does result in comparable voice samples for all participants, there are a number of serious drawbacks. Most importantly, these procedures often produce only relatively weak affect. Furthermore, in spite of using the same procedure for all participants, one cannot necessarily assume that similar emotional states are produced in all individuals, precisely because of the individual differences in event appraisal mentioned above. Recently, Scherer and his collaborators, in the context of a large-scale study on emotion effects in automatic speaker verification, have attempted to remedy some of these shortcomings by developing a computerized induction battery for a variety of different states (Scherer et al., 1998).

2.1.3. Simulated (portrayed) vocal expressions

This has been the preferred way of obtaining emotional voice samples in this field. Professional or lay actors are asked to produce vocal expressions of emotion (often using standard verbal content), based on emotion labels and/or typical scenarios (Banse and Scherer, 1996; Bortz, 1966; Coleman and Williams, 1979; Costanzo et al., 1969; Cosmides, 1983; Davitz, 1964; Fairbanks and Pronovost, 1939; Fairbanks and Hoaglin, 1941; Fonagy, 1978, 1983; Fonagy and Magdics, 1963; Green and Cliff, 1975; Höffe, 1960; Kaiser, 1962; Kienast et al., 1999; Klasmeyer and Sendlmeier, 1997; Klasmeyer and Meier, 1999; Klasmeyer and Sendlmeier, 1999; Klasmeyer, 1999; Kotlyar and Morozov, 1976; Levin and Lord, 1975; Paeschke et al., 1999; Scherer et al., 1972a,b, 1973; Sedlacek and Sychra, 1963; Sobin and Alpert, 1999; Tischer, 1993; van Bezooijen, 1984; Wallbott and Scherer, 1986; Whiteside, 1999; Williams and Stevens, 1972). There can be little doubt that simulated vocal portrayal of emotions yields much more intense, prototypical expressions than are found for induced states or even natural emotions (especially when the latter are likely to be highly controlled; see (Scherer, 1986, p. 159)). However, it cannot be excluded that actors overemphasize relatively obvious cues and miss more subtle ones that might appear in natural expression of emotion (Scherer, 1986, p. 144). It has often been argued that emotion portrayals reflect sociocultural norms or expectations more than the psychophysiological effects on the voice as they occur under natural conditions. However, it can be argued that all publicly observable expressions are to some extent "portrayals" (given the social constraints on expression and unconscious tendencies toward self-presentation; see (Banse and Scherer, 1996)). Furthermore, since vocal portrayals are reliably recognized by listener-judges (see below), it can be assumed that they reflect at least in part "normal" expression patterns (if the two were to diverge too much, the acted version would lose its credibility). However, there can be little doubt that actors' portrayals are influenced by conventionalised stereotypes of vocal expression.

2.1.4. Advantages and disadvantages of methods

As shown above, all of the methods that have been used to obtain vocal emotion expression samples have both advantages and disadvantages. In the long run, it is probably the best strategy to look for convergences between all three approaches in the results. Johnstone and Scherer (2000, pp. 226–227) have reviewed the converging evidence with respect to the acoustic patterns that characterize the vocal expression of major modal emotions. Table 1 summarizes their conclusions in synthetic form.

Much of the consistency in the findings is linked to differential levels of arousal or activation for the target emotions. Indeed, in the past, it has often been assumed that, contrary to the face, which is capable of communicating qualitative differences between emotions, the voice could only mark physiological arousal (see (Scherer, 1979, 1986) for a detailed discussion). That this conclusion is erroneous is shown by the fact that judges are almost as accurate in inferring different emotions from vocal as from facial expression (see below). There are two main reasons why it has been difficult to demonstrate qualitative differentiation of emotions in acoustic patterns apart from arousal: (1) only a limited number of acoustic cues has been studied, and (2) arousal differences within emotion families have been neglected.

As to (1), most studies in the field have limited the scope of acoustic measurement to F0, energy, and speech rate. Only very few studies have looked at the frequency distribution in the spectrum or formant parameters. As argued by Scherer (1986), it is possible that F0, energy, and rate may be most indicative of arousal, whereas qualitative, valence differences may have a stronger impact on source and articulation characteristics.

As to (2), most investigators have tended to view a few basic emotions as rather uniform and homogeneous, as suggested by the writings of discrete emotion theory (see Scherer, 2000a). More recently, the concept of emotion families has been proposed (e.g., Ekman, 1992), acknowledging the fact that there are many different kinds of anger, of joy, or of fear. For example, the important vocal differences between the expression of hot anger (explosive rage) and of cold, subdued or controlled anger may explain some of the lack of replication of results concerning the acoustic patterns of anger expression in the literature. The interpretation of such discrepancies is rendered particularly difficult by the fact that researchers generally do not specify which kind of anger has been produced or portrayed (see Scherer, 1986). Thus, different instantiations or variants of specific emotions, even though members of the same family, may vary appreciably with respect to their acoustic expression patterns.

Banse and Scherer (1996) have conducted an encoding study, using actor portrayal, which attempted to remedy both of these problems. They used empirically derived scenarios for 14 emotions, ten of which consisted of pairs of two members of the same emotion family for five modal emotions, varying in arousal. In addition to the classical set of acoustic parameters, they measured a number of indicators of energy distribution in the spectrum. Banse and Scherer interpret their data as clearly demonstrating that these states are acoustically differentiated with respect to quality (e.g., valence) as well as arousal. They underline the need for future work that addresses the issue of subtle differences between the members of the same emotion family and that uses an extensive set of different acoustic parameters, pertinent to all aspects of the voice and speech production process (respiration, phonation, and articulation).

Table 1
Synthetic compilation of the review of empirical data on acoustic patterning of basic emotions (based on Johnstone and Scherer, 2000). Parameters covered: intensity, F0 floor/mean, F0 variability, F0 range, sentence contours, high-frequency energy, and speech and articulation rate; emotions covered: stress, anger/rage, fear/panic, sadness, joy/elation, and boredom.

One of the major shortcomings of the encoding studies conducted to date has been the lack of theoretical grounding, most studies being motivated exclusively by the empirical detection of acoustic differences between emotions. As in most other areas of scientific inquiry, an atheoretical approach has serious shortcomings, particularly in not being able to account for lack of replication and in not allowing the identification of the mechanism underlying the effects found. It is difficult to blame past researchers for the lack of theory, since emotion psychology and physiology, the two areas most directly concerned, have not been very helpful in providing tools for theorizing. For example, discrete emotion theory (Ekman, 1972, 1992; Izard, 1971, 1977; see Scherer, 2000a), which has been the most influential source of emotion conceptualization in this research domain, has specified the existence of emotion-specific neural motor patterns as underlying the expression. However, apart from this general statement, no detailed descriptions of the underlying mechanisms or concrete hypotheses have been forthcoming (except to specify patterns of facial expression––mostly based on observation; see (Wehrle et al., 2000)).

It is only with the advent of componential models, linked to the development of appraisal theories, that there have been attempts at a detailed hypothetical description of the patterns of motor expression to be expected for specific emotions (see Scherer, 1984, 1992a; Roseman and Smith, 2001). For the vocal domain, I have proposed a comprehensive theoretical grounding of emotion effects on vocal expression, based on my component process model of emotion, yielding a large number of concrete predictions as to the changes in major acoustic parameters to be expected for 14 modal emotions (Scherer, 1986). The results from studies that have provided the first experimental tests of these predictions are described below.

In concluding this section, it is appropriate to point out the dearth of comparative approaches that examine the relative convergence between empirical evidence on the acoustic characteristics of animal, particularly primate, vocalizations (see (Fichtel et al., 2001) for an impressive recent example) and human vocalizations. Such comparisons would be all the more appropriate since there is much evidence on the phylogenetic continuity of affect vocalization, at least for mammals (see Hauser, 2000; Scherer, 1985), and since we can assume animal vocalizations to be direct expressions of the underlying affect, largely free of control or self-presentation constraints.

2.2. Decoding studies

Work in this area examines to what extent listeners are able to infer actual or portrayed speaker emotions from content-free speech samples. Research in this tradition has a long history and has produced a large body of empirical results. In most of these studies, actors have been asked to portray a number of different emotions by producing speech utterances with standardized or nonsense content. Groups of listeners are given the task of recognizing the portrayed emotions. They are generally required to indicate the perceived emotion on rating sheets with standard lists of emotion labels, allowing the computation of the percentage of stimuli per emotion that were correctly recognized. A review of approximately 30 studies conducted up to the early 1980s yielded an average accuracy percentage of about 60%, which is about five times higher than what would be expected if listeners had guessed in a random fashion (Scherer, 1989). Later studies reported similar levels of average recognition accuracy across different emotions, e.g., 65% in a study by van Bezooijen (1984) and 56% in a study by Scherer et al. (1991).

One of the major drawbacks of all of these studies is that they tend to study discrimination (deciding between alternatives) rather than recognition (recognizing a particular category in its own right). In consequence, it is essential to correct the accuracy coefficients for guessing by taking the number of given alternatives and the empirical response distribution in the margins into account (Wagner, 1993). Most of the studies reviewed above have corrected the raw accuracy percentages to account for the effects of guessing. Unfortunately, there is no obvious way to compute a single correction factor (see Banse and Scherer, 1996). In the area of facial expression research, Ekman (1994) has suggested using different comparison levels of chance accuracy for positive and negative emotions, given that there are usually fewer positive than negative emotion exemplars in the facial stimulus sets used. While this is also true for most vocal stimulus sets, there may be a less clear-cut positive–negative distinction for acoustic parameters as compared to facial muscle actions (see Banse and Scherer, 1996; Scherer et al., 1991).
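As a minimal illustration of one such correction, the sketch below rescales a raw hit rate against the guessing level implied by k equally likely alternatives. This is a simplification: the corrections discussed by Wagner (1993) also take the empirical response distribution into account.

```python
def chance_corrected_accuracy(p_raw: float, k: int) -> float:
    """Rescale raw accuracy so that 0 corresponds to guessing (1/k) and 1 to
    perfect recognition; assumes k equally weighted response alternatives."""
    chance = 1.0 / k
    return (p_raw - chance) / (1.0 - chance)

# E.g., 60% raw accuracy with 10 response alternatives:
print(round(chance_corrected_accuracy(0.60, 10), 2))  # 0.56
```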

As in the case of facial expression, vocal expressions of emotion by members of one culture are, on the whole, recognized with better than chance accuracy by members of other cultures (see Frick, 1985; van Bezooijen et al., 1983; Scherer, 1999b). Recently, Scherer et al. (2001a) reported data from a large-scale cross-cultural study (using German actor portrayals and listener-judges from nine countries in Europe, Asia, and the US) showing an overall accuracy percentage of 66% across all emotions and countries. These findings were interpreted as evidence for the existence of universal inference rules from vocal characteristics to specific emotions across cultures, particularly since the patterns of confusion were very similar across all countries, including Indonesia. Yet the data also suggest the existence of culture- and/or language-specific paralinguistic patterns in vocal emotion expression since, in spite of the use of language-free speech samples, recognition accuracy decreased with increasing dissimilarity of the language spoken by the listeners from German (the native language of the actors).

One of the issues that remain to be explored in greater detail is the differential ease with which different emotions can be decoded from the voice. In this context, it is interesting to explore whether the frequencies of confusion are comparable for vocal and facial emotion expressions. Table 2 shows a comparison of accuracy figures for vocal and facial portrayal recognition obtained in a set of studies that examined a comparable range of emotions in both modalities (the data in this table, adapted from (Scherer, 1999b), are based on reviews by Ekman (1994), for facial expression, and Scherer et al. (2001a), for vocal expression). As this compilation shows, there are some major differences between emotions with respect to their recognition accuracy scores. In addition, there are major differences between facial and vocal expression. While joy is almost perfectly recognized in the face, listeners seem to have trouble identifying this emotion unequivocally in the voice. In the voice, sadness and anger are generally best recognized, followed by fear. Disgust portrayals in the voice are recognized barely above chance level. Johnstone and Scherer (2000) have attempted to explain the origin of these differences between facial and vocal expression on the basis of evolutionary pressure towards accurate vocal communication for different emotions. Preventing others from eating rotten food by emitting disgust signals may be most useful to conspecifics eating at the same place as the signaler. In this context, facial expressions of disgust are most adequate. In addition, they are linked to the underlying functional action (or action tendency) of regurgitating and blocking unpleasant odors. According to Darwin, such adaptive action tendencies ("serviceable habits") are at the basis of most expressions (see Scherer, 1985, 1992a). In contrast, in encountering predators, there is clear adaptive advantage in being able to warn friends (in fear) or threaten foes (in anger) over large distances, something for which vocal expression is ideally suited. The differences in the recognizability of the vocal expression of different emotions suggest that it is profitable to analyze the recognizability of these emotions separately.

As is evident in Table 2, the recognition accuracy for vocal expressions is generally somewhat lower than that for facial expressions. On the whole, in reviewing the evidence from the studies to date, one can conclude that the recognition of emotion from standardized voice samples, using actor portrayals, attains between 55% and 65% accuracy, about five to six times higher than what would be expected by chance. In comparison, facial expression studies generally report an average accuracy percentage around 75%. There are many factors that are potentially responsible for this difference, including, in particular, the inherently dynamic nature of vocal stimuli, which is less likely to produce stable patterns, contrary to facial expression, where basic muscle configurations seem to identify the major emotions (Ekman, 1972, 1994), reinforced by the frequent use of static stimulus material (photos). Another important factor could be that members of a similar emotion family (see above) are more distinct from each other in vocal as compared to facial expression. For example, the important vocal differences between the expression of exuberant joy and quiet happiness may explain some of the lack of replication of results concerning the recognition of joy that is observed in the literature (since researchers generally do not specify which kind of joy has been used; see (Scherer, 1986)). In the above-mentioned study by Banse and Scherer (1996), two members of the same emotion family for five modal emotions (i.e., 10 out of a total of 14 target emotions) with variable levels of activation or arousal were used. Listeners were asked to identify the expressed emotion for a total of 224 selected tokens, yielding an average accuracy level of 48% (as compared to 7% expected by chance if all 14 categories were weighted equally). If the agreement between families is computed for those emotions where there were two variants (yielding 10 categories), the accuracy percentage increases to 55% (as compared to 10% expected by chance). This indicates that, as one might expect, most of the confusions for these emotion categories occurred within families.
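The family-level rescoring just described amounts to collapsing a confusion matrix before computing accuracy. The sketch below shows the mechanics on an invented four-category example with two families; the labels follow family pairs mentioned in the text, but the counts are made up and are not the Banse and Scherer (1996) data.

```python
import numpy as np

labels = ["hot_anger", "cold_anger", "elation", "happiness"]
family_of = {"hot_anger": "anger", "cold_anger": "anger",
             "elation": "joy", "happiness": "joy"}

# Hypothetical confusion matrix: rows = portrayed emotion, cols = judged emotion.
conf = np.array([[50, 30, 10, 10],
                 [25, 55, 10, 10],
                 [ 5,  5, 45, 45],
                 [ 5,  5, 35, 55]])

fine_acc = np.trace(conf) / conf.sum()  # accuracy over fine-grained categories

# Collapse rows and columns into families, then recompute accuracy.
families = sorted(set(family_of.values()))
idx = {f: i for i, f in enumerate(families)}
coarse = np.zeros((len(families), len(families)))
for i, row in enumerate(labels):
    for j, col in enumerate(labels):
        coarse[idx[family_of[row]], idx[family_of[col]]] += conf[i, j]
family_acc = np.trace(coarse) / coarse.sum()

print(f"category accuracy: {fine_acc:.2f}, family accuracy: {family_acc:.2f}")
```

With these invented counts, accuracy rises from 0.51 at the category level to 0.85 at the family level, the same qualitative pattern as the 48% to 55% increase reported above.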

So far, the discussion has focussed on accuracy, overall or emotion-specific. However, the confusions listeners make are arguably even more interesting, as errors are not randomly distributed and as the patterns of misidentification provide important information on the judgment process. Thus, confusion matrices should be reported as a matter of course, which is generally not the case for older studies and, unfortunately, also for many more recent studies. The research by Banse and Scherer (1996), reported above, can serve as an example of the utility of analyzing the confusion patterns in recognition data in great detail. These authors found that each emotion has a specific confusion pattern (except for disgust portrayals, which were generally confused with almost all other negative emotions). As mentioned above, there are many errors between members of the same emotion family (for example, hot anger is confused consistently only with cold anger and contempt). Other frequent errors occur between emotions with similar valence (for example, interest is confused more often with pride and happiness than with the other eleven emotions taken together). Johnstone and Scherer (2000) have suggested that confusion patterns can help to identify the similarity or proximity between emotion categories, taking into account quality, intensity, and valence. It is likely that emotions similar on one or more of these dimensions are easier to confuse.

Table 2
Accuracy (in %) of facial and vocal emotion recognition in studies in Western and Non-Western countries (reproduced from Scherer, 2001)

                         Neutral  Anger  Fear  Joy  Sadness  Disgust  Surprise  Mean
Facial/Western/20                  78     77    95    79       80       88       78
Vocal/Recent Western/11    74      77     61    57    71       31                62
Facial/Non-Western/11              59     62    88    74       67       77       65
Vocal/Non-Western/1        70      64     38    28    58                         52

Note: Empty cells indicate that the respective emotions have not been studied in these regions. Numbers following the slash in column 1 indicate the number of countries studied.

In addition, analysis of similarities or differences in the confusion patterns between different populations of listeners can be profitably employed to infer the nature of the underlying inference processes. For example, Banse and Scherer (1996) found that not only the average accuracy per emotion but also the patterns of confusion were highly similar between human judge groups and two statistical categorization algorithms (discriminant analysis and bootstrapping). They interpreted this as evidence that the judges are likely to have used inference mechanisms that are comparable to optimized statistical classification procedures. In a different vein, as reported above, Scherer et al. (2001a,b) used the fact that confusion patterns were similar across different cultures as an indication of the existence of universal inference rules. This would not have been possible on the basis of comparable accuracy coefficients alone, since judges might have arrived at these converging judgments in different ways. Clearly, then, it should be imperative for further work in this area to routinely report confusion matrices.
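One simple way to quantify how similar two confusion patterns are, whether across judge populations or between judges and a statistical classifier, is to correlate the row-normalized matrices. The sketch below shows this; both matrices are invented for illustration.

```python
import numpy as np

def confusion_similarity(a, b):
    """Pearson correlation between two row-normalized confusion matrices."""
    an = a / a.sum(axis=1, keepdims=True)
    bn = b / b.sum(axis=1, keepdims=True)
    return np.corrcoef(an.ravel(), bn.ravel())[0, 1]

judges = np.array([[40.0, 8, 2], [6, 38, 6], [4, 10, 36]])
classifier = np.array([[42.0, 6, 2], [8, 36, 6], [2, 12, 36]])
print(round(confusion_similarity(judges, classifier), 2))
```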

In closing this section, I will address an issue that is often misunderstood––the role of stereotypes. It is often claimed that decoding studies using actor portrayals are fundamentally flawed, since actors supposedly base their portrayals on cultural stereotypes of what certain emotions should sound like. These stereotypes are expected to match those that listeners use in decoding the expression intention. Since the term "stereotype" has a very negative connotation, implying that the respective beliefs or inference rules are false, it is commonly believed that stereotypes are "bad". This is not necessarily true, though, and many social psychologists have pointed out that stereotypes often have a kernel of truth. This is one of the reasons why they are shared by many individuals in the population. Stereotypes are often generalized inference rules, based on past experience or cultural learning, that allow the cognitive economy of making inferences on the basis of very little information (Scherer, 1992b). Thus, unless the inference system is completely unrelated to reality, i.e., if there is no kernel of truth, neither the socially shared nature nor the tendency toward overgeneralization is problematic. On the contrary, it is a natural part of the human information processing system. It is intriguing to speculate on the origin and development of such stereotypes. Are they based on self- or other-observation, or both? Do they represent the average of many different emotional expression incidents with many different intensities, or are they representative of extreme exemplars (in terms of intensity and prototypicality)? What is the role of media descriptions? One of the potential approaches to such issues consists in studying how the ability to decode affect from the voice develops in the child (see Morton and Trehub, 2001). Obviously, the answers to these questions have important implications for research in this area.

It is a different issue, of course, whether actor portrayals that use such stereotypical inference structures provide useful information for decoding studies. Clearly, if the portrayals are based largely or exclusively on socially shared inference rules, then the utility of the finding that listeners can recognize these portrayals is limited to increasing our understanding of the nature of these inference rules (and, in the case of cross-cultural comparison, the degree of their universality). However, it is by no means obvious that all actor portrayals are based on shared inference rules. If actors use auto-induction methods, e.g., Stanislavski or method-acting procedures, as they are often directed to do by researchers in this area, or if they base their portrayals on auto-observation of their own emotional expressions, the distal cues they produce may be almost entirely independent of the shared inference rules used by observers. In this case they can be treated as a valid criterion for speaker state in the sense of the Brunswikian lens model. Since it is difficult to determine exactly how actors have produced their portrayals, especially since they may well be unaware of the mechanisms themselves, one needs to combine natural observation, induction, and portrayal methods to determine the overlap of the expressive patterns found (see below).

2.3. Inference studies

While the fundamental issue studied by decoding studies is the ability of listeners to recognize the emotional state of the speaker, inference studies are more concerned with the nature of the underlying voice–emotion inference mechanism, i.e., which distal cues produce certain types of emotion inferences in listeners. As one might expect, the procedure of choice in such studies is to systematically manipulate or vary the acoustic cues in sample utterances in such a way as to allow determination of the relative effect of particular parameters on emotion judgment. Three major paradigms have been used in this area: (1) cue measurement and regression, (2) cue masking, and (3) cue manipulation via synthesis.

(1) Cue measurement and statistical association. The simplest way to explore the utilization of distal cues in the inference process is to measure the acoustic characteristics of vocal emotion portrayals (acted or natural) and then to correlate these with the listeners' judgments of the underlying emotion or attitude of the speaker. Such studies generate information on which vocal characteristics are likely to have determined the judges' inferences (Scherer et al., 1972a,b; van Bezooijen, 1984). Banse and Scherer (1996), in the study reported above, used multiple regression analysis to regress the listeners' judgments on the acoustic parameters measured. The data showed that a large proportion of the variance in judgments of emotion was explained by a set of about 9–10 acoustic measures, including mean F0, standard deviation of F0, mean energy, duration of voiced periods, proportion of energy up to 1000 Hz, and spectral drop-off. Table 3 (modified from Banse and Scherer, 1996, Table 8, p. 632) shows the detailed results. The direction (i.e., positive or negative) and strength of the beta-coefficients indicate the information value and relative importance of each parameter. The multiple correlation coefficient R in the last row indicates how much of the variance in the listener judgments is accounted for by the variables entered into the regression model.
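A sketch of this regression paradigm on simulated data follows. The feature names mirror the abbreviations used in Table 3, but the data and the "true" weights are invented, so the output is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 120  # hypothetical number of portrayals
features = ["MF0", "SdF0", "MElog", "DurVo", "HammI", "PE1000", "DO1000"]

X = rng.normal(size=(n, len(features)))          # measured acoustic parameters
# Invented generating model: ratings driven mainly by mean F0 and mean energy.
ratings = 0.5 * X[:, 0] + 0.3 * X[:, 2] + rng.normal(scale=0.8, size=n)

# Standardize, then regress listener ratings on the acoustic parameters.
Xz = (X - X.mean(axis=0)) / X.std(axis=0)
yz = (ratings - ratings.mean()) / ratings.std()
design = np.column_stack([np.ones(n), Xz])
beta, *_ = np.linalg.lstsq(design, yz, rcond=None)

R = np.corrcoef(design @ beta, yz)[0, 1]  # multiple correlation coefficient
for name, b in zip(features, beta[1:]):
    print(f"{name:>7}: beta = {b:+.2f}")
print(f"multiple R = {R:.2f}")
```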

While such results are interesting and useful, this method is limited by the nature of the voice samples that are used. Obviously, if certain parameters do not vary sufficiently, e.g., because actors did not produce a sufficiently extreme rendering of a certain emotion, the statistical analysis is unlikely to yield much of interest. In addition, the results are largely dependent on the precision of parameter extraction and transformation.

Table 3
Multiple correlation coefficients and beta-weights resulting from regressing judges' emotion ratings on residual acoustical parameters. Rows: MF0, SdF0, MElog, DurVo, HammI, PE1000, DO1000, and the multiple correlation R; columns: the 14 emotions listed below.
MF0: mean fundamental frequency, SdF0: standard deviation of F0, MElog: mean energy, DurVo: duration of voiced periods, HammI: Hammarberg index, PE1000: proportion of voiced energy up to 1000 Hz, DO1000: slope of voiced spectral energy above 1000 Hz; HAng: hot anger, CAng: cold anger, Pan: panic fear, Anx: anxiety, Des: despair, Sad: sadness, Ela: elation, Hap: happiness, Int: interest, Bor: boredom, Sha: shame, Pri: pride, Dis: disgust, Con: contempt.
*p ≤ 0.05. **p ≤ 0.01. ***p ≤ 0.001.

(2) Cue masking. In this approach, verbal/vocal cues are masked, distorted, or removed from vocal emotion expressions (from real life or produced by actors) to explore the ensuing effects on emotion inferences on the part of judges listening to the modified material. This technique was used early on, particularly in the form of low-pass filtering, by many pioneers in this research domain (Starkweather, 1956; Alpert et al., 1963). More recent studies by Scherer et al. (1984, 1985) and Friend and Farrar (1994) have applied different masking techniques (e.g., filtering, randomized splicing, playing backwards, pitch inversion, and tone-silence coding), each of which removes different types of acoustic cues from the speech signal or distorts certain cues while leaving others unchanged. For example, while low-pass filtering removes intelligibility and changes the timbre of the voice, it does not affect the F0 contour. In contrast, randomized splicing of the speech signal (see Scherer et al., 1985) removes all contour and sequence information from speech but retains the complete spectral information (i.e., the timbre of the voice).

The systematic combination of these techniques allows one to isolate the acoustic cues that carry particular types of emotional information. Intelligibility of verbal content is removed by all of these procedures, permitting the use of natural, "real life" speech material without regard for the cue value of the spoken message. For example, in the study by Scherer et al. (1984), highly natural speech samples from interactions between civil servants and citizens were used. Using a variety of masking procedures made it possible to determine which type of cues carried what type of affective information (with respect to both emotions and speaker attitudes). Interestingly, judges were still able to detect politeness when the speech samples were masked by all methods jointly. The data further showed that voice quality parameters and F0 level covaried with the strength of the judged emotion, conveying emotion independently of the verbal context. In contrast, different intonation contour types interacted with sentence grammar in a configurational manner and conveyed primarily attitudinal information.
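Two of the masking techniques described above can be sketched in a few lines. The cutoff frequency and segment duration below are illustrative choices, and the random array stands in for a recorded speech signal.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

fs = 16000                                        # sampling rate in Hz
x = np.random.default_rng(2).normal(size=2 * fs)  # stand-in for a speech signal

def lowpass(signal, cutoff_hz, fs):
    """Low-pass filtering: removes intelligibility and changes timbre while
    leaving the F0 contour intact."""
    sos = butter(4, cutoff_hz, btype="lowpass", fs=fs, output="sos")
    return sosfiltfilt(sos, signal)

def randomized_splicing(signal, segment_ms, fs, rng):
    """Randomized splicing: cut the signal into short segments and reassemble
    them in random order, destroying contour and sequence information while
    retaining the spectral content."""
    seg = int(fs * segment_ms / 1000)
    pieces = [signal[i:i + seg] for i in range(0, len(signal), seg)]
    rng.shuffle(pieces)
    return np.concatenate(pieces)

masked_lp = lowpass(x, 400, fs)
masked_rs = randomized_splicing(x, 250, fs, np.random.default_rng(3))
```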

(3) Cue manipulation via synthesis. The development of synthesis and copy-synthesis methods in the domain of speech technology has provided researchers in the area of vocal expression of emotion with a remarkably effective tool, allowing them to systematically manipulate vocal parameters in order to determine the relative effect of these changes on listener judgment. In a very early study, Lieberman and Michaels (1962) produced variations of F0 and envelope contour and measured the effects on emotion inference. Scherer and Oshinsky (1977) used a MOOG synthesizer and systematically varied, in a complete factorial design, amplitude variation, pitch level, contour and variation, tempo, envelope, harmonic richness, tonality, and rhythm in sentence-like sound sequences and musical melodies. Their data showed the relative strength of the effects of individual parameters, as well as of particular parameter configurations, on emotion attributions by listeners. More recent work (Abadjieva et al., 1995; Breitenstein et al., 2001; Burkhardt and Sendlmeier, 2000; Cahn, 1990; Carlson, 1992; Carlson et al., 1992; Heuft et al., 1996; Murray and Arnott, 1993; Murray et al., 1996) shows great promise, as also demonstrated in other papers in this issue. Not surprisingly, much of this work is performed with the aim of adding emotional expressivity to text-to-speech systems.

Copy synthesis (or resynthesis) techniques allow one to take neutral natural voices and systematically change different cues via digital manipulation of the sound waves. In a number of studies by Scherer and his collaborators, this technique has been used to systematically manipulate F0 level, contour variability and range, intensity, duration, and accent structure of real utterances (Ladd et al., 1985; Tolkmitt et al., 1988). Listener judgments of apparent speaker attitude and emotional state for each stimulus showed strong direct effects for all of the variables manipulated, with relatively few effects due to interactions between the manipulated variables. Of all the variables studied, F0 range affected the judgments the most, with narrow F0 range signaling sadness and wide F0 range being judged as expressing high-arousal, negative emotions such as annoyance or anger. High speech intensity was also interpreted as a sign of negative emotions or aggression. Fast speech led to inferences of joy, with slow speech judged as a mark of sadness. Only very minor effects of speaker and utterance content were found, indicating that the results are likely to generalize over different speakers and utterances.
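The F0-range manipulation can be illustrated conceptually by rescaling an extracted F0 contour around its median, widening or narrowing the excursions. In practice, the contour would come from a pitch tracker and the audio would be resynthesized with PSOLA-type tools (e.g., Praat); the array below is invented for the example.

```python
import numpy as np

def scale_f0_range(f0_contour, factor):
    """Scale F0 excursions around the median of the voiced frames.
    factor > 1 widens the range (judged as high-arousal, e.g., anger);
    factor < 1 narrows it (judged as, e.g., sadness)."""
    f0 = np.asarray(f0_contour, dtype=float)
    voiced = f0 > 0                     # convention: 0 marks unvoiced frames
    median = np.median(f0[voiced])
    out = f0.copy()
    out[voiced] = median + factor * (f0[voiced] - median)
    return out

contour = np.array([0, 180, 210, 240, 200, 170, 0])  # Hz, hypothetical frames
print(scale_f0_range(contour, 1.5))  # widened F0 range
print(scale_f0_range(contour, 0.5))  # narrowed F0 range
```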


It is to be expected that the constant improvement of speech synthesis will provide the research community with ever more refined and natural-sounding algorithms that should allow a large number of systematic variations of acoustic parameters pertinent to emotional communication. One of the developments most eagerly awaited by researchers in this domain is the possibility of using different glottal source characteristics for the manipulation of voice timbre in formant synthesis. There have been some recent reports on work in that direction (Burkhardt and Sendlmeier, 2000; Carlson, 1992; Lavner et al., 2000).

2.4. Transmission studies

The Brunswikian lens model acknowledges the need to separately model the transmission of the distal signals from the sender to the receiver/listener, where they are represented on a subjective, proximal level. Most studies in the area of the vocal communication of emotion have considered this voyage of the sound waves from the mouth of the speaker to the ear or brain of the listener as a ‘‘quantité négligeable’’ and have given little or no consideration to it. However, a Brunswikian approach, trying to systematically model all aspects of the communication process, forces the researcher to separately model this element since it may be responsible for errors in inference. This would be the case, for example, if the transmission process systematically changed the nature of the distal cues. In this case, the proximal cues could not be representative of the characteristics of the distal cues at the source. A well-known example is the filtering of the voice over standard telephone lines. Given the limited frequency range of the transmission, much of the higher frequency band of the voice signal is eliminated. The effect on the listener, i.e., the systematic biasing of the proximal cues, is to hear the pitch of the voice as lower than it is––with the well-known ensuing effect of overestimating the age of the speaker at the other end (often making it impossible to tell mother and daughter apart). Scherer et al. (in press) discuss two of the contributing factors in detail: (1) the transmission of sound through space and electronic channels, and (2) the transform functions in perception, as determined by the nature of human hearing mechanisms. Below, the issues will be briefly summarized.

(1) The transmission of vocal sound through physical space is affected by many environmental and atmospheric factors, including the distance between sender and receiver (weakening the signal), the presence of other sounds such as background noise (disturbing the signal), or the presence of natural barriers such as walls (filtering the signal). All of these factors will lessen the likelihood that the proximal cues can be a faithful representation of the distal cues with respect to their acoustic characteristics. Importantly, knowledge about such constraints on the part of the speaker can lead to a modification of the distal cues by way of an effort to modify voice production to offset such transmission effects. For example, a speaker may shout to produce more intense speech, requiring more vocal effort and greater articulatory precision, if communicating over large distances. Greater vocal effort, in turn, will affect a large number of other acoustic characteristics related to voice production at the larynx. Furthermore, the distance between speaker and listener may have an effect on posture, facial expression and gestures, all of which are likely to have effects on the intensity and spectral distribution of the acoustic signal (see Laver, 1991).

Just as distance, physical environment, and atmospheric conditions do in the case of immediate voice transmission, the nature of the medium may have potentially strong effects on proximal cues in the case of mediated transmission by intercom, telephone, Internet, or other technical means. The band restrictions of the line, as well as the quality of the encoding and decoding components of the system, can systematically affect many aspects of the distal voice cues by disturbing, masking, or filtering, and thus render the proximal cues less representative of the distal cues. While much of this work has been done in the field of engineering, little has been directly applied to modeling the process of the vocal communication of emotion (but see Baber and Noyes, 1996). It is to be hoped that future research efforts in the area will pay greater attention to this important aspect of the speech chain.
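A band-limited transmission channel of this kind is easy to simulate, which makes it possible to test directly how proximal cues diverge from distal ones. The sketch below applies the classic analog telephone band of roughly 300–3400 Hz; the filter order and file names are illustrative assumptions.

```python
# Hedged sketch: simulate narrow-band telephone transmission to study
# how band limiting biases the proximal cues.
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt

def telephone_channel(samples, sr):
    """Apply a 300-3400 Hz band-pass, the classic telephone band."""
    sos = butter(6, [300.0, 3400.0], btype="bandpass", fs=sr, output="sos")
    return sosfiltfilt(sos, samples.astype(float))

sr, y = wavfile.read("utterance.wav")  # hypothetical recording
wavfile.write("utterance_telephone.wav", sr,
              telephone_channel(y, sr).astype(np.int16))
# Acoustic parameters can then be extracted from both versions and compared.
```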

(2) Another factor that is well known but generally neglected in research on vocal communication is the transformation of the distal signal through the transfer characteristics of the human hearing system. For example, the perceived loudness of voiced speech signals correlates more strongly with the amplitude of a few harmonics or even a single harmonic than with its overall intensity (Gramming and Sundberg, 1988; Titze, 1992). In addition, listeners seem to use an internal model of the specific spectral distribution of harmonics and noise in loud and soft voices to infer the vocal effort with which a voice was produced. Klasmeyer (1999, p. 112) has shown that this judgment is still reasonably correct when both loud and soft voices are presented at the same perceived loudness level (the soft voice having more than six times higher overall intensity than the loud voice). Similar effects can be shown for the perception of F0 contours and the perceived duration of a spoken utterance (see Scherer et al., in press, for further detail).

Given space limitations, I could only very briefly touch upon the issue of voice perception, which is centrally implicated in the distal–proximal relationship. Recent research paints an increasingly complex picture of this process, which is characterized by a multitude of feedforward and feedback processes between modalities (e.g., visual–vocal), levels of processing (e.g., peripheral–central), and input versus stored schemata. There is, for example, the well-documented phenomenon of the effects of facial expression perception on voice perception (e.g., Borod et al., 2000; de Gelder, 2000; de Gelder and Vroomen, 2000; Massaro, 2000). Things are further complicated by the fact that voice generally carries linguistic information, with all that this entails with respect to the influence of phonetic-linguistic categories on perception. A particularly intriguing phenomenon is the perception and interpretation of prosodic features, which is strongly affected by centrally stored templates. There is currently an upsurge in neuropsychological research activity in this area (e.g., Pell, 1998; Shipley-Brown et al., 1988; Steinhauer et al., 1999; see also Scherer et al., in press).

The fact that the role of voice sound transmission and the transform functions specific to the human hearing system have rarely been studied in this area of research is regrettable not only for theoretical reasons (preventing one from modeling the communication process in its entirety) but also because one may have missed parameters that are central in the differentiation of the vocal expression of different emotions. Thus, rather than basing all analyses on objective measures of emotional speech, it might be fruitful to employ, in addition, variables that reflect the transformations that distal cues undergo in the representation as proximal cues after passage through the hearing mechanism (e.g., perceived loudness, perceived pitch, perceived rhythm, or Bark bands in spectral analysis; see Zwicker, 1982).
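As a small example of such perceptually motivated variables, the sketch below converts frequency to the Bark scale. The conversion uses Traunmüller's (1990) approximation of Zwicker's critical-band scale, which is an assumption of this example; the text cites Zwicker (1982) without prescribing a particular formula.

```python
# Hedged sketch: Hz-to-Bark conversion (Traunmueller's approximation).
import numpy as np

def hz_to_bark(f_hz):
    """z = 26.81 * f / (1960 + f) - 0.53 (Traunmueller, 1990)."""
    f = np.asarray(f_hz, dtype=float)
    return 26.81 * f / (1960.0 + f) - 0.53

# Example: the telephone band expressed in perceptual (Bark) units
print(hz_to_bark([300.0, 3400.0]))  # approximately [3.0, 16.5]
```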

2.5. Representation studies

The result of sound transmission and reception by the listener's hearing system is the storage, in short-term memory, of the proximal cues or percepts of the speaker's vocal production. Modeling of the voice communication process with the Brunswikian lens model requires direct measurement of these proximal cues, i.e., the subjective impressions of the vocal qualities. Unfortunately, this is one of the most difficult aspects of the research guided by this model. So far, the only possibility to empirically measure these proximal cues is to rely on listeners' verbal report of their impressions of the speaker's voice and speech. As is always the case with verbal report, the description is constrained by the semantic categories for which there are words in a specific language. In addition, since one wants to measure the impressions of naïve, untrained listeners, these words must be currently known and used, in order to ensure that all listeners studied understand the same concept and the underlying acoustic pattern by a particular word. In most languages, including English, there are only relatively few words describing vocal qualities, and only a subset of those is frequently used in normal speech or in literature.

Only very few efforts have been made to develop instruments that can be used to obtain proximal voice percepts. One of the pioneers in this area has been John Laver (1980, 1991), who suggested a vocal profile technique to describe voice quality on the basis of phonatory and articulatory settings and to obtain reliable judgments on these settings. In some ways, of course, Laver attempts to get at the distal cues via the use of perceived voice quality rather than obtaining an operationalized measure of the subjective impression (in other words, he wants his judges to agree, whereas a psychologist interested in proximal cues would be happy to see differences in perceived voice quality). Scherer (1978, 1982a,b) designed a rating sheet to obtain voice and speech ratings reflecting such subjective impressions from naïve listeners. This effort was later continued by an interdisciplinary group of researchers (Sangsue et al., 1997). One of the main problems is to obtain sufficient reliability (as measured by interrater agreement) for many of the categories that are not very frequently employed, such as nasal, sharp, or rough. In our laboratory, Bänziger is currently involved in another attempt to develop a standard instrument for proximal voice cue measurement in the context of Brunswikian modeling (see Bänziger and Scherer, 2001).
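A screening step of the kind described, dropping rating categories that do not reach sufficient interrater agreement, might look like the following sketch. Cronbach's alpha across judges is used here as one common agreement index; the threshold and the demonstration data are invented.

```python
# Hedged sketch: screen a proximal voice-rating scale by interrater
# agreement (Cronbach's alpha across judges); threshold is illustrative.
import numpy as np

def cronbach_alpha(ratings):
    """ratings: stimuli x judges matrix for one rating category."""
    k = ratings.shape[1]
    judge_vars = ratings.var(axis=0, ddof=1).sum()  # per-judge variances
    total_var = ratings.sum(axis=1).var(ddof=1)     # variance of sum scores
    return (k / (k - 1)) * (1.0 - judge_vars / total_var)

rng = np.random.default_rng(1)
stimulus_effect = rng.normal(size=(20, 1))            # shared signal per stimulus
ratings = stimulus_effect + rng.normal(size=(20, 5))  # 20 stimuli, 5 judges
alpha = cronbach_alpha(ratings)
print(f"alpha = {alpha:.2f}")  # retain the category only if, say, alpha > .70
```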

Representation studies, of which there are currently extremely few (see Scherer, 1974, for an example in the area of personality), study the inference structures, i.e., the mental algorithms used by listeners to infer emotion on the basis of the proximal cues. The inference algorithms must be based on some kind of representation, grounded in cultural lore and/or individual experience, as to which vocal impressions are likely to be indicative of particular types of emotion.
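In the simplest case, such inference structures can be estimated by regressing listener attributions on the measured proximal cues, with a parallel regression of proximal on distal cues capturing the representation step. The sketch below does this on simulated data; all variable names and coefficients are hypothetical.

```python
# Hedged lens-model sketch on simulated data: estimate listeners'
# cue-utilization weights by least squares (all values hypothetical).
import numpy as np

rng = np.random.default_rng(2)
n = 120
distal = rng.normal(size=(n, 3))  # e.g., measured F0 mean, intensity, rate

# Representation: proximal percepts as noisy transforms of distal cues
mix = np.array([[0.8, 0.1, 0.0],
                [0.1, 0.7, 0.1],
                [0.0, 0.2, 0.9]])
proximal = distal @ mix + 0.3 * rng.normal(size=(n, 3))

# Inference: attributions as a weighted combination of proximal cues
attribution = proximal @ np.array([0.6, 0.3, -0.4]) + 0.5 * rng.normal(size=n)

# Least-squares fit of attributions on proximal cues (with intercept)
X = np.column_stack([np.ones(n), proximal])
weights, *_ = np.linalg.lstsq(X, attribution, rcond=None)
print("estimated utilization weights:", np.round(weights[1:], 2))
```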

3. Issues for future research

3.1. How should emotion be defined?

One of the major problems in this area is the lack of a consensual definition of emotion and of qualitatively different types of emotions. The problem already starts with the distinction of emotions from other types of affective states of the speaker, such as moods, interpersonal stances, attitudes, or even affective personality traits. Scherer (2000a) has suggested using a design feature approach to distinguish these different types of affective states. Table 4 reproduces this proposal. It is suggested to distinguish the different states by their relative intensity and duration, by the degree of synchronization of organismic subsystems, by the extent to which the reaction is focused on an eliciting event, by the degree to which appraisal has played a role in the elicitation, by the rapidity of state changes, and by the extent to which the state affects open behavior. As the table shows, the different types of affective states do vary quite substantially on the basis of such a design feature grid, and it should be possible to distinguish them relatively clearly, if such an effort were consistently made.

It should be noted that emotions constitute quite a special group among the different types of affective states. They are more intense, but of shorter duration. In particular, they are characterized by a high degree of synchronization between the different subsystems. Furthermore, they are likely to be highly focused on eliciting events and produced through cognitive appraisal. In terms of their effect, they strongly impact behavior and are likely to change rapidly. One of the issues that is very important in the context of studying the role of voice and speech as affected by different types of affective states is the question as to how often particular types of affective states are likely to occur, and in which contexts. This is particularly important for applied research that is embedded in speech technology contexts. Obviously, it may be less useful to study the effects of very rare affective events on voice and speech if one wants to develop a speech technology instrument that can be commonly used in everyday life.

Another issue that must not be neglected is the existence of powerful regulation mechanisms as determined by display rules and feeling rules (Gross and Levenson, 1997; see Scherer, 1994a, 2000b for a detailed discussion). Such regulation efforts are particularly salient in the case of very intense emotions, such as hot anger, despair, strong fear, etc. It is unfortunate that it is exactly those kinds of emotions, likely to be highly regulated and masked under normal circumstances in everyday life, that are consistently studied by research in this area. While such research is of great interest in trying to understand the psychobiological basis of emotional expression in the voice, it is not necessarily the best model to look at the effect of affective states on voice and speech in everyday life, with the potential relevance to speech technology development. Rather, one would think that the kinds of affective modifications of voice and speech that are likely to be much more pervasive in everyday speaking, such as the effects of stress (see Baber and Noyes, 1996; Tolkmitt and Scherer, 1986), deception (see Anolli and Ciceri, 1997; Ekman et al., 1991), relatively mild mood changes, attitudes, or interpersonal stances or styles vis-à-vis a particular person (see Ambady et al., 1996; Granström, 1992; Ofuka et al., 2000; Scherer et al., 1984; Siegman, 1987a,b; Tusing and Dillard, 2000), may be of much greater import to that kind of applied question.

3.2. Which theoretical model of emotion should be adopted?

This issue, which has received little attention so far, has strong implications for the types of emotions to be studied, given that different psychological theories of emotion have proposed different ways of cutting up the emotion domain into different qualitative states or dimensions. The issue of the most appropriate psychological model of emotion is currently intensively debated in psychology. Scherer (2000a) has provided an outline of the different types of approaches and their relative advantages and disadvantages. The choice of model also determines the nature of the theoretical predictions for vocal patterning. While past research in this area has been essentially atheoretical, it would be a sign of scientific maturity if future studies were based on an explicit set of hypotheses, allowing more cumulative research efforts.

Table 4
Design feature delimitation of different affective states

Emotion: relatively brief episode of synchronized response of all or most organismic subsystems in response to the evaluation of an external or internal event as being of major significance (angry, sad, joyful, fearful, ashamed, proud, elated, desperate)
  Intensity: ##–###; Duration: #; Synchronization: ###; Event focus: ###; Appraisal elicitation: ###; Rapidity of change: ###; Behavioral impact: ###

Mood: diffuse affect state, most pronounced as change in subjective feeling, of low intensity but relatively long duration, often without apparent cause (cheerful, gloomy, irritable, listless, depressed, buoyant)
  Intensity: #–##; Duration: ##; Synchronization: #; Event focus: #; Appraisal elicitation: #; Rapidity of change: ##; Behavioral impact: #

Interpersonal stances: affective stance taken toward another person in a specific interaction, colouring the interpersonal exchange in that situation (distant, cold, warm, supportive, contemptuous)
  Intensity: #–##; Duration: #–##; Synchronization: #; Event focus: ##; Appraisal elicitation: #; Rapidity of change: ###; Behavioral impact: ##

Attitudes: relatively enduring, affectively coloured beliefs, preferences, and predispositions towards objects or persons (liking, loving, hating, valuing, desiring)
  Intensity: 0–##; Duration: ##–###; Synchronization: 0; Event focus: 0; Appraisal elicitation: #; Rapidity of change: 0–#; Behavioral impact: #

Personality traits: emotionally laden, stable personality dispositions and behavior tendencies, typical for a person (nervous, anxious, reckless, morose, hostile, envious, jealous)
  Intensity: 0–#; Duration: ###; Synchronization: 0; Event focus: 0; Appraisal elicitation: 0; Rapidity of change: 0; Behavioral impact: #

0: low, #: medium, ##: high, ###: very high, –: indicates a range.


The two theoretical traditions that have most strongly determined past research in this area are discrete and dimensional emotion theories. Discrete theories stem from Darwin and are defended by Tomkins (1962), Ekman (1992) and Izard (1971). Theorists in this tradition propose the existence of a small number, between 9 and 14, of basic or fundamental emotions that are characterized by very specific response patterns in physiology as well as in facial and vocal expression. Most studies in the area of the vocal effects of emotion have followed this model, choosing to examine the effects of happiness, sadness, fear, anger and surprise. More recently, the term ‘‘the big six’’ has gained some currency, implying the existence of a fundamental set of six basic emotions (although there does not seem to be agreement on which six these should be). Unfortunately, to date there has been no attempt by discrete emotion theorists to develop concrete predictions for vocal patterning.

Another model that has guided quite a bit of research in this area is the dimensional approach to emotion, which goes back to Wundt (1874/1905). In this tradition, different emotional states are mapped in a two- or three-dimensional space. The two major dimensions consist of a valence dimension (pleasant–unpleasant, agreeable–disagreeable) and an activity dimension (active–passive) (see Scherer, 2000a). If a third dimension is used, it often represents either power or control. Researchers using this model in the investigation of emotion effects on voice often restrict their approach to studying positive versus negative and active versus passive differences in the emotional states, both with respect to production and measurement of listener inferences. Recently, Bachorowski and her collaborators (Bachorowski, 1999; Bachorowski and Owren, 1995) have called attention to this approach. However, a more elaborate theoretical underpinning of this position and the development of concrete predictions remain to be developed.

There is some likelihood that a new set of psychological models, componential models of emotion, which are often based on appraisal theory (see Scherer et al., 2001b), will gain more influence in this area. The advantage of such componential models is that attention is not restricted to subjective feeling state, as is the case with dimensional theories, nor to a limited number of supposedly basic emotions, as is the case with discrete emotion theory. Componential models are much more open with respect to the kind of emotional states that are likely to occur, given that they do not expect emotions to consist of a limited number of relatively fixed neuromotor patterns, as in discrete emotion theory. And rather than limiting the description to two basic dimensions, these models emphasize the variability of different emotional states, as produced by different types of appraisal patterns. Furthermore, they permit modeling the distinctions between members of the same emotion family, an issue that, as described above, plays a very important role in this research area.

The most important advantage of componential models, as shown, for example, by Scherer's component process model described above, is that these approaches provide a solid basis for theoretical elaborations of the mechanisms that are supposed to underlie the emotion–voice relationship and permit the generation of very concrete hypotheses that can be tested empirically. Table 5 provides an example of the hypotheses on the effect of appraisal results on the vocal mechanisms and the ensuing changes in acoustic parameters. These predictions can be examined directly by systematically manipulating the appraisal dimensions in laboratory studies (see below). Furthermore, this set of predictions also generates predictions for complete patterns of modal emotions (see Scherer, 1994b, for the difference to basic emotions), since there are detailed hypotheses as to which profiles of appraisal outcomes on the different dimensions produce certain modal emotions. For example, a modal emotion that is part of the fear family is predicted to be produced by the appraisal of an event or situation as obstructive to one's central needs and goals, requiring urgent action, being difficult to control through human agency, and a lack of sufficient power or coping potential to deal with the situation. The major difference to anger-producing appraisal is that the latter entails a much higher evaluation of controllability and available coping potential.

Based on these kinds of hypotheses, shared among appraisal theorists, the predictions in Table 5 can be synthesized into predicted patterns of acoustic changes that can be expected to characteristically occur during specific modal emotions. These predicted patterns are shown in Table 6. This table also provides information on which predictions have been supported (predicted change occurring in the same direction) and which have been directly contradicted (actual change going in the opposite direction from prediction) in the study by Banse and Scherer (1996). Another test of these predictions has recently been performed by Juslin and Laukka (2001). These researchers again found support for many predictions as well as cases of contradiction, indicating the need to revise the theoretical predictions.

Table 5
Component patterning theory predictions for voice changes (adapted from Scherer, 1986)

Novelty check
  Novel: interruption of phonation, sudden inhalation, ingressive (fricative) sound with glottal stop (noise-like spectrum).
  Not novel: no change.

Intrinsic pleasantness check
  Pleasant: faucal and pharyngeal expansion, relaxation of tract walls, vocal tract shortened by mouth corners retracted upward (increase in low-frequency energy, F1 falling, slightly broader F1 bandwidth, velopharyngeal nasality, resonances raised): ‘‘wide voice’’.
  Unpleasant: faucal and pharyngeal constriction, tensing of tract walls, vocal tract shortened by mouth corners retracted downward (more high-frequency energy, F1 rising, F2 and F3 falling, narrow F1 bandwidth, laryngopharyngeal nasality, resonances raised): ‘‘narrow voice’’.

Goal/need significance check
  Relevant and consistent: overall relaxation of vocal apparatus, increased salivation (F0 at lower end of range, low-to-moderate amplitude, balanced resonance with slight decrease in high-frequency energy): ‘‘relaxed voice’’. If event conducive to goal: relaxed voice + wide voice; if event obstructive to goal: relaxed voice + narrow voice.
  Relevant and discrepant: overall tensing of vocal apparatus, decreased salivation (F0 and amplitude increase, jitter and shimmer, increase in high-frequency energy, narrow F1 bandwidth, pronounced formant frequency differences): ‘‘tense voice’’. If event conducive to goal: tense voice + wide voice; if event obstructive to goal: tense voice + narrow voice.

Coping potential check
  Control: (as for relevant and discrepant): ‘‘tense voice’’.
  No control: hypotonus of vocal apparatus (low F0 and restricted F0 range, low amplitude, weak pulses, very low high-frequency energy, spectral noise, formant frequencies tending toward neutral setting, broad F1 bandwidth): ‘‘lax voice’’.
  High power: deep, forceful respiration, chest register phonation (low F0, high amplitude, strong energy in entire frequency range): ‘‘full voice’’.
  Low power: head register phonation (raised F0, widely spaced harmonics with relatively low energy): ‘‘thin voice’’.

Norm/self-compatibility check
  Standards surpassed: wide voice + full voice (+ relaxed voice if expected, or + tense voice if unexpected).
  Standards violated: narrow voice + thin voice (+ lax voice if no control, or + tense voice if control).


Table 6
Predictions for emotion effects on selected acoustic parameters (based on Table 5 and appraisal profiles; adapted from Scherer, 1986)

[The table lists, for twelve emotion pairs (columns), the predicted direction of change in F0 perturbation, mean, range, variability, contour, and shift regularity; F1 mean, F2 mean, F1 bandwidth, and formant precision; intensity mean, range, and variability; spectral frequency range, high-frequency energy, and spectral noise; and speech rate and transition time.]

Note: ANX/WOR: anxiety/worry; BOR/IND: boredom/indifference; CON/SCO: contempt/scorn; DISP/DISG: displeasure/disgust; ELA/JOY: elation/joy; ENJ/HAP: enjoyment/happiness; FEAR/TER: fear/terror; GRI/DES: grief/desperation; IRR/COA: irritation/cold anger; RAG/HOA: rage/hot anger; SAD/DEJ: sadness/dejection; SHA/GUI: shame/guilt; F0: fundamental frequency; F1: first formant; F2: second formant; >: increase; <: decrease. Double symbols indicate increased predicted strength of the change. Two symbols pointing in opposite directions refer to cases in which antecedent voice types exert opposing influence. (U) prediction supported, ( ) prediction contradicted by results in Banse and Scherer (1996).


3.3. How should emotional speech samples be obtained?

Part of this question is obviously linked to the psychological model of emotion that is being adopted and the considerations that follow from it. Another issue concerns the aim of the respective study, for example its relative link to applied questions, for which a high degree of ecological validity is needed. However, the most important issues concern the nature of the emotion induction and the context in which the voice samples are obtained. As mentioned above, most research has used the portrayal of vocal emotions by lay or professional actors. This approach has been much criticized because of the potential artifacts that might occur with respect to actors using stereotypical expression patterns. However, as discussed above, the issue is much more complex. It can be argued that actors using auto-induction can produce valid exemplars of vocal expression. With respect to regulation attempts such as display rules, it has been argued that most normal speech can be expected to be more or less ‘‘acted’’ in a way that is quite similar to professional acting (see above).

On the whole, it would seem that many of the issues having to do with portrayal have been insufficiently explored. For example, there is good evidence that the use of laymen or even lay actors is potentially flawed, since these speakers do not master the ability to voluntarily encode emotions in such a way that they can be reliably recognized. It is generally required that portrayed emotions be well recognized by listeners, representing at least what can be considered a shared communication pattern for emotional information in a culture. Using professional actors mastering Stanislavski techniques or method acting, as well as providing appropriate instructions for the portrayals, including realistic scenarios, can alleviate many of the difficulties that may be associated with acting. Yet, obviously, one has to carefully investigate to what extent such acted material corresponds to naturally occurring emotional speech. Unfortunately, so far there has been no study in which a systematic attempt has been made to compare portrayed and naturally occurring vocal emotions. This would seem to be one of the highest priorities in the field. A first attempt is currently being made by Scherer and his collaborators in the context of the ASV study mentioned above: in addition to the experimental induction of different states through a computerized induction battery, speakers are asked to portray a number of comparable states and of discrete emotions, allowing comparison of natural and portrayed affect variations in the voice within and across speakers (Scherer et al., 2000).

With the increasing availability of emotional material in the media, especially television, there has been a tendency to rely on emotional speech recorded off the air. However, this procedure also has many obvious drawbacks. As to the presumed spontaneity of the expression, one cannot, unfortunately, exclude that there have been prior arrangements between the organizers of game or reality shows and the participants. One general question that can never be answered satisfactorily is to what extent the fact of being on the air will make speakers act almost like an actor would in a particular situation. This danger is particularly pronounced in cases in which normal individuals are on public display, as in television shows. Not only is their attention focused on how their expressions will be evaluated by the viewers, encouraging control of expression (display rules; Ekman, 1972); in addition, there are strong ‘‘feeling rules’’ in the sense of what they think is required of them as an appropriate emotional response (Hochschild, 1983). Furthermore, it is not always obvious which emotion the speakers have really experienced in the situation, since in many cases speakers cannot be interrogated on what they really felt (especially in cases in which material is taped off the radio or television). Generally, researchers infer emotional quality on the basis of the type of situation, assuming that everybody would react in the same fashion to a comparable event. However, as appraisal theorists of emotion have convincingly shown (see overviews in Scherer, 1999a; Scherer et al., 2001b), it is the subjective appraisal of a situation, which may be highly different between individuals, that determines the emotional reaction. Finally, in many cases it is impossible to obtain a more systematic sampling of vocal production for different types of emotions for the same speaker and a comparable set for other speakers. This, however, is absolutely essential, given the great importance of individual differences in the vocal encoding of emotion (see Banse and Scherer, 1996). In consequence, the hope of speech researchers to use emotional expression displays recorded from television game shows or other material of this sort as a corpus of ‘‘natural expressions’’ may well rest on a somewhat hollow foundation.

One of the methods to obtain emotional voice samples that has been rather neglected in this research domain is the experimental induction of emotion. There are several possibilities for changing the state of speakers through experimental manipulations that have been shown to yield rather consistent results. One of the most straightforward possibilities is to increase the cognitive load of the speaker while producing speech, generally through some kind of rather effortful cognitive task. This procedure is often used to induce cognitive stress and has been shown to reliably produce speech changes (Bachorowski and Owren, 1995; Scherer et al., 1998; Karlsson et al., 2000; Tolkmitt and Scherer, 1986). Another possibility is the presentation of agreeable and disagreeable media material, such as slides, videotapes, etc., which has been shown to produce relatively consistent emotional effects in many subjects (Westermann et al., 1996). This procedure has also been shown to reliably produce vocal changes in a number of studies. It is often the method of choice in studies of facial expression, and one could assume, in consequence, that it could provide a useful method for vocal researchers as well.

Psychologists have developed quite a number of induction techniques of this type. One of the more frequently used is the so-called Velten induction technique (Westermann et al., 1996), in which speakers read highly emotionally charged sentences over and over again, which seems to produce rather consistent affective changes. Another possibility, used rather frequently in the domain of physiological psychology, is to produce rather realistic emotional inductions through the behavior of experimenters in the laboratory. For example, Stemmler and his collaborators (Stemmler et al., 2001) have angered subjects through rather arrogant and offensive behavior of the experimenter. The size of the experimental physiological reactions showed the degree to which that manipulation resulted in rather intense emotions. Thomas Scherer (2000) has analyzed the vocal production of the subjects under this condition, demonstrating rather sizable effects for several vocal parameters. Finally, a number of researchers have experimented with computer games as a means of emotion induction in highly motivated players (see Johnstone et al., 2001; Kappas, 1997). For example, in order to examine some of the detailed predictions of Scherer's (1986) component process model, Johnstone (2001) manipulated different aspects of an arcade game (e.g., systematically varying a player's power by varying shooting capacity) and observed the direct effect of emotion-antecedent appraisal results on vocal parameters. A number of significant main effects, and particularly interaction effects for different appraisal dimensions, illustrate the high promise of this approach.

Obviously, the experimental manipulation of emotional state is the preferred method for obtaining controlled voice and speech variations over many different speakers, allowing one to clearly examine the relative importance of different emotions as well as individual differences between speakers. Unless this possibility of investigating within-speaker differences for different emotions and between-speaker differences for a set of emotions is provided, our knowledge of the mechanisms that allow emotions to mediate voice production and to affect acoustic parameters will remain very limited indeed.

Another issue that is of great importance in this context is the validation of the effect of a portrayal or manipulation. In many recent studies, it seems to be taken for granted that if one requires the speaker, either a lay or professional actor, to produce a certain portrayal, the effect is likely to be satisfactory. As can be shown in a large number of studies, one has to use rather sophisticated selection methods to select those portrayals that do actually communicate the desired emotional state. In consequence, in this context it is not advisable to do acoustic parameter extraction for vocal portrayals for which it has not been clearly demonstrated, using appropriate statistical techniques, that judges can reliably recognize the emotion to be portrayed (see Banse and Scherer, 1996). Similarly, in studies that record natural emotional speech from the media, more effort needs to be made to ensure that the presumed emotional arousal is actually present. As mentioned above, it has generally been taken for granted that a certain situation will provoke certain types of emotions. Appraisal theorists have shown (Scherer, 1999a; Roseman and Smith, 2001) that this is not necessarily the case. Rather, there is a large likelihood that different individuals will react very differently to a similar situation. To accept a researcher's impression of what emotion underlies a particular expression as a validation is clearly methodologically unsound. At the very least, one would require a consensual judgment of several individuals on what the likely meaning of the expression is. In most experimental studies using induction methods, the validation is automatically produced through the use of manipulation checks, probably the best method to ensure that a particular emotional state has been at the basis of a particular type of voice production.
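One simple way to operationalize such a selection criterion is to retain only portrayals whose intended emotion is recognized significantly above chance, for example with a binomial test against the guessing rate. The sketch below illustrates this; the criterion and the numbers are invented, not the selection procedure of Banse and Scherer (1996).

```python
# Hedged sketch: keep a portrayal only if judges recognize the intended
# emotion above the chance level of a forced-choice task.
from scipy.stats import binomtest

def keep_portrayal(n_correct, n_judges, n_alternatives, alpha=0.05):
    """Binomial test of recognition against chance (1 / n_alternatives)."""
    result = binomtest(n_correct, n_judges, p=1.0 / n_alternatives,
                       alternative="greater")
    return result.pvalue < alpha

# 14 of 20 judges picked "fear" among 6 alternatives: clearly above chance
print(keep_portrayal(14, 20, 6))  # True
```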

3.4. How to measure emotional inferences?

Most studies in this area have used the rather convenient procedure of presenting listeners with lists of emotions that correspond to the types of stimuli used in the study, requiring them to judge which emotion best describes a particular stimulus (categorical choice). In some studies, judges are given the option to indicate on the rating sheet the relative intensity of several emotions (emotion blends). In the psychology of emotion there has recently been quite a debate about the adequacy of different types of methods to obtain judgments on expressive material. In the domain of the facial expression of emotion, Russell (1994) has strongly attacked the use of fixed lists of alternatives, demonstrating potential artifacts with this method. Critics have pointed out that some of the demonstrations by Russell are clearly biased (Ekman, 1994). The method of providing fixed alternatives may not yield results that are markedly different from asking subjects to use free report (Ekman, personal communication). However, the debate has shown that one cannot take a particular method for granted and that it is necessary to examine potential problems with different methods (see Frank and Stennett, 2001).

In addition, there are serious problems of statistical analysis, especially in the case where judges have to provide intensity ratings for several emotions. In many cases, there are large numbers of zero responses, which changes the nature of the statistical analyses to be used. Furthermore, there is no established method for analyzing mixed or blended emotions, with respect both to production and to the inference from expressive material. All these issues pose serious problems for research in this area which so far remain largely unacknowledged (see Scherer, 1999a, for a review of these issues).

3.5. Which parameters should be measured?

As has been shown above, part of the problem in demonstrating the existence of qualitative differences between the vocal expression of different types of discrete emotions, and in countering the hypothesis that the voice might only reflect arousal, has been due to a limitation with respect to the number of acoustic parameters extracted. Recent research using a larger number of parameters (Banse and Scherer, 1996; Klasmeyer, 2000) has shown that spectral parameters play a major role in differentiating qualitative differences between emotions. In consequence, it is absolutely essential that a large variety of parameters be measured in future research. One of the more important aims would seem to be to achieve some kind of standardization with respect to the types of parameters extracted. In order to ensure accumulation of findings and a high degree of comparability of results, it is necessary to measure similar sets of parameters (including similar transformations and aggregations) in studies conducted by different investigators in multiple laboratories.
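As a pointer to what a standardized minimal set could look like, the sketch below extracts a few of the parameter families discussed in this paper (F0 statistics, energy, relative high-frequency energy) from one utterance. The use of librosa and the 1 kHz cutoff are assumptions of the example; this is not the feature set of Banse and Scherer (1996).

```python
# Hedged sketch: extract a small, standardized acoustic parameter set.
import numpy as np
import librosa

def basic_params(path):
    y, sr = librosa.load(path, sr=None)
    f0, _, _ = librosa.pyin(y, fmin=60, fmax=500, sr=sr)  # F0 track (Hz)
    f0 = f0[~np.isnan(f0)]                                # voiced frames only
    power = np.abs(librosa.stft(y)) ** 2
    freqs = librosa.fft_frequencies(sr=sr)
    hf_ratio = power[freqs > 1000.0].sum() / power.sum()  # spectral balance
    return {"f0_mean": float(f0.mean()),
            "f0_range": float(f0.max() - f0.min()),
            "rms_mean": float(librosa.feature.rms(y=y).mean()),
            "hf_energy_ratio": float(hf_ratio)}

print(basic_params("utterance.wav"))  # hypothetical file
```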


3.6. New questions and research paradigms

In recent years there have been enormous technological advances in speech technology, physiological recording, and brain imagery. The quality of these methods and their widespread availability will hopefully encourage future research attempting to come to grips with the underlying mechanisms in voice production and voice perception. As mentioned above, there is currently a strong increase in work on the neurophysiological grounding of the perception of vocal and prosodic features, especially as linked to emotional expression. Unfortunately, there has been less neuroscience research on voice production in states of emotional arousal. There is little question that the study of the brain mechanisms underlying the production and perception of emotional speech will be one of the major growth areas in this field. The same is true for speech technology. While the engineers working in this area showed little interest in emotional speech until fairly recently, this situation has drastically changed (as demonstrated by the ISCA conference in Newcastle upon which this special issue is based; see the editors' introduction). Unfortunately, there is still a tendency on their part to neglect the underlying mechanisms in the interest of fast algorithmic solutions.

In spite of the methodological advances and the much increased research activity in this area, one major shortcoming––the lack of interaction between researchers from various disciplines, working on different angles and levels of the problem––remains unresolved. If one were able to overcome this isolation, allowing neuroscientists, for example, to use high-quality synthesis of emotional utterances, systematically varying theoretically important features, one could rapidly attain a much more sophisticated type of research design and extremely interesting data. However, beneficial interdisciplinary effects are not limited to technologically advanced methodology. Neuroscientists and speech scientists could also benefit from advances in the psychology of emotion, allowing them to ask more subtle questions than that of the different acoustic patterning of a handful of so-called basic emotions.

4. Conclusion

In the past, studies on the facial expression of emotion have far outnumbered work on vocal expression. There are signs that this may change. Given the strong research activity in the field, there is hope that there will soon be a critical mass of researchers that can form an active and vibrant research community. The work produced by such a community will have an extraordinary impact since it is, to a large extent, interdisciplinary work, with researchers coming from acoustics, engineering, linguistics, phonetics, and psychology, and contributing concepts, theories, and methods from their respective disciplines. Given the ever-increasing sophistication of voice technology, both with respect to analysis and synthesis, in the service of this research, we are likely to see some dramatic improvement in the scope and quality of research on vocal correlates of emotion and affective attitudes in voice and speech. However, in our enthusiasm to use the extraordinary power of voice technology, we need to be careful not to compromise the adequacy of the overall research design and the sophistication of emotion definition and measurement for the elegance of vocal measurement and synthesis. To this end, and to arrive at a truly cumulative process of building up research evidence, we need a thorough grounding of present and future research in past work and its achievements. In line with the position taken throughout this paper, I strongly encourage the continuous refinement of theoretical models based on established literature in all of the disciplines pertinent to the issue of emotion effects on the voice. This can serve as a strong antidote to one-shot, quickly designed studies that are more likely than not to repeat the errors made in the past.

References

Abadjieva, E., Murray, I.R., Arnott, J.L., 1995. Applying analysis of human emotional speech to enhance synthetic speech. In: Proc. Eurospeech 95, pp. 909–912.

Allport, G.W., Cantril, H., 1934. Judging personality from voice. J. Soc. Psychol. 5, 37–55.


Alpert, M., Kurtzberg, R.L., Friedhoff, A.J., 1963. Transient voice changes associated with emotional stimuli. Arch. Gen. Psychiatr. 8, 362–365.

Ambady, N., Koo, J., Lee, F., Rosenthal, R., 1996. More than words: Linguistic and nonlinguistic politeness in two cultures. J. Pers. Soc. Psychol. 70 (5), 996–1011.

Amir, N., Ron, S., 1998. Towards an automatic classification of emotions in speech. In: Proc. ICSLP 98, Vol. 3, pp. 555–558.

Anolli, L., Ciceri, R., 1997. The voice of deception: Vocal strategies of naive and able liars. J. Nonverb. Behav. 21 (4), 259–284.

Baber, C., Noyes, J., 1996. Automatic speech recognition in adverse environments. Hum. Fact. 38 (1), 142–155.

Bachorowski, J.A., 1999. Vocal expression and perception of emotion. Current Direct. Psychol. Sci. 8 (2), 53–57.

Bachorowski, J.A., Owren, M.J., 1995. Vocal expression of emotion: Acoustic properties of speech are associated with emotional intensity and context. Psychol. Sci. 6 (4), 219–224.

Banse, R., Scherer, K.R., 1996. Acoustic profiles in vocal emotion expression. J. Pers. Soc. Psychol. 70 (3), 614–636.

Bänziger, T., Scherer, K.R., 2001. Relations entre caractéristiques vocales perçues et émotions attribuées. In: Proc. Journées Prosodie, Grenoble, 10–11 October 2001.

Bonner, M.R., 1943. Changes in the speech pattern under emotional tension. Amer. J. Psychol. 56, 262–273.

Borod, J.C., Pick, L.H., Hall, S., Sliwinski, M., Madigan, N., Obler, L.K., Welkowitz, J., Canino, E., Erhan, H.M., Goral, M., Morrison, C., Tabert, M., 2000. Relationships among facial, prosodic, and lexical channels of emotional perceptual processing. Cogn. Emot. 14 (2), 193–211.

Bortz, J., 1966. Physikalisch-akustische Korrelate der vokalen Kommunikation. Arbeiten aus dem psychologischen Institut der Universität Hamburg, 9 [Reports of the Psychology Department, University of Hamburg, Vol. 9].

Breitenstein, C., Van Lancker, D., Daum, I., 2001. The contribution of speech rate and pitch variation to the perception of vocal emotions in a German and an American sample. Cogn. Emot. 15 (1), 57–79.

Brunswik, E., 1956. Perception and the Representative Design of Psychological Experiments. University of California Press, Berkeley.

Bryman, A., Cramer, D., 1990. Quantitative Data Analysis for Social Scientists. Taylor and Francis, Hove, Great Britain.

Burkhardt, F., Sendlmeier, W.F., 2000. Verification of acoustical correlates of emotional speech using formant-synthesis. In: Proc. ISCA Workshop on Speech and Emotion, 5–7 September 2000, Newcastle, Northern Ireland, pp. 151–156.

Caffi, C., Janney, R.W., 1994. Toward a pragmatics of emotive communication. J. Pragmat. 22, 325–373.

Cahn, J., 1990. The generation of affect in synthesised speech. J. Amer. Voice I/O Soc. 8, 1–19.

Carlson, R., 1992. Synthesis: Modeling variability and constraints. Speech Communication 11 (2–3), 159–166.

Carlson, R., Granström, B., Nord, L., 1992. Experiments with emotive speech––acted utterances and synthesized replicas. In: Proc. ICSLP 92, Vol. 1, pp. 671–674.

Coleman, R.F., Williams, R., 1979. Identification of emotional states using perceptual and acoustic analyses. In: Lawrence, V., Weinberg, B. (Eds.), Transcript of the 8th Symposium: Care of the Professional Voice, Part I. The Voice Foundation, New York.

Cosmides, L., 1983. Invariances in the acoustic expression of emotion during speech. J. Exp. Psychol.: Hum. Percept. Perform. 9, 864–881.

Costanzo, F.S., Markel, N.N., Costanzo, P.R., 1969. Voice quality profile and perceived emotion. J. Couns. Psychol. 16, 267–270.

Cowie, R., Douglas-Cowie, E., 1996. Automatic statistical analysis of the signal and prosodic signs of emotion in speech. In: Proc. ICSLP'96, Philadelphia, pp. 1989–1992.

Darwin, C., 1872. The Expression of Emotions in Man and Animals. John Murray, London (third ed., P. Ekman (Ed.), 1998, Harper Collins, London).

Davitz, J.R., 1964. The Communication of Emotional Meaning. McGraw-Hill, New York.

de Gelder, B., 2000. Recognizing emotions by ear and by eye. In: Lane, R.D., Nadel, L. (Eds.), Cognitive Neuroscience of Emotion. Oxford University Press, New York, pp. 84–105.

de Gelder, B., Vroomen, J., 2000. Bimodal emotion perception: Integration across separate modalities, cross-modal perceptual grouping or perception of multimodal events? Cogn. Emot. 14, 321–324.

Duncan, G., Laver, J., Jack, M.A., 1983. A psycho-acoustic interpretation of variations in divers' voice fundamental frequency in a pressured helium–oxygen environment. Work in Progress, 16, 9–16. Department of Linguistics, University of Edinburgh, UK.

Ekman, P., 1972. Universals and cultural differences in facial expression of emotion. In: Cole, J.R. (Ed.), Nebraska Symposium on Motivation. University of Nebraska Press, Lincoln, pp. 207–283.

Ekman, P., 1992. An argument for basic emotions. Cogn. Emot. 6 (3/4), 169–200.

Ekman, P., 1994. Strong evidence for universals in facial expressions: A reply to Russell's mistaken critique. Psychol. Bull. 115, 268–287.

Ekman, P., O'Sullivan, M., Friesen, W.V., Scherer, K.R., 1991. Face, voice, and body in detecting deception. J. Nonverb. Behav. 15, 125–135.

Eldred, S.H., Price, D.B., 1958. A linguistic evaluation of feeling states in psychotherapy. Psychiatry 21, 115–121.

Erickson, D., Fujimura, O., Pardo, B., 1998. Articulatory correlates of prosodic control: Emphasis and emotion. Lang. Speech 41, 399–417.

Fairbanks, G., Hoaglin, L.W., 1941. An experimental study of the durational characteristics of the voice during the expression of emotion. Speech Monogr. 8, 85–90.

Fairbanks, G., Pronovost, W., 1939. An experimental study of the pitch characteristics of the voice during the expression of emotion. Speech Monogr. 6, 87–104.


Feldman, R.S., Rimé, B. (Eds.), 1991. Fundamentals of Nonverbal Behavior. Cambridge University Press, New York.

Fichtel, C., Hammerschmidt, K., Jürgens, U., 2001. On the vocal expression of emotion. A multi-parametric analysis of different states of aversion in the squirrel monkey. Behaviour 138 (1), 97–116.

Fonagy, I., 1978. A new method of investigating the perception of prosodic features. Lang. Speech 21, 34–49.

Fonagy, I., 1983. La vive voix. Payot, Paris.

Fonagy, I., Magdics, K., 1963. Emotional patterns in intonation and music. Z. Phonetik 16, 293–326.

Frank, M.G., Stennett, J., 2001. The forced-choice paradigm and the perception of facial expressions of emotion. J. Pers. Soc. Psychol. 80 (1), 75–85.

Frick, R.W., 1985. Communicating emotion: The role of prosodic features. Psychol. Bull. 97, 412–429.

Friend, M., Farrar, M.J., 1994. A comparison of content-masking procedures for obtaining judgments of discrete affective states. J. Acoust. Soc. Amer. 96 (3), 1283–1290.

Frolov, M.V., Milovanova, G.B., Lazarev, N.V., Mekhedova, A.Y., 1999. Speech as an indicator of the mental status of operators and depressed patients. Hum. Physiol. 25 (1), 42–47.

Gifford, R., 1994. A lens-mapping framework for understanding the encoding and decoding of interpersonal dispositions in nonverbal behavior. J. Pers. Soc. Psychol. 66, 398–412.

Gramming, P., Sundberg, J., 1988. Spectrum factors relevant to phonetogram measurement. J. Acoust. Soc. Amer. 83, 2352–2360.

Granström, B., 1992. The use of speech synthesis in exploring different speaking styles. Speech Communication 11, 347–355.

Green, R.S., Cliff, N., 1975. Multidimensional comparison of structures of vocally and facially expressed emotion. Percept. Psychophys. 17, 429–438.

Gross, J.J., Levenson, R.W., 1997. Hiding feelings: The acute effects of inhibiting negative and positive emotion. J. Abnorm. Psychol. 106 (1), 95–103.

Hammond, K.R., Stewart, T.R. (Eds.), 2001. The Essential Brunswik: Beginnings, Explications, Applications. Oxford University Press, New York.

Hargreaves, W.A., Starkweather, J.A., Blacker, K.H., 1965. Voice quality in depression. J. Abnorm. Psychol. 70, 218–220.

Harper, R.G., Wiens, A.N., Matarazzo, J.D., 1978. Nonverbal Communication: The State of the Art. Wiley, New York.

Hauser, M.D., 2000. The sound and the fury: Primate vocalizations as reflections of emotion and thought. In: Wallin, N.L., Merker, B. (Eds.), The Origins of Music. The MIT Press, Cambridge, MA, USA, pp. 77–102.

Havrdova, Z., Moravek, M., 1979. Changes of the voice expression during suggestively influenced states of experiencing. Activitas Nervosa Superior 21, 33–35.

Helfrich, H., Standke, R., Scherer, K.R., 1984. Vocal indicators of psychoactive drug effects. Speech Communication 3, 245–252.

Herzog, A., 1933. Stimme und Persönlichkeit [Voice and personality]. Z. Psychol. 130, 300–379.

Heuft, B., Portele, T., Rauth, M., 1996. Emotions in time domain synthesis. In: Proc. ICSLP 96, Vol. 3, Philadelphia, USA, pp. 1974–1977.

Hicks, J.W., 1979. An acoustical/temporal analysis of emotional stress in speech. Diss. Abstr. Intern. 41 (4-A).

Hochschild, A.R., 1983. The Managed Heart: The Commercialization of Human Feeling. University of California Press, Berkeley.

Höffe, W.L., 1960. Über Beziehungen von Sprachmelodie und Lautstärke [On the relationships between speech melody and intensity]. Phonetica 5, 129–159.

Huttar, G.L., 1968. Relations between prosodic variables and emotions in normal American English utterances. J. Speech Hear. Res. 11, 481–487.

Iida, A., Campbell, N., Iga, S., Higuchi, F., Yasumura, M., 1998. Acoustic nature and perceptual testing of corpora of emotional speech. In: Proc. ICSLP 98, Sydney, Vol. 4, pp. 1559–1562.

Isserlin, M., 1925. Psychologisch-phonetische Untersuchungen. II. Mitteilung [Psychological-phonetic studies. Second communication]. Z. Gesamte Neurol. Psychiatr. 94, 437–448.

Izard, C.E., 1971. The Face of Emotion. Appleton-Century-Crofts, New York.

Izard, C.E., 1977. Human Emotions. Plenum Press, New York.

Johannes, B., Petrovitsh Salnitski, V., Gunga, H.C., Kirsch, K., 2000. Voice stress monitoring in space – possibilities and limits. Aviat. Space Environ. Md. 71 (9, Section 2, Suppl.), A58–A65.

Johnstone, T., 2001. The communication of affect through modulation of non-verbal vocal parameters. Ph.D. Thesis, University of Western Australia.

Johnstone, T., Scherer, K.R., 2000. Vocal communication of emotion. In: Lewis, M., Haviland, J. (Eds.), Handbook of Emotion, second ed. Guilford, New York, pp. 220–235.

Johnstone, T., van Reekum, C.M., Scherer, K.R., 2001. Vocal correlates of appraisal processes. In: Scherer, K.R., Schorr, A., Johnstone, T. (Eds.), Appraisal Processes in Emotion: Theory, Methods, Research. Oxford University Press, New York and Oxford, pp. 271–284.

Juslin, P.N., 2000. Cue utilization in communication of emotion in music performance: Relating performance to perception. J. Exp. Psychol.: Hum. Percep. Perform. 26, 1797–1813.

Juslin, P.N., Laukka, P., 2001. Impact of intended emotion intensity on cue utilization and decoding accuracy in vocal expression of emotion. Emotion 1 (4), 381–412.

Kaiser, L., 1962. Communication of affects by single vowels. Synthese 14, 300–319.

Kappas, A., 1997. His master's voice: Acoustic analysis of spontaneous vocalizations in an ongoing active coping task. In: Thirty-Seventh Annual Meeting of the Society for Psychophysiological Research, Cape Cod.

Kappas, A., Hess, U., Scherer, K.R., 1991. Voice and emotion. In: Rimé, B., Feldman, R.S. (Eds.), Fundamentals of Nonverbal Behavior. Cambridge University Press, Cambridge and New York, pp. 200–238.

Karlsson, I., Bänziger, T., Dankovicova, J., Johnstone, T., Lindberg, J., Melin, H., Nolan, F., Scherer, K.R., 1998. Speaker verification with elicited speaking styles in the VeriVox project. In: Proc. RLA2C, Avignon, pp. 207–210.

Karlsson, I., Bänziger, T., Dankovicova, J., Johnstone, T., Lindberg, J., Melin, H., Nolan, F., Scherer, K.R., 2000. Speaker verification with elicited speaking styles in the VeriVox project. Speech Communication 31 (2–3), 121–129.

Kennedy, G., 1972. The Art of Rhetoric in the Roman World. 300 BC–AD 300. Princeton University Press, Princeton, NJ.

Kienast, M., Paeschke, A., Sendlmeier, W.F., 1999. Articulatory reduction in emotional speech. In: Proc. Eurospeech 99, Budapest, Vol. 1, pp. 117–120.

Klasmeyer, G., 1999. Akustische Korrelate des stimmlich emotionalen Ausdrucks in der Lautsprache [Acoustic correlates of vocal emotional expression in speech]. In: Wodarz, H.-W., Heike, G., Janota, P., Mangold, M. (Eds.), Forum Phoneticum, Vol. 67. Hector, Frankfurt am Main.

Klasmeyer, G., 2000. An automatic description tool for time-contours and long-term average voice features in large emotional speech databases. In: Proc. ISCA Workshop Speech and Emotion, Newcastle, Northern Ireland, pp. 66–71.

Klasmeyer, G., Meier, T., 1999. Rhythm in emotional speech. In: DAGA-Tag der ASA-Tagung 99 in Berlin, CD-Rom zur ASA-EAA-DEGA Gemeinschaftstagung Berlin 99.

Klasmeyer, G., Sendlmeier, W.F., 1997. The classification of different phonation types in emotional and neutral speech. Forensic Linguist. 4, 104–124.

Klasmeyer, G., Sendlmeier, W.F., 1999. Voice and emotional states. In: Kent, R., Ball, M. (Eds.), Voice Quality Measurement. Singular Publishing Group, San Diego, CA, pp. 339–359.

Knapp, M.L., 1972. Nonverbal Communication in HumanInteraction. Holt Rinehart and Winston, New York.

Kotlyar, G.M., Morozov, V.P., 1976. Acoustical correlates ofthe emotional content of vocalized speech. J. Acoust. Acad.Sci. USSR 22, 208–211.

Kuroda, I., Fujiwara, O., Okamura, N., Utsuki, N., 1979. Method for determining pilot stress through analysis of voice communication. Aviat. Space Environ. Med. 47, 528–533.

Ladd, D.R., Silverman, K.E.A., Tolkmitt, F., Bergmann, G., Scherer, K.R., 1985. Evidence for the independent function of intonation contour type, voice quality, and F0 range in signaling speaker affect. J. Acoust. Soc. Amer. 78, 435–444.

Laver, J., 1980. The Phonetic Description of Voice Quality. Cambridge University Press, Cambridge.

Laver, J., 1991. The Gift of Speech. Edinburgh University Press, Edinburgh.

Lavner, Y., Gath, I., Rosenhouse, J., 2000. The effects of acoustic modifications on the identification of familiar voices speaking isolated vowels. Speech Communication 30 (1), 9–26.

Levin, H., Lord, W., 1975. Speech pitch frequency as an emotional state indicator. IEEE Trans. Systems, Man Cybernet. 5, 259–273.

Lieberman, P., Michaels, S.B., 1962. Some aspects of fundamental frequency and envelope amplitude as related to the emotional content of speech. J. Acoust. Soc. Amer. 34, 922–927.

Mahl, G.F., Schulze, G., 1964. Psychological research in the extralinguistic area. In: Sebeok, T.A., Hayes, A.S., Bateson, M.C. (Eds.), Approaches to Semiotics. Mouton, The Hague, pp. 51–124.

Markel, N.N., Bein, M.F., Phillis, J.A., 1973. The relation between words and tone of voice. Lang. Speech 16, 15–21.

Massaro, D.W., 2000. Multimodal emotion perception: Analogous to speech processes. In: Proc. ISCA Workshop on Speech and Emotion, 5–7 September 2000, Newcastle, Northern Ireland, pp. 114–121.

Morris, J.S., Scott, S.K., Dolan, R.J., 1999. Saying it with feeling: Neural responses to emotional vocalizations. Neuropsychologia 37 (10), 1155–1163.

Moses, P., 1954. The Voice of Neurosis. Grune and Stratton, New York.

Morton, J.B., Trehub, S.E., 2001. Children's understanding of emotion in speech. Child Develop. 72 (3), 834–843.

Mozziconacci, S.J.L., 1998. Speech variability and emotion: Production and perception. Ph.D. Thesis, Technical University Eindhoven, Eindhoven.

Murray, I.R., Arnott, J.L., 1993. Toward a simulation of emotion in synthetic speech: A review of the literature on human vocal emotion. J. Acoust. Soc. Amer. 93, 1097–1108.

Murray, I.R., Arnott, J.L., Rohwer, E.A., 1996. Emotional stress in synthetic speech: Progress and future directions. Speech Communication 20 (1–2), 85–91.

Niwa, S., 1971. Changes of voice characteristics in urgent situations. Reports of the Aeromedical Laboratory, Japan Air Self-Defense Force, 11, pp. 246–251.

Ofuka, E., McKeown, J.D., Waterman, M.G., Roach, P.J., 2000. Prosodic cues for rated politeness in Japanese speech. Speech Communication 32 (3), 199–217.

Ostwald, P.F., 1964. Acoustic manifestations of emotional disturbances. In: Disorders of Communication, Vol. 42. Research Publications, pp. 450–465.

Paeschke, A., Kienast, M., Sendlmeier, W.F., 1999. F0-contours in emotional speech. In: Proc. ICPhS 99, San Francisco, Vol. 2, pp. 929–933.

Pear, T.H., 1931. Voice and Personality. Chapman and Hall, London.

Pell, M.D., 1998. Recognition of prosody following unilateral brain lesion: Influence of functional and structural attributes of prosodic contours. Neuropsychologia 36 (8), 701–715.

Pereira, C., Watson, C., 1998. Some acoustic characteristics of emotion. In: Proc. ICSLP 98, Sydney, Vol. 3, pp. 927–934.

Picard, R.W., 1997. Affective Computing. MIT Press, Cambridge, MA.

Pittenger, R.E., Hockett, C.F., Danehy, J.J., 1960. The First Five Minutes: A Sample of Microscopic Interview Analysis. Martineau, Ithaca, NY.

Plaikner, D., 1970. Die Veränderung der menschlichen Stimme unter dem Einfluss psychischer Belastung [Voice changes under the influence of psychological stress]. Ph.D. Thesis, University of Innsbruck, Austria.

Rank, E., Pirker, H., 1998. Generating emotional speech with a concatenative synthesizer. In: Proc. ICSLP 98, Sydney, Vol. 3, pp. 671–674.

Roessler, R., Lester, J.W., 1976. Voice predicts affect during psychotherapy. J. Nerv. Mental Disease 163, 166–176.

Roessler, R., Lester, J.W., 1979. Vocal pattern in anxiety. In: Fann, W.E., Pokorny, A.D., Karacan, I., Williams, R.L. (Eds.), Phenomenology and Treatment of Anxiety. Spectrum, New York.

Roseman, I., Smith, C., 2001. Appraisal theory: Overview, assumptions, varieties, controversies. In: Scherer, K.R., Schorr, A., Johnstone, T. (Eds.), Appraisal Processes in Emotion: Theory, Methods, Research. Oxford University Press, New York and Oxford, pp. 3–19.

Russell, J.A., 1994. Is there universal recognition of emotion from facial expression? A review of the cross-cultural studies. Psychol. Bull. 115, 102–141.

Sangsue, J., Siegwart, H., Grosjean, M., Cosnier, J., Cornut, J., Scherer, K.R., 1997. Développement d'un questionnaire d'évaluation subjective de la qualité de la voix et de la parole, QEV [Development of a questionnaire for the subjective evaluation of voice and speech quality, QEV]. Geneva Stud. Emot. Comm. 11 (1). Available from: <http://www.unige.ch/fapse/emotion/genstudies/genstudies.html>.

Scherer, K.R., 1974. Voice quality analysis of American and German speakers. J. Psycholinguist. Res. 3, 281–290.

Scherer, K.R., 1978. Personality inference from voice quality: The loud voice of extroversion. Europ. J. Soc. Psychol. 8, 467–487.

Scherer, K.R., 1979. Non-linguistic indicators of emotion and psychopathology. In: Izard, C.E. (Ed.), Emotions in Personality and Psychopathology. Plenum Press, New York, pp. 495–529.

Scherer, K.R., 1982a. Methods of research on vocal communication: Paradigms and parameters. In: Scherer, K.R., Ekman, P. (Eds.), Handbook of Methods in Nonverbal Behavior Research. Cambridge University Press, Cambridge, UK, pp. 136–198.

Scherer, K.R. (Ed.), 1982b. Vokale Kommunikation: Nonverbale Aspekte des Sprachverhaltens [Vocal Communication: Nonverbal Aspects of Speech]. Beltz, Weinheim.

Scherer, K.R., 1984. On the nature and function of emotion: A component process approach. In: Scherer, K.R., Ekman, P. (Eds.), Approaches to Emotion. Erlbaum, Hillsdale, NJ, pp. 293–318.

Scherer, K.R., 1985. Vocal affect signalling: A comparative approach. In: Rosenblatt, J., Beer, C., Busnel, M., Slater, P.J.B. (Eds.), Advances in the Study of Behavior. Academic Press, New York, pp. 189–244.

Scherer, K.R., 1986. Vocal affect expression: A review and a model for future research. Psychol. Bull. 99 (2), 143–165.

Scherer, K.R., 1989. Vocal correlates of emotion. In: Wagner, H., Manstead, A. (Eds.), Handbook of Psychophysiology: Emotion and Social Behavior. Wiley, London, pp. 165–197.

Scherer, K.R., 1992a. What does facial expression express? In: Strongman, K. (Ed.), International Review of Studies on Emotion, Vol. 2. Wiley, Chichester, pp. 139–165.

Scherer, K.R., 1992b. On social representations of emotional experience: Stereotypes, prototypes, or archetypes. In: von Cranach, M., Doise, W., Mugny, G. (Eds.), Social Representations and the Social Bases of Knowledge. Huber, Bern, pp. 30–36.

Scherer, K.R., 1994a. Affect bursts. In: van Goozen, S.H.M., van de Poll, N.E., Sergeant, J.A. (Eds.), Emotions: Essays on Emotion Theory. Erlbaum, Hillsdale, NJ, pp. 161–196.

Scherer, K.R., 1994b. Toward a concept of "modal emotions". In: Ekman, P., Davidson, R.J. (Eds.), The Nature of Emotion: Fundamental Questions. Oxford University Press, New York/Oxford, pp. 25–31.

Scherer, K.R., 1999a. Appraisal theories. In: Dalgleish, T., Power, M. (Eds.), Handbook of Cognition and Emotion. Wiley, Chichester, pp. 637–663.

Scherer, K.R., 1999b. Universality of emotional expression. In: Levinson, D., Ponzetti, J., Jorgenson, P. (Eds.), Encyclopedia of Human Emotions. Macmillan, New York, pp. 669–674.

Scherer, K.R., 2000a. Psychological models of emotion. In: Borod, J. (Ed.), The Neuropsychology of Emotion. Oxford University Press, Oxford/New York, pp. 137–162.

Scherer, K.R., 2000b. Emotional expression: A royal road for the study of behavior control. In: Perrig, W., Grob, A. (Eds.), Control of Human Behavior, Mental Processes, and Consciousness. Lawrence Erlbaum Associates, Hillsdale, NJ, pp. 227–244.

Scherer, K.R., Oshinsky, J.S., 1977. Cue utilization in emotion attribution from auditory stimuli. Motiv. Emot. 1, 331–346.

Scherer, K.R., Koivumaki, J., Rosenthal, R., 1972a. Minimal cues in the vocal communication of affect: Judging emotions from content-masked speech. J. Psycholinguist. Res. 1, 269–285.

Scherer, K.R., Rosenthal, R., Koivumaki, J., 1972b. Mediating interpersonal expectancies via vocal cues: Different speech intensity as a means of social influence. Europ. J. Soc. Psychol. 2, 163–176.

Scherer, K.R., London, H., Wolf, J., 1973. The voice of confidence: Paralinguistic cues and audience evaluation. J. Res. Person. 7, 31–44.

Scherer, K.R., Ladd, D.R., Silverman, K.E.A., 1984. Vocal cues to speaker affect: Testing two models. J. Acoust. Soc. Amer. 76, 1346–1356.

Scherer, K.R., Feldstein, S., Bond, R.N., Rosenthal, R., 1985. Vocal cues to deception: A comparative channel approach. J. Psycholinguist. Res. 14, 409–425.

Scherer, K.R., Banse, R., Wallbott, H.G., Goldbeck, T., 1991. Vocal cues in emotion encoding and decoding. Motiv. Emot. 15, 123–148.

Scherer, K.R., Johnstone, T., Sangsue, J., 1998. L'état émotionnel du locuteur: facteur négligé mais non négligeable pour la technologie de la parole [The speaker's emotional state: a neglected but non-negligible factor for speech technology]. Actes des XXIIèmes Journées d'Etudes sur la Parole, Martigny.

Scherer, K.R., Johnstone, T., Klasmeyer, G., Bänziger, T., 2000. Can automatic speaker verification be improved by training the algorithms on emotional speech? In: Proc. ICSLP 2000, Beijing, China.

Scherer, K.R., Banse, R., Wallbott, H.G., 2001a. Emotion inferences from vocal expression correlate across languages and cultures. J. Cross-Cult. Psychol. 32 (1), 76–92.

Scherer, K.R., Schorr, A., Johnstone, T. (Eds.), 2001b. Appraisal Processes in Emotion: Theory, Methods, Research. Oxford University Press, New York and Oxford.

Scherer, K.R., Johnstone, T., Klasmeyer, G., in press. Vocal expression of emotion. In: Davidson, R.J., Goldsmith, H., Scherer, K.R. (Eds.), Handbook of the Affective Sciences. Oxford University Press, New York and Oxford.

Scherer, T., 2000. Stimme, Emotion und Psyche: Untersuchungen zur emotionalen Qualität der Stimme [Voice, emotion, and psyche: Studies on the emotional quality of voice]. Ph.D. Thesis, University of Marburg, Germany.

Scripture, E.W., 1921. A study of emotions by speech transcription. Vox 31, 179–183.

Sedlacek, K., Sychra, A., 1963. Die Melodie als Faktor des emotionellen Ausdrucks [Speech melody as a factor of emotional expression]. Folia Phoniatrica 15, 89–98.

Shipley-Brown, F., Dingwall, W., Berlin, C., Yeni-Komshian, G., Gordon-Salant, S., 1988. Hemispheric processing of affective and linguistic intonation contours in normal subjects. Brain Lang. 33, 16–26.

Siegman, A.W., 1987a. The telltale voice: Nonverbal messages of verbal communication. In: Siegman, A.W., Feldstein, S. (Eds.), Nonverbal Behavior and Communication, second ed. Erlbaum, Hillsdale, NJ, pp. 351–433.

Siegman, A.W., 1987b. The pacing of speech in depression. In: Maser, J.D. (Ed.), Depression and Expressive Behavior. Erlbaum, Hillsdale, NJ, pp. 83–102.

Simonov, P.V., Frolov, M.V., 1973. Utilization of human voice for estimation of man's emotional stress and state of attention. Aerospace Med. 44, 256–258.

Skinner, E.R., 1935. A calibrated recording and analysis of the pitch, force and quality of vocal tones expressing happiness and sadness. Speech Monogr. 2, 81–137.

Sobin, C., Alpert, M., 1999. Emotion in speech: The acoustic attributes of fear, anger, sadness, and joy. J. Psycholinguist. Res. 28 (4), 347–365.

Starkweather, J.A., 1956. Content-free speech as a source of information about the speaker. J. Pers. Soc. Psychol. 35, 345–350.

Steinhauer, K., Alter, K., Friederici, A., 1999. Brain potentials indicate immediate use of prosodic cues in natural speech processing. Nature Neurosci. 2, 191–196.

Stemmler, G., Heldmann, M., Pauls, C.A., Scherer, T., 2001. Constraints for emotion specificity in fear and anger: The context counts. Psychophysiology 38 (2), 275–291.

Sulc, J., 1977. To the problem of emotional changes in the human voice. Activitas Nervosa Superior 19, 215–216.

Tischer, B., 1993. Die vokale Kommunikation von Gefühlen [The Vocal Communication of Emotions]. Psychologie-Verlags-Union, Weinheim.

Titze, I., 1992. Acoustic interpretation of the voice range profile (phonetogram). J. Speech Hear. Res. 35, 21–34.

Tolkmitt, F.J., Scherer, K.R., 1986. Effects of experimentally induced stress on vocal parameters. J. Exp. Psychol.: Hum. Percept. Perform. 12, 302–313.

Tolkmitt, F., Bergmann, G., Goldbeck, Th., Scherer, K.R., 1988. Experimental studies on vocal communication. In: Scherer, K.R. (Ed.), Facets of Emotion: Recent Research. Erlbaum, Hillsdale, NJ, pp. 119–138.

Tomkins, S.S., 1962. Affect, Imagery, Consciousness. The Positive Affects, Vol. 1. Springer, New York.

Trager, G.L., 1958. Paralanguage: A first approximation. Stud. Linguist. 13, 1–12.

Tusing, K.J., Dillard, J.P., 2000. The sounds of dominance: Vocal precursors of perceived dominance during interpersonal influence. Hum. Comm. Res. 26 (1), 148–171.

Utsuki, N., Okamura, N., 1976. Relationship between emotional state and fundamental frequency of speech. Reports of the Aeromedical Laboratory, Japan Air Self-Defense Force, 16, 179–188.

van Bezooijen, R., 1984. The Characteristics and Recognizability of Vocal Expression of Emotions. Foris, Dordrecht, The Netherlands.

van Bezooijen, R., Otto, S., Heenan, T.A., 1983. Recognition of vocal expressions of emotions: A three-nation study to identify universal characteristics. J. Cross-Cult. Psychol. 14, 387–406.

Wagner, H.L., 1993. On measuring performance in category judgment studies on nonverbal behavior. J. Nonverb. Behav. 17 (1), 3–28.

Wallbott, H.G., Scherer, K.R., 1986. Cues and channels in emotion recognition. J. Pers. Soc. Psychol. 51, 690–699.

Wehrle, T., Kaiser, S., Schmidt, S., Scherer, K.R., 2000. Studying dynamic models of facial expression of emotion using synthetic animated faces. J. Pers. Soc. Psychol. 78 (1), 105–119.

Westermann, R., Spies, K., Stahl, G., Hesse, F.W., 1996. Relative effectiveness and validity of mood induction procedures: A meta-analysis. Europ. J. Soc. Psychol. 26 (4), 557–580.

Whiteside, S.P., 1999. Acoustic characteristics of vocal emotions simulated by actors. Perc. Mot. Skills 89 (3, Part 2), 1195–1208.

Williams, C.E., Stevens, K.N., 1969. On determining the emotional state of pilots during flight: An exploratory study. Aerospace Med. 40, 1369–1372.

Williams, C.E., Stevens, K.N., 1972. Emotions and speech: Some acoustical correlates. J. Acoust. Soc. Amer. 52, 1238–1250.

Wundt, W., 1874/1905. Grundzüge der physiologischen Psychologie [Fundamentals of Physiological Psychology, orig. pub. 1874], fifth ed. Engelmann, Leipzig.

Zuberbier, E., 1957. Zur Schreib- und Sprechmotorik der Depressiven [On the motor aspects of writing and speaking in depressives]. Z. Psychother. Medizin. Psychol. 7, 239–249.

Zwicker, E., 1982. Psychoacoustics. Springer, New York.

Zwirner, E., 1930. Beitrag zur Sprache des Depressiven [Contribution on the speech of depressives]. Phonometrie III: Spezielle Anwendungen I. Karger, Basel, pp. 171–187.
