Contextually dependent cue realization and cue weighting for a laryngeal contrast in Shanghai Wu a)
Jie Zhang b)
Department of Linguistics, University of Kansas, 1541 Lilac Lane, Lawrence, Kansas 66045, USA
Hanbo Yan
School of Chinese Studies and Exchange, Shanghai International Studies University, Shanghai 200083, China
(Received 2 November 2017; revised 3 August 2018; accepted 16 August 2018; published online 11 September 2018)
Phonological categories are often differentiated by multiple phonetic cues. This paper reports a pro-
duction and perception study of a laryngeal contrast in Shanghai Wu that is not only cued in multi-
ple dimensions, but also cued differently on different manners (stops, fricatives, sonorants) and in
different positions (non-sandhi, sandhi). Acoustic results showed that, although this contrast has
been described as phonatory in earlier literature, its primary cue is in tone in the non-sandhi con-
text, with vowel phonation and consonant properties appearing selectively for specific manners of
articulation. In the sandhi context where the tonal distinction is neutralized, these other cues may
remain depending on the manner of articulation. Sonorants, in both contexts, embody the weakest
cues. The perception results were largely consistent with the aggregate acoustic results, indicating
that speakers adjust the perceptual weights of individual cues for a contrast according to manner
and context. These findings support the position that phonological contrasts are formed by the integra-
tion of multiple cues in a language-specific, context-specific fashion and should be represented as such. © 2018 Acoustical Society of America. https://doi.org/10.1121/1.5054014
[MS] Pages: 1293–1308
I. INTRODUCTION
A standard assumption about phonological contrast is
that it is categorical, based on either segments (/p/ vs /b/) or
features ([−voice] for /p/, [+voice] for /b/; Jakobson et al., 1952; Chomsky and Halle, 1968; Stevens, 2002; Clements,
2009). A major challenge for phoneticians and phonologists
alike is to account for how speakers categorize gradient and
variable acoustic signals into such discrete entities. Two
salient aspects of this challenge relate to how featural con-
trasts are instantiated acoustically. First, contrasts are often
differentiated by multiple acoustic cues. The stop voicing
contrast in English, for example, is associated with differ-
ences in voice-onset time (VOT), closure duration, f0 of the
following vowel, and a host of other acoustic properties
(Lisker, 1986). Second, the acoustic cues for the same con-
trast often depend on the phonological context in which the
contrast appears. For instance, the English voicing contrast
would not benefit from the f0 cue of the following vowel in
the final position, but would benefit from a duration differ-
ence on the vowel preceding it (Chen, 1970; Raphael, 1972).
The investigations of how a contrast is acoustically realized
in a multidimensional fashion, how the different acoustic
cues are weighted in the perception of the contrast, and how
the weighting is affected by the acoustic dimensions along
which the cues vary, the distributional characteristics of the
acoustic cues, the context in which the contrast appears, and
the listeners’ language background have contributed to sig-
nificant theoretical issues in phonetics and phonology, such
as the mode of speech perception (Repp, 1983; Parker et al.,
1986; Massaro, 1987), the nature of distinctive features
(Halle and Stevens, 1971; Kingston, 1992; Stevens and
Keyser, 2010), the production-perception link (Newman,
2003; Shultz et al., 2012; DiCanio, 2014), the influence of
phonological knowledge of a language on perception
(Massaro and Cohen, 1983; Flege and Wang, 1989; Dupoux
et al., 1999; Hallé and Best, 2007), the theories of perceptual
contribution of secondary cues (Holt et al., 2001; Francis
et al., 2008; Kingston et al., 2008; Llanos et al., 2013), and
the mechanisms of phonetic category learning (Clayards
et al., 2008; Toscano and McMurray, 2010; McMurray
et al., 2011).
This paper contributes to this scholarship by presenting
a case study on the cue realization and cue weighting of a
laryngeal contrast on different segments in different contexts
in Shanghai Wu. Like many Wu dialects of Chinese,
Shanghai has a three-way distinction among voiceless aspi-
rated, voiceless unaspirated, and voiced stops. The voiced
series, however, is not realized with typical closure voicing,
but is known as “voiceless with voiced aspiration” (Chao,
1967), indicating the involvement of breathy phonation. On
fricatives, there is a two-way voicing contrast, whereby the
voiced fricatives are truly voiced, and on sonorants, there is
a modal-murmured distinction that corresponds to the
a) Portions of this work were presented at the 18th International Congress of
Phonetic Sciences, Glasgow, Scotland, UK; the 89th annual meeting of the
Linguistic Society of America, Portland, OR; and the 22nd annual meeting
of the International Association of Chinese Linguistics in conjunction with
the 26th North American Conference on Chinese Linguistics, College Park, MD.
b) Electronic mail: [email protected]
J. Acoust. Soc. Am. 144 (3), September 2018  0001-4966/2018/144(3)/1293/16/$30.00  © 2018 Acoustical Society of America  1293
voiceless-voiced distinction in obstruents (Chao, 1967; Xu
and Tang, 1988; Zhu, 1999, 2006).
Shanghai Wu, like other Chinese dialects, is also tonal.
There are three phonetic tones on open or sonorant-closed
syllables, transcribed as 53, 34, and 13, and two phonetic
tones on ʔ-closed syllables, 55 and 12. But there is a co-
occurrence restriction between tones and onset laryngeal fea-
tures in that the higher tones 53, 34, and 55 only occur on
syllables with voiceless obstruent or modal sonorant onsets,
and the lower tones only occur with phonologically voiced
obstruent or murmured sonorant onsets (Xu and Tang, 1988;
Zhu, 1999, 2006). Therefore, in Shanghai, there is a minimal
contrast between tO34 “to arrive” and dO13 “news,” and this
contrast is cued by both the voice quality of the initial conso-
nant and f0. The examples in Table I illustrate the co-
occurrence of the two rising tones 34 and 13 with the laryn-
geal features in Shanghai.
Tones in connected speech are affected by a tone change
process called tone sandhi in Shanghai. Polysyllabic com-
pound words undergo a rightward spreading tone sandhi pro-
cess by extending the tone on the first syllable over the
entire compound domain and consequently wiping out the
tonal contrasts in non-initial syllables (Zee and Maddieson,
1980; Xu and Tang, 1988; Zhu, 1999, 2006). For example,
tO34 “to arrive” and dO13 “news,” when appearing as the
second syllable of a disyllabic compound, are reported to
lose their tonal difference, as shown in the following exam-
ples: /pO34-tO34/ → [pO33-tO44] “check-in”; /pO34-dO13/ → [pO33-dO44] “news report.” The voicing difference between
the onset consonants on the second syllable, however,
remains, and the voiced stops have been reported to have clo-
sure voicing in this position (Cao and Maddieson, 1992; Ren,
1992; Shen and Wang, 1995; Chen, 2011; Wang, 2011; Gao,
2015; Gao and Hallé, 2017).
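The rightward spreading pattern described above can be sketched as a toy function (our illustration, not the authors' formalism; the pitch values follow the cited examples, and only the disyllabic case is modeled):

```python
def spread_tone(base_tones):
    """Toy sketch of Shanghai rightward tone spreading in a disyllabic
    compound: the first syllable's base tone (e.g., '34') is split over
    the whole domain, and the base tones of non-initial syllables are
    deleted, so /pO34-tO34/ and /pO34-dO13/ both surface as [33-44]."""
    onset, offset = base_tones[0][0], base_tones[0][-1]
    # level out each half of the tone over its syllable: 34 -> ['33', '44']
    return [onset * 2, offset * 2]
```

Applying it to the examples above, spread_tone(['34', '34']) and spread_tone(['34', '13']) both yield ['33', '44'], wiping out the tonal contrast on the second syllable while leaving the onset laryngeal contrast intact.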
The data pattern in Shanghai, therefore, presents a clear
example in which a phonological contrast is realized differ-
ently on different manners and different positions: stops, fri-
catives, and sonorants can all carry the contrast, but via
different sets of cues; the monosyllabic context is significant
in that it is the only context in which the phonation-tone co-
occurrence, as illustrated in Table I, is fully manifested, while
the second syllable of disyllables constitutes a position
where the cues for the contrast are considerably altered by a
tone sandhi process. We specifically focus on the contrast
between voiceless unaspirated/modal and voiced/murmured
consonants co-occurring with a high-rising and a low-rising
tone, respectively (e.g., tO34 vs dO13; me34 vs m̤e13). As we
review in Sec. I B below, although previous studies have
established the multidimensional nature of this contrast, as
well as the fact that the cues for the contrast vary by prosodic
position, no study has expressly compared the realization of
cues in different manners or studied how the cues are
weighted in perception across manners and positions. This
study aims to achieve these goals. In so doing, it has the
potential to make the following unique contributions. First,
previous studies on the perceptual contributions of voicing
and f0 of a contrast have primarily been conducted on non-
tone languages like English and Spanish, and in these lan-
guages, voicing has been found to be the primary cue
(Abramson and Lisker, 1985; Shultz et al., 2012; Llanos
et al., 2013). Shanghai, being from a tone-language family,
could work in the opposite way with tone as a primary cue
and voicing/voice quality a secondary cue, similar to
Southern Vietnamese (Brunelle, 2009) and Eastern Cham
(Brunelle, 2012). This provides an opportunity to observe
the influence of language background on how cues are
weighted and the limit and potential reasons for the primacy
of a particular cue (see also Francis et al., 2008; Llanos
et al., 2013). Second, the positional dependency of the realization
of this contrast results from not only the position per se,
but also a phonological alternation process that, at least
according to the descriptive literature, categorically neutral-
izes one of the cues (tone) in the non-initial context. This
puts the context scenario here, phonologically, between full
realization (e.g., voicing in final position in English) and full
neutralization (e.g., manner contrast in final position in
Korean; Kim and Jongman, 1996) and allows it to contribute
to the large literature on incomplete neutralization (e.g.,
Dinnsen and Charles-Luce, 1984; Port and Crawford, 1989;
Warner et al., 2004; Dmitrieva et al., 2010). Third, phonetic
studies of phonation have primarily focused on vowels (e.g.,
Huffman, 1987; Andruski and Ratliff, 2000; Blankenship,
2002; Wayland and Jongman, 2003; Esposito, 2010a, 2012;
Khan, 2012) and obstruent consonants (e.g., Davis, 1994;
Mikuteit and Reetz, 2007; Dutta, 2009; Berkson, 2016a);
studies on sonorant consonant phonation (e.g., Aoki, 1970;
Traill and Jackson, 1988; Berkson, 2016b) are relatively
rare, presumably due to their typological rarity and the weak
acoustic cues they embody (Berkson, 2016b). Shanghai fur-
nishes an example that has a laryngeal contrast in both
obstruents and sonorants, and thus provides a rare venue to
compare the acoustics and perception of the contrast on the
two types of segments.
A. Acoustic correlates of breathiness
During the production of breathy phonation, the vocal
folds are in a relatively abducted configuration with low lon-
gitudinal tension. Articulatorily, this results in a higher open
quotient of the glottal cycle and a less abrupt glottal closing
gesture; aerodynamically, the increased airflow volume and
the loose vibratory mode of the vocal fold cause turbulence
noise at the glottis, which gives the auditory perception of
breathy voice (Gordon and Ladefoged, 2001).
A host of acoustic parameters that result from these
articulatory and aerodynamic properties have been identified
TABLE I. Examples of laryngeal and tone co-occurrence restrictions in
Shanghai. Voiceless obstruents or modal sonorants co-occur with the high-
rising tone 34; voiced obstruents or murmured sonorants co-occur with the
low-rising tone 13.

        Stops               Fricatives    Sonorants
34      pu34 “cloth”        fi34 “fee”    me34 “beautiful”
        phu34 “tattered”
13      bu13 “division”     vi13 “fat”    m̤e13 “plum”
in the literature. In terms of spectral measures, Klatt and
Klatt (1990) and Holmberg et al. (1995) showed that a
higher open quotient correlates with a greater difference
between the amplitude of the first two harmonics (H1-H2),
and Stevens (1977) and Hanson et al. (2001) demonstrated
that the more gradual glottal closure results in a steeper spec-
tral tilt that can be measured by the amplitude differences
between H1 and the harmonics nearest the first three formant peaks (H1-A1, H1-A2, H1-A3). In terms of
periodicity measures, Hillenbrand et al. (1994) advocated
the use of cepstral-peak prominence (CPP), a measure of
peak harmonic amplitude adjusted for the overall amplitude,
of which breathy phonation is expected to have lower values
than modal phonation; the harmonics-to-noise ratio (HNR)
has also been used, with breathy phonation having lower
HNR values (de Krom, 1993). In studies of phonological
breathiness crosslinguistically, these measures have often
been shown to be relevant acoustic and perceptual correlates.
For instance, increased H1-H2 and spectral tilt measures
have been found to be acoustic correlates of breathy vowels
in Hmong (Huffman, 1987; Andruski and Ratliff, 2000;
Esposito, 2012; Garellek et al., 2013), Khmer (Wayland and
Jongman, 2003), Juǀ'hoansi (Miller, 2007), Hindi (Dutta,
2009), Gujarati (Khan, 2012), Jalapa Mazatec (Blankenship,
2002; Esposito, 2010b; Garellek and Keating, 2011), and
Santa Ana del Valle Zapotec (Esposito, 2010a). Esposito
(2010b) and Garellek et al. (2013), in addition, found that
these measures directly contribute to the perception of
breathiness. Lower CPP values have been found for breathy
vowels in Jalapa Mazatec (Blankenship, 2002; Garellek and
Keating, 2011), White Hmong (Esposito, 2012), and
Gujarati (Khan, 2012). Lower HNR values were found for
breathy vowels in Juǀ'hoansi (Miller, 2007), but not in
Khmer (Wayland and Jongman, 2003).
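The two spectral and periodicity measures most relevant below, H1-H2 and CPP, can be sketched from their definitions as follows (a minimal numpy illustration, not the implementations used in the studies cited; in particular, it omits the formant-based correction that yields the H1*-H2* measure used later in this paper):

```python
import numpy as np

def h1_h2(signal, sr, f0):
    """H1-H2 (dB): amplitude of the first harmonic minus that of the
    second, read off the windowed magnitude spectrum near f0 and 2*f0."""
    spec = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
    freqs = np.fft.rfftfreq(len(signal), 1.0 / sr)

    def harmonic_db(target_hz):
        band = (freqs > 0.8 * target_hz) & (freqs < 1.2 * target_hz)
        return 20 * np.log10(spec[band].max())

    return harmonic_db(f0) - harmonic_db(2 * f0)

def cpp(signal, sr, f0_range=(60, 400)):
    """Cepstral peak prominence (dB): height of the cepstral peak, within
    the plausible f0 quefrency range, above a regression line through the
    cepstrum of the dB spectrum."""
    spec_db = 20 * np.log10(
        np.abs(np.fft.fft(signal * np.hanning(len(signal)))) + 1e-12)
    cep = np.fft.ifft(spec_db).real          # real cepstrum of dB spectrum
    q = np.arange(len(cep)) / sr             # quefrency axis (s)
    win = (q >= 1.0 / f0_range[1]) & (q <= 1.0 / f0_range[0])
    peak = np.argmax(cep[win]) + np.argmax(win)
    slope, intercept = np.polyfit(q[win], cep[win], 1)
    return cep[peak] - (slope * q[peak] + intercept)
```

On this logic, breathy phonation is expected to yield higher h1_h2 values (stronger first harmonic) and lower cpp values (weaker, noisier periodicity) than modal phonation on the same vowel.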
Duration measures have also been found to correlate
with breathiness. For stops, breathy stops have shorter clo-
sure durations than their plain counterparts in Bengali
(Mikuteit and Reetz, 2007), Hindi (Dutta, 2009), and
Marathi (Berkson, 2016a), and the shorter closure duration
of voiced stops compared to voiceless stops is well known
(e.g., Lisker, 1986).1 For fricatives, Jongman et al. (2000)
showed that voiced fricatives generally have shorter frication
duration than their voiceless counterparts. The duration pat-
tern for sonorant phonation is scantily documented, but there
is some evidence that breathy sonorants tend to be longer
than their modal counterparts, as reported for Marathi
(Berkson, 2013).
Finally, the phonological co-occurrence between
breathy phonation and lower tones found in Shanghai is
attested elsewhere as well, e.g., in Santa Ana del Valle
Zapotec (Esposito, 2010a) and Hmong (Andruski and
Ratliff, 2000; Esposito, 2012). This may be rooted in the
general f0 lowering effect of breathiness (Laver, 1980;
Gordon and Ladefoged, 2001), which has been well attested,
e.g., in Khmu’ (Abramson et al., 2007), Hindi (Dutta, 2009),
and Marathi (Berkson, 2013). But whether this effect is a
phonetic universal remains controversial, as there are studies
that have shown either an f0 raising effect (Wayland and
Jongman, 2003, for Khmer) or the lack of an f0 correlate
(Garellek and Keating, 2011, for Jalapa Mazatec) for
breathiness.
B. Previous research on the phonation–tone interaction in Shanghai Wu
As previously stated, existing literature on phonation–
tone interaction in Shanghai has firmly established that the
cues for the laryngeal contrast of interest here are multidi-
mensional in both non-sandhi and sandhi positions. Cao and
Maddieson (1992) showed that for syllables in isolation, i.e.,
the non-sandhi context, H1-H2 and H1-A1 were signifi-
cantly higher at vowel onset after the voiced stop than after
the voiceless unaspirated stop, but the differences disap-
peared at the mid and end points of the vowel; for syllables
in the sandhi context (e.g., second syllable in disyllables),
only the H1-H2 difference remained at vowel onset, and the
magnitude of the difference was smaller; but the voiced
stops were “phonetically voiced.” The acoustic study by Ren
(1992) also showed tapering H1-H2 and H1-A1 differences
on the vowel after voiced and voiceless unaspirated stops in
the non-sandhi position; but in the sandhi position, Ren
found an H1-A1 difference instead of an H1-H2 difference.
Ren (1992) also conducted a perception study in which H1-
H2 was varied in ten steps and f0 in three steps on the initial
portion of the vowel after a stop in the sandhi position
(ɦa13–ta34 “shoelace” to ɦa13–da13 “shoe (is) big”). Results
showed that both H1-H2 and f0 had an effect on the percep-
tion of the second syllable: the /d/ response was more likely
with a higher H1-H2; a raised f0 shifted response toward /t/,
while a lowered f0 shifted the response toward /d/. Shen and
Wang (1995) focused on the roles of the closure and release
durations of the stop as the acoustic correlates of stop voic-
ing. They showed that, although the two types of stops did
not differ in their release duration (duration between the stop
burst and the beginning of vowel periodicity), the voiceless
stops had a significantly longer closure duration than the
voiced stops in both initial and medial positions, and the
voiced stops had closure voicing medially. The acoustic
study by Wang (2011) returned similar duration results to
Shen and Wang’s except that she did not find a closure dura-
tion difference based on voicing in the initial position. In a
series of perception studies that manipulated closure dura-
tion and f0, Wang showed that when restricting the tones to
the two rising tones 34 (co-occurs with voiceless) and 13
(co-occurs with voiced), f0 was the primary perceptual cue
for the contrast in initial position; in the medial position,
both f0 and closure duration were used perceptually for the
contrast, but closure voicing was not. Chen (2011) focused
on the f0 perturbation effect from the stop voicing contrast in
the sandhi context and found that the effect was minimal,
and that its size was partly determined by the underlying
tone of the preceding syllable. Chen argued that these pat-
terns potentially serve the purpose of maximizing the tonal
contrast on the preceding syllable, which determines the
pitch contour of the entire sandhi domain; therefore, the f0 perturbation here is speaker controlled, at least in part. For
H1-H2, Chen only found the expected difference in the /o/
context, with the voiced stops inducing greater H1-H2; for
the /i/ context, the effect was the reverse.
In a dissertation (Gao, 2015) and a series of related pub-
lications (Gao and Hallé, 2013, 2015, 2016, 2017), Gao and
Hallé presented the most comprehensive study of Shanghai
phonation–tone interaction to date. Their acoustic investiga-
tion included all three manners (stops, fricatives, nasals) as
onsets in monosyllables as well as both syllables of disyl-
lables. In terms of duration, a consonant-vowel (CV) syllable
with a voiceless fricative onset had a longer consonant and a
shorter vowel than one with a corresponding voiced fricative
onset (Gao, 2015); voiced stops had a significantly longer
VOT than voiceless unaspirated stops by around 2–4 ms
(Gao, 2015; Gao and Hallé, 2017). In terms of voicing in the
initial position, voiced stops rarely had voicing, while voiced
fricatives had voicing ratios (percentages of consonant dura-
tion being voiced) of around 30%–40%; in medial position,
voiced stops and fricatives had over 90% voicing ratios,
compared to around 20%–30% for voiceless ones (Gao,
2015; Gao and Hallé, 2017). For spectral and periodicity
measures, they showed that for monosyllables, H1-H2, H1-
A1, and H1-A2 were generally higher, while CPP was gener-
ally lower following voiced/murmured onsets than voiceless/
modal ones, but the differences were the greatest and the
most consistent for elder male speakers; linear discriminant
analyses (LDAs) showed that H1-H2 was the most consistent
cue across age and gender groups and in different tonal con-
texts. Only H1-H2 results were reported for the two syllables
in disyllables. Results showed that for the first syllable, H1-
H2 was higher after voiced/murmured onsets, but the differ-
ence was less clear-cut than in monosyllables; for the second
syllable, no H1-H2 difference based on the voicing differ-
ence was found (Gao, 2015; Gao and Hallé, 2017).
Perceptually, two experiments were conducted to investigate
the effect of duration and voicing patterns on the identifica-
tion of the laryngeal contrast. The first experiment created
“congruent” and “incongruent” monosyllabic stimuli by
imposing the f0 of one CV onto another when the two onsets
differed in voicing, and the results showed that the congru-
ence factor significantly affected the accuracy and reaction
time of tone identification when the onsets were labial frica-
tives, which had the largest voicing difference. The second
experiment created tonal continua between the two rising
tones on both the long C-short V and short C-long V dura-
tion patterns and showed that the duration pattern shifted the
listeners’ identification response toward the category with
that duration pattern, and the incongruence between tone and
duration pattern slowed down the reaction time (Gao and
Hallé, 2013; Gao, 2015). An additional experiment was car-
ried out to investigate the effect of voice quality on percep-
tion. Tonal continua were again created between the two
rising tones and imposed onto modal and breathy syllables
(both synthesized and naturally produced modal and breathy
syllables were used). Identification results showed that the
voice quality of the syllable shifted the listeners’ identifica-
tion response toward the category with that phonation type,
and the incongruence between tone and phonation slowed
down the reaction time with the exception of naturally
produced tokens with nasal onsets (Gao and Hallé, 2015;
Gao, 2015).
With the exception of the work of Gao and Hallé, the
previous studies only investigated a subset of the cues for
stops. But even in the studies by Gao and Hallé, there was
no direct comparison among the different manners, and their
perception studies were restricted to monosyllables. In the
present work, the goal is to provide a comprehensive look at
the acoustic realization and perception of the contrast
between voiceless unaspirated/modal and voiced/murmured
consonants co-occurring with a high-rising and a low-rising
tone, respectively, across different manners (e.g., tO34 vs
dO13; fi34 vs vi13; me34 vs m̤e13) and different contexts
(sandhi, non-sandhi) using a consistent set of methods, and con-
sequently shed light on the language- and context-dependent
nature of contrast realization and perceptual cue weighting,
especially when a phonological alternation process is
involved, as well as the production-perception link. In Secs.
II and III, a production study and a perception study con-
ducted to this end are reported.
II. EXPERIMENT 1: PRODUCTION STUDY
A. Methods
Thirteen monosyllabic voiceless/modal vs voiced/mur-
mured minimal pairs were used for the non-sandhi context
(six stop pairs, four fricative pairs, three sonorant pairs); all
voiceless/modal syllables occurred with the high rising tone
34 and all voiced/murmured syllables with the low rising tone
13 (e.g., pu34 and bu13). The same pairs were then used as the
second syllable of disyllabic compounds with matched first
syllable for the sandhi context (e.g., fən53-pu34 and fən53-bu13). Both the monosyllabic and disyllabic words were
embedded in the carrier sentence ŋu34 ɕja34 __ gəʔ12 əʔ55
zɿ13 “I write the character/word ___.” The reason the target
stimuli were put in sentence-medial position was to allow the
measurement of closure duration for onset stops, as duration
has been more consistently shown as a perceptual cue for the
contrast in previous studies (Wang, 2011; Gao and Hall�e,
2013; Gao, 2015). The trade-off, however, is that this creates
an environment that may also facilitate consonant voicing for
the voiced obstruents even for the monosyllables. Tone sandhi
(or lack thereof) on the target words, however, is not expected
to be affected by the sentential context, as the preceding verb
ɕja34 “to write” and the following demonstrative gəʔ12 “this”
do not belong to the same prosodic word as the target. The
full word list is given in Table II.
Ten native speakers (5 male, 5 female) with an age range
of 19–30 and a mean age of 25 were recorded in a quiet room
in Shanghai using an Electro-Voice N/D767 cardioid micro-
phone (Burnsville, MN) and a Marantz portable solid state
recorder (PMD 671, Cumberland, RI). Each of them read the
stimuli twice. Subsequent measurements for the two repeti-
tions were averaged before the statistical analyses.
Consonant durations were measured in Praat (Boersma
and Weenink, 2012) by the second author. The duration for
stops was the closure duration and was measured from the
end of the previous syllable to the stop release. For fricatives
and sonorants, the segments themselves were identified from
the spectrograms and their durations measured. Durations
were analyzed with linear mixed-effects models, with the
laryngeal feature (referred to as voicing for brevity below) as
a fixed effect and subject and item as random effects. P-values
were calculated using the lmerTest package in R (Kuznetsova
et al., 2016). Monosyllables and disyllables were analyzed
separately. Stops and fricatives were classified as “voiced”
or “voiceless” depending on whether 50% or more of the
consonant duration (closure for stops, frication duration for
fricatives) had voicing, as determined from the waveforms
and spectrograms in Praat by the first author.
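The 50% voicing criterion amounts to a simple threshold on the voicing ratio; a sketch (the frame-level voicing decisions are assumed to come from waveform and spectrogram inspection, as described above, or from a pitch tracker):

```python
def laryngeal_category(voiced_frames, threshold=0.5):
    """Classify a consonant as 'voiced' if at least `threshold` of its
    frames (closure frames for stops, frication frames for fricatives)
    carry voicing; returns the label and the voicing ratio."""
    ratio = sum(voiced_frames) / len(voiced_frames)
    label = "voiced" if ratio >= threshold else "voiceless"
    return label, ratio
```

For example, a stop whose closure is voiced in three of five analysis frames gets laryngeal_category([1, 1, 1, 0, 0]), i.e., ("voiced", 0.6).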
The spectral measure H1*-H2* (corrected H1-H2 based
on the frequencies and bandwidths of formants; Shue et al.,
2011) and the periodicity measure CPP were selected to esti-
mate the breathiness induced by the contrast.3 H1*-H2* and
CPP values were measured every millisecond in VoiceSauce
v1.12 (Shue et al., 2011), and the measurements over every
9.1% of the vowel duration were averaged, yielding 11 data
points for each vowel for statistical analysis. The Snack
Sound Toolkit (Sjölander, 2004) was used by VoiceSauce to
find the frequencies and bandwidths of the formants with the
covariance method, a pre-emphasis of 0.96, and a window
length of 25 ms with a frame shift of 1 ms. Fundamental fre-
quencies were measured at 10% intervals during the vowel
using the ProsodyPro Praat script (Xu, 2005–2013). The
Maxf0 and Minf0 parameters in the script, as well as the
octave-jump cost, were adjusted for each speaker, and the f0 measurements were manually checked by the second author
against pitch tracks and narrowband spectrograms in Praat to
correct any measurement errors by the script. The f0 values in
Hz were then converted into semitones and z-scored. Growth
curve analyses (Mirman, 2014) were conducted on the H1*-
H2*, CPP, and f0 curves over the vowel using third-order
(cubic) orthogonal polynomials. The models were built up
from the base model that only included subject, item, and sub-
ject-by-voicing random effects. Voicing and its interaction
with the time terms were subsequently added step-wise, and
their effects on model fit were evaluated using log-likelihood
model comparison. Parameter estimates for the full model
were then tested for significance using t-tests, and p-values
were again estimated by the lmerTest package. Different man-
ners and different positions were analyzed separately, and the
voiceless/modal category was used as the baseline. H1*-H2*
and CPP were similarly compared for sonorant consonants,
but the measurements were averaged over every 20% of the
sonorant duration, yielding only five data points for each
sonorant. All statistical analyses were performed using the
lme4 package (Bates et al., 2015) in R (R Core Team, 2014).
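The f0 post-processing and the time terms entering the growth curve models can be sketched as follows (a numpy illustration of the steps described above; the actual analyses used ProsodyPro and R, and the z-scoring is shown here over a single vector, with the within-speaker grouping left implicit):

```python
import numpy as np

def semitones(f0_hz, ref_hz=100.0):
    """Hz -> semitones relative to a fixed reference; the reference choice
    is arbitrary here, since it drops out after z-scoring."""
    return 12.0 * np.log2(np.asarray(f0_hz, dtype=float) / ref_hz)

def zscore(x):
    """Standardize to mean 0, standard deviation 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def orthogonal_time_terms(n_points, degree=3):
    """Orthogonal polynomial time terms (linear, quadratic, cubic),
    analogous to R's poly(), for growth curve analysis over n_points
    samples of the vowel (11 in the present study)."""
    t = np.arange(n_points, dtype=float)
    V = np.vander(t, degree + 1, increasing=True)  # columns 1, t, t^2, t^3
    Q, _ = np.linalg.qr(V)                         # orthonormalize columns
    return Q[:, 1:]                                # drop the constant term
```

The orthogonalization matters for the model comparisons reported below: because the linear, quadratic, and cubic columns are mutually uncorrelated, the voicing-by-time interactions can be added step-wise without one term absorbing variance from another.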
To investigate the relative contribution of the different
acoustic cues in the laryngeal contrast for each manner in
monosyllables and disyllables, LDAs were conducted to
explore the extent to which the laryngeal category can be
predicted from the acoustic cues. The greedy.wilks function
in the klaR package (Weihs et al., 2005) in R was used to
conduct stepwise forward variable selection for significant
predictors (p < 0.05), and the lda function in the MASS
package (Venables et al., 2002) was used to derive the coef-
ficients for the variables for the linear discriminant functions.
The overall Wilks’s lambda values (from 0 to 1) for the dis-
crimination (0 means total discrimination, 1 means no dis-
crimination), as well as their F and p values, were calculated
using the manova function (see also Gao, 2015).
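The discrimination logic can be illustrated in Python with simulated data (a sketch only: the study itself used R's klaR, MASS, and manova functions, and the stepwise variable selection performed by greedy.wilks is omitted here):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def wilks_lambda(X, y):
    """Wilks's lambda = |pooled within-group scatter| / |total scatter|;
    0 means total discrimination of the categories, 1 means none."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    total = np.atleast_2d(np.cov(X.T)) * (len(X) - 1)
    within = sum(np.atleast_2d(np.cov(X[y == g].T)) * (np.sum(y == g) - 1)
                 for g in np.unique(y))
    return np.linalg.det(within) / np.linalg.det(total)

# Two well-separated laryngeal "categories" in a toy two-cue space:
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)),   # e.g., voiceless tokens
               rng.normal(5.0, 1.0, (50, 2))])  # e.g., voiced tokens
y = np.repeat(["voiceless", "voiced"], 50)
lda = LinearDiscriminantAnalysis().fit(X, y)    # linear discriminant function
```

With cues this well separated, wilks_lambda(X, y) is close to 0 and the LDA classifies the training tokens nearly perfectly; as the acoustic cues for a manner or position weaken, lambda approaches 1 and classification accuracy falls toward chance.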
B. Results
1. Duration and voicing measures
The consonant duration results are given in Fig. 1. For
both the monosyllables and the second syllable of disyl-
lables, the best model included the interaction between voic-
ing and manner. An analysis with voicing nested under
manner as fixed effects for monosyllables and disyllables,
respectively, was then conducted to get voicing estimates for
the different manners in the same model. For monosyllables,
the effect of voicing is significant for fricatives (estimate
= −59.168, Standard Error (SE) = 13.344, degrees of freedom
(df) = 25.246, t = −4.434, p < 0.001), but not for stops
(estimate = −11.073, SE = 10.930, df = 25.544, t = −1.013,
p = 0.321) or sonorants (estimate = −0.783, SE = 15.393,
TABLE II. Word list used in the production experiment. Tone transcriptions reflect the base tones before the application of tone sandhi.

             Monosyllables                        Disyllables
             Voiceless/modal    Voiced/murmured   Voiceless/modal                   Voiced/murmured

Stops        pin34 “pancake”    bin13 “bottle”    ma13-pin34 “to sell pancakes”     ma13-bin13 “to sell bottles”
             pu34 “to spread”   bu13 “section”    fən53-pu34 “distribution”         fən53-bu13 “division”
             tO34 “to arrive”   dO13 “news”       pO34-tO34 “check-in”              pO34-dO13 “news report”
             ti34 “emperor”     di13 “brother”    ɦuÃ13-ti34 “emperor”              ɦuÃ13-di13 “royal brother”
             kue34 “rail”       gue13 “hoop”      thiɪʔ55-kue34 “rail”              thiɪʔ55-gue13 “iron hoop”
             koŋ34 “arch”       goŋ13 “together”  iɪʔ55-koŋ34 “an arch”             iɪʔ55-goŋ13 “all together”
Fricatives   fi34 “fee”         vi13 “fat”        ke34-fi34 “to reduce the fee”     ke34-vi13 “to lose weight”
             fən34 “hard work”  vən13 “article”   faʔ55-fən34 “to work hard”        faʔ55-vən13 “to publish an article”
             sɿ34 “water”       zɿ13 “porcelain”  dÃ13-sɿ34 “sugar water”           dÃ13-zɿ13 “porcelain”
             su34 “lock”        zu13 “seat”       tɕin53-su34 “golden lock”         tɕin53-zu13 “golden seat”
Sonorants    min34 “chirp”      m̤in13 “name”      ɲjO34-min34 “bird's chirps”       ɲjO34-m̤in13 “bird's name”
             me34 “America”     m̤e13 “plum”       ly34-me34 “traveling in the US”   ly34-m̤e13 Proper name
             ɲjO34 “bird”       ɲ̤jO13 “around”    le13-ɲjO34 “blue bird”            le13-ɲ̤jO13 “indiscriminate”
df = 25.151, t = −0.051, p = 0.960). For the second syllable
of disyllables, likewise, the effect of voicing is significant
for fricatives (estimate = −66.554, SE = 12.558,
df = 25.792, t = −5.300, p < 0.001), but not for stops
(estimate = −8.870, SE = 10.250, df = 25.755, t = −0.865,
p = 0.395) or sonorants (estimate = 0.182, SE = 14.484,
df = 25.676, t = 0.013, p = 0.990).
In terms of voicing, 89% of the voiced stops and 100% of
the voiced fricatives in the second syllable of disyllable tokens
were classified as voiced. This generally agrees with the
results of earlier studies (Shen and Wang, 1995; Chen, 2011;
Wang, 2011; Gao, 2015; Gao and Hallé, 2017). In monosyl-
lables, due to the intervocalic position in which the consonant
appears in the sentential context, 33% of the voiced stop
onsets were also classified as voiced. For fricatives, 100% of
the voiced bilabial fricatives and 50% of the coronal fricatives
were classified as voiced. The tendency for bilabial fricatives
to have more voicing in this position has also been docu-
mented in Gao (2015) and Gao and Hallé (2017). Voiceless
obstruents were occasionally voiced (11% for monosyllables,
31% for the second syllable of disyllables), contra traditional
descriptions for which we have no good explanation except
that the intervocalic or post-nasal positions in which they
appear perhaps encouraged phonetic voicing.
2. Spectral and periodicity measures
The H1*-H2* and CPP results for the vowels after the
three consonant manners in monosyllables are given in
Figs. 2 and 3, respectively. Model comparisons for
H1*-H2* showed that the model did not significantly
improve with the addition of voicing or its interactions with
the linear, quadratic, and cubic time terms for any manner
(p > 0.15 for all comparisons). For CPP, the interaction between voicing and the quadratic time term did significantly improve the model for fricatives [χ²(1) = 8.455, p = 0.004]. Parameter estimates for the quadratic interaction (estimate = 2.160, SE = 0.610, t = 3.538, p = 0.006) indicated that voiceless fricatives induced a sharper peak in the CPP curve on the following vowel than voiced ones; no other model comparisons were significant (all p > 0.07).
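The model comparisons reported here are likelihood-ratio tests between nested growth curve models: twice the log-likelihood difference is referred to a χ² distribution with degrees of freedom equal to the number of added parameters. A minimal sketch of the df = 1 case (the function name and the illustrative log-likelihood values are ours, not from the study):

```python
import math
from statistics import NormalDist

def lr_test_df1(loglik_reduced, loglik_full):
    """Likelihood-ratio test for nested models differing by one parameter.

    Returns the chi-squared statistic and its p-value. For df = 1,
    chi2 = Z**2, so P(chi2 > x) = 2 * (1 - Phi(sqrt(x))).
    """
    chi2 = 2.0 * (loglik_full - loglik_reduced)
    p = 2.0 * (1.0 - NormalDist().cdf(math.sqrt(chi2)))
    return chi2, p

# The fricative CPP comparison reported chi2(1) = 8.455; the
# corresponding p-value recovers the reported p = 0.004.
chi2 = 8.455
p = 2.0 * (1.0 - NormalDist().cdf(math.sqrt(chi2)))
print(round(p, 3))  # 0.004
```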
For the second syllable of disyllables, stops and sonor-
ants again did not exhibit any phonatory difference in H1*-
H2* or CPP on the following vowel based on their laryngeal
features (p> 0.18 for all model comparisons). For fricatives,
however, model comparisons showed that for H1*-H2* the effect of voicing on the intercept significantly improved the model [χ²(1) = 9.564, p = 0.002], and parameter estimates (estimate = 2.241, SE = 0.568, t = 3.942, p = 0.002) indicated that voiceless fricatives induced a lower H1*-H2* than voiced fricatives; for CPP, the effects of the laryngeal feature on the intercept and quadratic time terms both significantly improved the model [intercept: χ²(1) = 8.752, p = 0.003; quadratic: χ²(1) = 6.353, p = 0.012], and parameter estimates showed a significant effect for the quadratic interaction (estimate = 2.881, SE = 0.943, t = 3.054, p = 0.011), indicating that voiceless fricatives again induced a sharper peak in the CPP curve on the following vowel. These results are given in Figs. 4 and 5.4
FIG. 1. Duration of onset consonants in monosyllables and the second syllable of disyllables. *: p < 0.05; **: p < 0.01; ***: p < 0.001.
FIG. 2. H1*-H2* results over the duration of the vowels after stops, fricatives, and sonorants for monosyllables. Symbols represent observed data (vertical lines indicate ±SE) and lines represent growth curve model fits using cubic orthogonal polynomials. *: p < 0.05; **: p < 0.01; ***: p < 0.001.
For the spectral and periodicity measures on the sonorant consonants themselves, for monosyllables, the model for H1*-H2* did not significantly improve with the addition of voicing or its interactions with the linear, quadratic, and cubic time terms (p > 0.75 for all comparisons), but the model for CPP did improve with the addition of voicing on the intercept [χ²(1) = 4.818, p = 0.028] and the quadratic time term [χ²(1) = 4.064, p = 0.044]. Parameter estimates indicated that the modal sonorants had an overall higher CPP value than the murmured sonorants (voicing intercept: estimate = −1.815, SE = 0.510, t = −3.561, p = 0.005), and the murmured sonorants had a more U-shaped curve than the modal sonorants (voicing and quadratic time term interaction: estimate = 0.890, SE = 0.395, t = 2.256, p = 0.041). For sonorant onsets on the second syllable of disyllables, the models for H1*-H2* and CPP did not significantly improve with the addition of voicing or its interactions with the linear, quadratic, and cubic time terms (p > 0.33 for all comparisons). The monosyllabic and disyllabic results are given in Figs. 6 and 7, respectively.
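The growth curve analyses fit the time course of each measure with orthogonal polynomial time terms, so that the linear, quadratic, and cubic effects can be estimated independently. How such terms can be constructed by Gram–Schmidt orthogonalization is sketched below (an illustration of the general technique, not the authors' R code):

```python
import math

def orthogonal_time_terms(n_points, degree=3):
    """Build centered, orthonormal polynomial time terms
    (linear, quadratic, cubic) over n_points equally spaced samples."""
    t = [i / (n_points - 1) for i in range(n_points)]
    basis = [[1.0] * n_points]             # start from the constant term
    for d in range(1, degree + 1):
        v = [ti ** d for ti in t]
        for u in basis:                    # Gram-Schmidt: remove projections
            coef = sum(a * b for a, b in zip(v, u)) / sum(a * a for a in u)
            v = [a - coef * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        basis.append([a / norm for a in v])
    return basis[1:]                       # drop the constant column

linear, quadratic, cubic = orthogonal_time_terms(11)
# Each pair of terms is orthogonal, and each is centered (orthogonal to
# the constant), so the time effects do not compete for shared variance.
dot = sum(a * b for a, b in zip(linear, quadratic))
print(abs(dot) < 1e-9)  # True
```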
3. f0
The f0 results for the monosyllables and the second syl-
lable of disyllables are given in Figs. 8 and 9, respectively.
For monosyllables, the addition of voicing improved the model for the stops [χ²(1) = 8.350, p = 0.004] and fricatives [χ²(1) = 15.153, p < 0.001], and the addition of its interaction with the linear time term improved the model for the fricatives [χ²(1) = 11.224, p < 0.001] and sonorants [χ²(1) = 4.472, p = 0.034]. Parameter estimates for the full model, which include the effects of voicing and its interaction with the linear, quadratic, and cubic time terms for the three manners, are summarized in Table III. With the voiceless/modal category as the baseline, the negative intercepts indicated that the f0s after the voiced/murmured consonants were significantly lower than those after the voiceless/modal consonants, and the positive coefficients for the interaction between voicing and the linear time term indicated that the f0s after the voiced/murmured consonants had sharper rising slopes than those after the voiceless/modal consonants; therefore, the f0 difference between the two types of onsets decreased over the duration of the vowel. For the second syllable in disyllables, however, only for the fricatives did the addition of the laryngeal feature significantly improve the model [χ²(1) = 3.849, p = 0.050]. No other model comparisons were significant (all p > 0.12). Parameter estimates for the full models indicated that the effects of voicing on the intercept or higher time terms were not significant for any manner, including the fricatives.
4. Linear discriminant analysis
Consonant duration and CPP and f0 values averaged
over the entire vowel duration were used as the acoustic vari-
ables in the linear discriminant analysis. These variables
were selected as representatives of the acoustic properties of
the consonant, vowel phonation, and vowel f0. Consonant
duration was selected as the consonant cue as previous stud-
ies have primarily shown the perceptual effect of duration
(e.g., Wang, 2011; Gao and Hallé, 2013; Gao, 2015), and
Wang (2011) has shown that listeners did not use closure
voicing as a perceptual cue for stops. CPP was selected as
the phonation cue as our acoustic results above showed
stronger CPP effects than H1*-H2*. The variables were cen-
tered and scaled before being submitted to the discriminant
analysis.
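Centering and scaling followed by a two-class linear discriminant can be sketched as follows. This is a self-contained illustration with made-up data, using Fisher's closed-form solution for two predictors, not the authors' R analysis:

```python
from statistics import mean, pstdev

def standardize(col):
    """Center a variable to mean 0 and scale it to unit variance."""
    m, s = mean(col), pstdev(col)
    return [(x - m) / s for x in col]

def fisher_lda_2d(class0, class1):
    """Two-class LDA with two predictors: w = Sw^-1 (mu1 - mu0), with a
    decision threshold at the midpoint of the projected class means."""
    def stats(points):
        mu = (mean(p[0] for p in points), mean(p[1] for p in points))
        sxx = sum((x - mu[0]) ** 2 for x, _ in points)
        syy = sum((y - mu[1]) ** 2 for _, y in points)
        sxy = sum((x - mu[0]) * (y - mu[1]) for x, y in points)
        return mu, sxx, sxy, syy
    mu0, sxx0, sxy0, syy0 = stats(class0)
    mu1, sxx1, sxy1, syy1 = stats(class1)
    sxx, sxy, syy = sxx0 + sxx1, sxy0 + sxy1, syy0 + syy1
    det = sxx * syy - sxy * sxy            # invert the 2x2 pooled scatter
    dx, dy = mu1[0] - mu0[0], mu1[1] - mu0[1]
    w = ((syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det)
    thresh = (w[0] * (mu0[0] + mu1[0]) + w[1] * (mu0[1] + mu1[1])) / 2.0
    return w, thresh

# Made-up "consonant duration" and "CPP" values for two laryngeal classes.
raw = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (1.1, 2.1),   # class 0
       (3.0, 4.0), (3.2, 3.8), (2.8, 4.2), (3.1, 4.1)]   # class 1
xs = standardize([p[0] for p in raw])    # center and scale, as in the study
ys = standardize([p[1] for p in raw])
pts = list(zip(xs, ys))
w, thresh = fisher_lda_2d(pts[:4], pts[4:])
correct = sum(w[0] * x + w[1] * y <= thresh for x, y in pts[:4]) + \
          sum(w[0] * x + w[1] * y > thresh for x, y in pts[4:])
print(correct)  # 8: all tokens classified correctly
```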
Table IV summarizes the coefficients for the variables for the linear discriminant functions as well as the Wilks's lambda, F, and p values for the discriminations. Significant
FIG. 3. CPP results over the duration of the vowels after stops, fricatives, and sonorants for monosyllables. Symbols represent observed data (vertical lines indicate ±SE) and lines represent growth curve model fits using cubic orthogonal polynomials. *: p < 0.05; **: p < 0.01; ***: p < 0.001.
FIG. 4. H1*-H2* results over the duration of the vowels after stops, fricatives, and sonorants for the second syllable of disyllables. Symbols represent observed data (vertical lines indicate ±SE) and lines represent growth curve model fits using cubic orthogonal polynomials. *: p < 0.05; **: p < 0.01; ***: p < 0.001.
FIG. 6. H1*-H2* and CPP results over the duration of sonorant onsets for monosyllables. Symbols represent observed data (vertical lines indicate ±SE) and lines represent growth curve model fits using cubic orthogonal polynomials. *: p < 0.05; **: p < 0.01; ***: p < 0.001.
FIG. 7. H1*-H2* and CPP results over the duration of sonorant onsets for the second syllable of disyllables. Symbols represent observed data (vertical lines indicate ±SE) and lines represent growth curve model fits using cubic orthogonal polynomials. *: p < 0.05; **: p < 0.01; ***: p < 0.001.
FIG. 8. Normalized f0 results over the duration of the vowels after stops, fricatives, and sonorants for monosyllables. Symbols represent observed data (vertical lines indicate ±SE) and lines represent growth curve model fits using cubic orthogonal polynomials. *: p < 0.05; **: p < 0.01; ***: p < 0.001.
FIG. 9. Normalized f0 results over the duration of the vowels after stops, fricatives, and sonorants for the second syllable of disyllables. Symbols represent observed data (vertical lines indicate ±SE) and lines represent growth curve model fits using cubic orthogonal polynomials. *: p < 0.05; **: p < 0.01; ***: p < 0.001.
FIG. 5. CPP results over the duration of the vowels after stops, fricatives, and sonorants for the second syllable of disyllables. Symbols represent observed data (vertical lines indicate ±SE) and lines represent growth curve model fits using cubic orthogonal polynomials. *: p < 0.05; **: p < 0.01; ***: p < 0.001.
predictors, as indicated by stepwise variable selection, are marked in Table IV. “Voiceless/modal” was dummy coded as 0.
Therefore, a negative coefficient for a factor indicates that a
higher value for that factor is more likely to lead to a
“voiceless/modal” classification. For monosyllables (non-
sandhi), the only consistent predictor was f0; but for frica-
tives, both CPP and duration were significant as well, and
the stepwise analysis selected f0 first, then CPP, followed by
duration. For the second syllable in disyllables (sandhi), only
the fricatives could be significantly discriminated, and the
stepwise analysis selected duration first, then CPP.
C. Discussion
The acoustic results above indicate that this laryngeal contrast in Shanghai is primarily a tone contrast in the non-sandhi context (monosyllables). Although the H1*-H2* and CPP comparisons between the voiceless/modal and voiced/murmured categories were generally in the expected direction, with the voiceless/modal consonants exhibiting numerically lower H1*-H2* and higher CPP on the following vowel than the voiced/murmured ones, only the CPP comparison for fricatives reached significance under the growth curve analysis. The f0 curves on the vowels after voiceless/modal and voiced/murmured consonants, by contrast, differed significantly on both the intercept and slope for all three manners, except for the slope for stops. There are indications that the consonants themselves
still played a role in the contrast as the fricatives exhibited a
duration difference, while the sonorants exhibited a CPP dif-
ference based on the contrast. Moreover, the attenuation of
the f0 difference over the vowel after voiceless/modal vs
voiced/murmured consonants also suggests that the f0 differ-
ence, at least in part, stems from the onset consonants. The
LDAs provided the relative weighting of the acoustic cues
from consonant duration, vowel phonation, and vowel f0 and
corroborated the acoustic finding that the laryngeal contrast in
the non-sandhi context is primarily tonal, with secondary cues
from CPP and consonant duration for the fricatives.
In the sandhi context (second syllable of disyllables), the f0 difference was neutralized, but the stops gained a voicing difference despite losing the closure duration difference, and the fricatives exhibited both duration and voicing differences. For the
sonorants, however, no difference between the modal and mur-
mured categories was detected in consonant duration, consonant
phonation, vowel phonation, or f0. The LDAs did not encode
the effect of voicing, but confirmed that f0 cannot be used to dis-
criminate the contrast, and that fricatives have enough second-
ary cues in duration and CPP to be differentiated.
These results show that the acoustic cues for the contrast
indeed vary by the manner and position in which the contrast
is realized. In the sandhi position, where a phonological process presumably neutralizes the main cue for the contrast (f0), the contrast itself is incompletely neutralized for fricatives and arguably for stops, but completely neutralized for sonorants as far as the measures included here are concerned.
The weakness of this contrast on sonorants hence finds some
support in the results.
Unlike in previous studies (e.g., Cao and Maddieson, 1992;
Ren, 1992; Gao, 2015), the H1*-H2* and CPP results here gen-
erally did not show a significant effect of the laryngeal feature.
For f0, although we showed that it significantly covaried with
the consonant feature in the non-sandhi context—a result shared
by all previous research—we did not find incomplete neutraliza-
tion in the sandhi context indicated by Ren (1992), Chen
(2011), and Wang (2011). There are two potential reasons for
these disparities. One is that, given our speakers were consider-
ably younger than the speakers used in earlier studies, it is possi-
ble that Shanghai is gradually losing the phonation difference,
and the contrast is now primarily cued by tone in the younger
generations (see Gao, 2015; Gao and Hallé, 2016, 2017, for age- and gender-based differences that support this contention).
Another possibility is that the different results are partly due to
the different statistical methods used. In the linear mixed-
effects-based growth curve analyses, the random effects struc-
ture included not only subject and item, but also subject-by-
voicing interaction. This helps reduce the type I error in hypoth-
esis testing (Barr et al., 2013), in this case, the effect of voicing.
TABLE III. Parameter estimates for the monosyllable f0 analysis. Baseline = voiceless.
                                  Estimate   SE     t       p
Stop      Voicing: Intercept      −0.805     0.190  −4.228  <0.001
          Voicing: Linear          0.533     0.271   1.967   0.068
          Voicing: Quadratic       0.330     0.208   1.588   0.137
          Voicing: Cubic          −0.144     0.140  −1.028   0.314
Fricative Voicing: Intercept      −1.180     0.176  −6.699  <0.001
          Voicing: Linear          1.153     0.228   5.045  <0.001
          Voicing: Quadratic       0.353     0.184   1.917   0.078
          Voicing: Cubic          −0.228     0.177  −1.283   0.231
Sonorant  Voicing: Intercept      −0.682     0.244  −2.789   0.019
          Voicing: Linear          0.973     0.400   2.431   0.043
          Voicing: Quadratic       0.331     0.324   1.022   0.338
          Voicing: Cubic          −0.175     0.142  −1.233   0.232
TABLE IV. Coefficients for the variables for the linear discriminant functions, as well as the Wilks's lambda, F, and p values for the discriminations. Significant predictors (p < 0.05) are marked with an asterisk.
Coefficients                          Duration   CPP      f0       Wilks's lambda  F       p
Monosyllable (non-sandhi)  Stop       −0.124     −0.061   −1.245*  0.684           16.627  <0.001
                           Fricative  −0.604*    −0.761*  −1.207*  0.303           56.080  <0.001
                           Sonorant    0.026     −0.156   −1.314*  0.716            7.402  <0.001
Disyllable (sandhi)        Stop       −0.973      0.200   −0.723   0.961            1.531   0.210
                           Fricative  −1.464*    −0.403*  −0.041   0.434           30.013  <0.001
                           Sonorant    2.136      0.490   −0.377   0.998            0.042   0.988
III. EXPERIMENT 2: PERCEPTION STUDY
A. Methods
The perception study investigated how the different
acoustic cues for the laryngeal contrast are weighted in per-
ception and how the weightings are affected by the manner
and position of the contrast. The stimuli were monosyllabic
and disyllabic words in which the target syllables had a full
cross-classification of three sets of cues—consonant proper-
ties, vowel phonation, and vowel f0. These syllables were
constructed by cross-splicing consonant and vowel portions
of different syllables and superimposing the f0 contour from
one vowel onto another in Praat. For instance, from two base
tokens [pu34] (no. 1) and [bu13] (no. 8), six additional stimuli
(no. 2–no. 7) were constructed, as shown in Table V. Three
monosyllabic pairs, one from each manner, were selected as the original base tokens: pu34–bu13, fi34–vi13, and me34–m̤e13. Their corresponding disyllabic pairs (fən53pu34–fən53bu13, ke34fi34–ke34vi13, and ly34me34–ly34m̤e13) were selected as the originals for the disyllables.
Therefore, there were 24 monosyllables and 24 disyllables in
total as the perceptual stimuli. There are three main reasons
why we used the cross-spliced stimuli in the perception
experiment instead of the acoustic continua often used in
similar studies. First, this method allows for a complete par-
allel for the investigation of different manners in different
positions. The acoustic-continuum method necessitates the
use of different values along the continuous scale due to dif-
ferent acoustic properties depending on context, and hence
loses some of the parallelism. Second, the manipulation is
easily executable. The acoustic-continuum method may not
allow effective continua to be built due to the small acoustic
differences in some contexts. Third, the method is symmetri-
cal among the three sets of cues and hence makes no
assumption about the importance of any particular one.
The base tokens were selected from a female speaker’s
production data, and a number of considerations went into
the selection of these tokens. First, it was ensured that these
tokens were representative of the overall acoustic patterns
reported in Sec. II. Second, given that the f0 contour was
either stretched or compressed when superimposed onto a
vowel of a different duration, the original syllable pairs were
selected such that their vowel durations were as similar as
possible. Third, after f0 was superimposed onto a different
vowel, H1*-H2* and CPP of the new token were remeas-
ured, and we selected the base tokens for which these mea-
sures were minimally affected by the f0 manipulation. A
summary of the acoustic measures for the 12 base tokens, as
well as when the f0 of the base tokens was switched to that
of the other laryngeal category, is given in Table VI, and all
48 test stimuli are provided as supplemental material online.5
All stimuli were embedded in the same carrier sentence and
auditorily presented to the subjects through headphones for a
two-alternative forced choice (2AFC) task, where they had
to choose on a monitor the Chinese character(s) they heard.6
The entire stimulus list was presented four times, and the
order of the stimuli was randomized each time. Forty-one
native speakers (16 male, 25 female) with an age range of
19–37 yr and a mean age of 24.4 yr participated in the exper-
iment in a quiet office at Fudan University in Shanghai.
TABLE V. Examples of stimulus construction for the perception experiment from original tokens [pu34] and [bu13].
Stimulus number  C properties  V phonation  V f0   Method
1.               pu34          pu34         pu34   Original
2.               pu34          pu34         bu13   Superimpose f0 of [bu13] onto [pu34]
3.               pu34          bu13         pu34   Cross-splice C of [pu34] to V of [bu13], then superimpose f0 of [pu34] onto the vowel
4.               pu34          bu13         bu13   Cross-splice C of [pu34] to the V of [bu13]
5.               bu13          pu34         pu34   Cross-splice C of [bu13] to the V of [pu34]
6.               bu13          pu34         bu13   Cross-splice C of [bu13] to V of [pu34], then superimpose f0 of [bu13] onto the vowel
7.               bu13          bu13         pu34   Superimpose f0 of [pu34] onto [bu13]
8.               bu13          bu13         bu13   Original
TABLE VI. Acoustic measures of the base tokens for the perception experiment as well as when the f0 of the base tokens was switched to that of the other laryngeal category (given in parentheses). H1*-H2*, CPP, and f0 were the average values over the vowel.
                                         C duration (ms)  H1*-H2* (dB)    CPP (dB)        f0 (Hz)
Monosyllable (non-sandhi)  pu34          126              −1.55 (0.75)    16.66 (17.28)   217 (201)
                           bu13          124               2.72 (1.77)    16.01 (18.24)   201 (217)
                           fi34          196              −1.18 (1.22)    18.90 (19.28)   229 (191)
                           vi13          126               3.45 (0.58)    17.24 (18.86)   191 (229)
                           me34          122              −1.83 (4.19)    22.98 (24.00)   211 (211)
                           m̤e13          118               8.38 (10.03)   17.44 (21.56)   172 (172)
Disyllable (sandhi)        fən53-pu34    57               −1.28 (0.09)    17.38 (19.41)   217 (198)
                           fən53-bu13    39                6.12 (6.43)    17.47 (18.92)   198 (217)
                           ke34-fi34     147              −3.21 (4.79)    19.11 (21.37)   205 (187)
                           ke34-vi13     70                7.81 (2.03)    20.73 (23.60)   187 (205)
                           ly34-me34     136               8.37 (8.47)    20.54 (22.42)   192 (194)
                           ly34-m̤e13     99               12.27 (11.53)   20.88 (21.77)   194 (192)
For each stimulus type defined by manner and position, a
mixed-effects logistic regression was conducted with the sub-
jects’ binary responses as the dependent variable and the voic-
ing specifications of consonant, phonation, and f0 cues as
categorical predictors with random intercept by subject.7 A
non-parametric analysis—the Classification and Regression
Tree (CART) analysis (Breiman et al., 1984)—was also con-
ducted using the rpart package in R to further investigate how
the listeners classified the stimuli based on these cues. CART
is a recursive partitioning technique that outlines the decision
process for a category membership based on categorical pre-
dictors. The splits in a classification tree are selected so that
the descendant subsets are “purer” than the current set, and
the parameters for the splits can be considered as significant
predictors for the classification. For our analysis, we con-
structed the classification trees by using consonant, phonation,
and f0 cues as categorical predictors for the subjects’ response
for each manner and position by using the rpart function. We
then conducted cost-complexity pruning for each tree based
on the relative errors generated by tenfold cross-validation
using the plotcp and prune functions (Baayen, 2008).
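The split selection at the heart of CART can be illustrated with a toy version of our design: three binary cue predictors and a binary response. The sketch below uses illustrative data and function names, not the rpart implementation:

```python
def gini(labels):
    """Gini impurity of a set of binary labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(rows, predictors, response="response"):
    """Choose the predictor whose split leaves the purest descendant sets
    (lowest weighted Gini impurity), as CART does at each node."""
    n = len(rows)
    scores = {}
    for pred in predictors:
        left = [r[response] for r in rows if r[pred] == 0]
        right = [r[response] for r in rows if r[pred] == 1]
        scores[pred] = (len(left) * gini(left) + len(right) * gini(right)) / n
    return min(scores, key=scores.get), scores

# Toy stimuli: responses track the f0 cue, as in the monosyllable data.
rows = [
    {"consonant": c, "phonation": ph, "f0": f, "response": f}
    for c in (0, 1) for ph in (0, 1) for f in (0, 1)
]
winner, scores = best_split(rows, ["consonant", "phonation", "f0"])
print(winner)  # f0: splitting on f0 yields perfectly pure subsets
```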
B. Results
The accuracy and d′ results for the listeners' classification of the natural tokens are given in Fig. 10. These results
indicate that the subjects had near perfect identification of
the contrast in the non-sandhi context regardless of manner
and in the sandhi context for fricatives. For stops in the san-
dhi context, the identification was weaker, but well above
chance; for sonorants, however, identification was at chance.
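The d′ values here follow the standard signal-detection computation: the z-transformed hit rate minus the z-transformed false-alarm rate. A minimal sketch (the ceiling adjustment shown is one common convention, not necessarily the one used in the study):

```python
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """d' = z(hit rate) - z(false-alarm rate)."""
    z = NormalDist().inv_cdf
    n_signal = hits + misses
    n_noise = false_alarms + correct_rejections
    # Adjust perfect rates so z() stays finite (a common convention).
    h = min(max(hits / n_signal, 0.5 / n_signal), 1 - 0.5 / n_signal)
    f = min(max(false_alarms / n_noise, 0.5 / n_noise), 1 - 0.5 / n_noise)
    return z(h) - z(f)

# 90% hits and 10% false alarms over 100 trials each:
print(round(d_prime(90, 10, 10, 90), 2))  # 2.56
```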
The coefficients for the consonant, phonation, and f0 cues
in the mixed-effects logistic regressions for different manners
and positions are given in Tables VII and VIII. “Voiceless/
modal” was dummy coded as 0 for both the response variable
and all the categorical predictors. Therefore, the intercept in the
models indicates the log odds [ln(p/(1 − p))] of the segment
being given a “voiced/murmured” response when the conso-
nant, phonation, and f0 cues all came from the voiceless/
modal category, and the coefficients for consonant, phona-
tion, and f0 indicate the increase of the log odds when these
cues came from the voiced category, respectively. For mono-
syllables (non-sandhi), f0 was the only consistent factor that
significantly affected the response, and its coefficient was
the largest among the three cues for all three manners; but
for stops, phonation also had a significant effect, and for fri-
catives, both the consonant and phonation cues were signifi-
cant as well. For the second syllable in disyllables (sandhi),
all factors contributed significantly to the response for stops
and fricatives, with phonation and consonant cues having the
largest coefficient for stops and fricatives, respectively; for
sonorants, none of the factors was significant. All significant
effects were in the expected direction, i.e., the cues from the
FIG. 10. Perceptual accuracy and d′ for the natural tokens in the perception experiment.
TABLE VII. Parameter estimates for the mixed-effects logistic regressions for monosyllables (non-sandhi context). Baseline = voiceless.
                         Estimate   SE     z        p
Stop      (Intercept)    −5.007     0.429  −11.667  <0.001
          Consonant      −0.0984    0.222  −0.443    0.658
          Phonation       0.945     0.232   4.059   <0.001
          f0              6.816     0.396  17.195   <0.001
Fricative (Intercept)    −0.945     0.213  −4.428   <0.001
          Consonant       2.523     0.204  12.384   <0.001
          Phonation       1.126     0.177   6.374   <0.001
          f0              2.551     0.205  12.464   <0.001
Sonorant  (Intercept)    −4.429     0.419  −10.585  <0.001
          Consonant       0.284     0.286   0.992    0.321
          Phonation      −0.284     0.286  −0.992    0.321
          f0              7.411     0.450  16.486   <0.001
TABLE VIII. Parameter estimates for the mixed-effects logistic regressions for the second syllable of disyllables (sandhi context). Baseline = voiceless.
                         Estimate   SE     z       p
Stop      (Intercept)    −2.8715    0.298  −9.632  <0.001
          Consonant       0.292     0.146   1.996   0.046
          Phonation       1.484     0.155   9.577  <0.001
          f0              1.015     0.150   6.756  <0.001
Fricative (Intercept)    −2.270     0.244  −9.323  <0.001
          Consonant       4.517     0.267  16.952  <0.001
          Phonation       0.957     0.179   5.343  <0.001
          f0              2.406     0.202  11.885  <0.001
Sonorant  (Intercept)     0.762     0.224   3.402  <0.001
          Consonant       0.057     0.127   0.450   0.652
          Phonation      −0.221     0.127  −1.734   0.083
          f0              0.172     0.127   1.350   0.177
voiced/murmured category elicited more voiced/murmured
responses.
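The log-odds coefficients translate into response probabilities via the logistic function. For example, using the stop coefficients in Table VII (intercept −5.007, f0 coefficient 6.816), one can recover the predicted probability of a "voiced" response:

```python
import math

def logistic(log_odds):
    """Convert log odds ln(p / (1 - p)) back to a probability p."""
    return 1.0 / (1.0 + math.exp(-log_odds))

intercept = -5.007   # all cues from the voiceless/modal category
f0_coef = 6.816      # increase in log odds when the f0 cue is voiced

p_all_voiceless = logistic(intercept)
p_voiced_f0 = logistic(intercept + f0_coef)
print(round(p_all_voiceless, 3), round(p_voiced_f0, 3))  # 0.007 0.859
```

In words: with every cue taken from the voiceless/modal token, listeners almost never responded "voiced," but switching the f0 cue alone pushes the predicted probability to about 0.86, which is why f0 dominates the monosyllable stop classification.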
The CART analyses after pruning are given in Fig. 11.
The only pruning necessary was for fricatives in monosyllables
for which the original tree from the rpart function also included
branches based on phonation. Relative errors generated by ten-
fold cross-validation under different cost-complexity measures
using the plotcp function indicate that the structural complexity
introduced by these branches is not warranted. These branches were subsequently pruned using the prune function.
For stops and sonorants in disyllables, only the root
node was obtained, indicating that none of the cues was a
significant factor in the partition. For stops and sonorants in
monosyllables, f0 was the sole significant predictor for the
subjects’ classification (to read the Monosyllable_Stop
graph, for instance: among the 656 tokens with f0 cues com-
ing from voiceless stop onsets, 641 were classified as voice-
less and 15 were classified as voiced; among the 656 tokens
with f0 cues coming from voiced stop onsets, 549 were clas-
sified as voiced and 107 were classified as voiceless); for fri-
catives in monosyllables, f0 and consonant cues both contributed significantly, but f0 was weighted more heavily than the consonant cue; for fricatives in disyllables, only the consonant and f0 cues were relevant, and the former was more important.
C. Discussion
Both the logistic regression and CART analysis of the
perception data showed that f0 was the primary cue that the
listeners relied on in making category judgment for the laryn-
geal contrast in monosyllables (non-sandhi context). For the
second syllable of disyllables (sandhi context), both analyses
showed that the consonant and f0 cues contributed signifi-
cantly to the voicing classification of fricatives. The logistic
regression analysis, however, identified additional significant
predictors: phonation for stops and fricatives in monosyl-
lables, consonant, phonation, and f0 for stops in disyllables,
and phonation for fricatives in disyllables. For a relatively
small dataset with only a few predictors like ours, it seems
that the CART analysis returned a more conservative estimate
of what predictors are significant in the classification. Logistic
regression and CART differ in that the former is able to provide an estimate of the average effect of a predictor while accounting for other predictors, whereas the latter's hierarchical structure does not, in general, allow the net effect of a predictor to be estimated (Lemon et al., 2003). Without a priori assumptions about how our perception data would pattern, it is perhaps worthwhile to consider both analyses to provide a more comprehensive view of the data.
The perception results were generally consistent with
the aggregate production results: the laryngeal contrast in
question was primarily cued by f0 in the non-sandhi context,
and the f0 cue was able to override conflicting cues in the
consonant or vowel phonation; for the sandhi context, f0 became ineffective in stops and sonorants, but still had an
effect on fricative classification. Different manners relied on
different cues, and classification was the most robust for fri-
catives. For stops in the sandhi context, the fact that the
speakers were able to classify the natural tokens at a high
rate indicates the relevance of the consonant cue, but the
effect of the cue was not strong enough to override
FIG. 11. CART analyses for stops, fricatives, and sonorants in monosyllables and fricatives in the second syllable of disyllables.
conflicting cues from f0 and phonation, if any. For sonorants
in this context, however, both the natural token identification
and the classification of all stimuli demonstrated that there
was simply no reliable cue for the contrast.
It is worth noting that the coefficients in the LDA per-
formed on the acoustic data are not directly comparable with
the coefficients in the logistic regression analysis of the per-
ception data as they mean very different things in the two
analyses (logistic regression was not used for the acoustic
data due to convergence problems). Moreover, the predictors
in the acoustic study were continuous, while those in the per-
ception study were categorical. However, comparisons among the coefficients within each analysis consistently show that the cues are implemented and perceptually weighted differently depending on manner and position, and they underscore the importance of f0 cues in monosyllables and of consonant cues for fricatives in the second syllable of disyllables.
IV. GENERAL DISCUSSION
Both the production and perception results here clearly
show that, at least for the younger speakers that we tested,
the laryngeal contrast in question in Shanghai is primarily
realized as a tone difference acoustically in the non-sandhi
position, and listeners accordingly attend to the f0 cues in
classifying the contrast in this position. However, the fact
that the f0 difference over the vowel diminishes over time
indicates that the voicing/voice quality property of the onset
consonant contributes to the contrast. This is also consistent
with the weakness of the contrast for sonorants, which is a
known crosslinguistic tendency for laryngeal contrasts for
consonants, but would be difficult to explain if the contrast
were purely tonal. Taken together with the acoustic and per-
ceptual results of voicing and f0 cues in tonal and non-tonal
languages elsewhere, the findings are consistent with the
position that the perceptual system is tuned to the distribu-
tion of cues in the particular language.
Our results also shed light on whether certain cues are
inherently better perceptually for a contrast. For instance,
there is some evidence that consonant voicing is better cued on fricatives than on stops. In the non-initial position, although both stops and fricatives exhibited an acoustic voicing difference for the contrast, stop voicing did not seem to be a strong perceptual cue and was not able to override conflicting cues from the vowel (a finding also reported in Wang, 2011), whereas fricative voicing stood out as a cue for the listeners even when conflicting cues were
present. This is potentially because the voicing contrast on
fricatives is cued not only by consonant voicing, but also by
the spectral peak and spectral moments provided by the fri-
cation noise (Jongman et al., 2000). It is also interesting to
note that the voicing difference on fricatives is concomitant
with a larger f0 difference than the voicing difference on
stops in the second syllable sandhi position, as shown in the
growth curve analysis, and the perception results showed
that the f0 difference can be used by listeners. This indicates
that the strength of one cue for a contrast may enhance
another cue for the contrast realized elsewhere.
The presence of phonological tone sandhi had an inter-
esting effect on the acoustic realization and perception of the
laryngeal contrast in question. Although the intervocalic
position is typically a prime position for laryngeal contrasts
for consonants due to the transitional cues that vowels pro-
vide (Steriade, 1997), the fact that this contrast in Shanghai
is primarily cued by f0 in non-sandhi contexts, and the f0 cue
can potentially be lost due to tone sandhi in this position,
makes this a special case. The f0 result on the second sylla-
ble of disyllables indicates that the tonal difference concomi-
tant with the voicing difference of the onset consonant was
indeed neutralized with fricative-onset syllables as marginal
exceptions, but the contrast was only fully lost for the sonor-
ants. For fricatives, there was a voicing and duration differ-
ence between the voiceless and voiced consonants, and the
vowels also differed in the phonation and periodicity mea-
sures; perceptually, both consonant and f0 cues were able to
drown out conflicting cues. For stops, the voiceless and
voiced stops differed in closure voicing in this position; this
voicing difference potentially led to the high d0 score for the
classification of the natural tokens, but was ineffective when
there were conflicting cues. The complexity of the situation
indicates that there is more nuance to incomplete neutraliza-
tion of a phonological contrast, as the “neutralizing” context,
e.g., the non-initial sandhi context, may need to be further
divided up, in this case, by manners of articulation of the
onset consonant.
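For reference, the d′ scores discussed here follow standard signal detection theory: d′ is the difference between the z-transformed hit and false-alarm rates. A minimal Python sketch with hypothetical counts (not the paper's data), using a conventional 1/(2N) correction for rates of 0 or 1:

```python
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """d' = z(hit rate) - z(false-alarm rate); rates of 0 or 1 are
    nudged by 1/(2N) so the inverse normal CDF stays finite."""
    z = NormalDist().inv_cdf
    n_sig = hits + misses
    n_noise = false_alarms + correct_rejections
    h = min(max(hits / n_sig, 1 / (2 * n_sig)), 1 - 1 / (2 * n_sig))
    fa = min(max(false_alarms / n_noise, 1 / (2 * n_noise)),
             1 - 1 / (2 * n_noise))
    return z(h) - z(fa)

# Hypothetical example: 45/50 hits and 5/50 false alarms
print(round(d_prime(45, 5, 5, 45), 2))  # → 2.56
```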
The weakness of the voice quality contrast for sonorant
consonants was evident in both the production and percep-
tion results. In the non-sandhi position, the contrast was cued
by f0 on the following vowel, and there was a CPP difference on the consonant itself, but the CPP cue was too weak to compete with conflicting cues in the
perception experiment. In the sandhi position, the sonorants
were the only manner that lost all acoustic cues reported
here between the contrasting pair, and the perceptual results
also showed that there was no discriminability between the
modal and murmured sonorants in this position. These
results, on the one hand, support the contention of Berkson
(2016b) that phonation contrasts tend to be more weakly
cued on sonorants than obstruents, which potentially contrib-
utes to their typological rarity (see also Gao, 2015; Gao and
Hallé, 2015); on the other hand, they also support the phono-
logical theory of “licensing-by-cue” and its variations
(Steriade, 1997, 2008), which contend that phonological
contrasts are better licensed in contexts of better perceptibil-
ity and more susceptible to loss when the cues are endan-
gered. The complete loss of the laryngeal contrast for
sonorants in the sandhi position in Shanghai is a case in
point. A caveat to the current results is that the acoustic and
perceptual data both come from nasals, and it is possible that
other sonorants, such as liquids, may behave differently,
especially given that nasalization and spread glottis share
similar acoustic consequences of increased amplitude of the
first harmonic and increased bandwidth of the first formant
(Keyser and Stevens, 2006), and have been shown to be per-
ceptually confusable with each other (Klatt and Klatt, 1990).
However, the confusion in the source of an increased first
harmonic reported in Klatt and Klatt (1990) was for a female
J. Acoust. Soc. Am. 144 (3), September 2018 Jie Zhang and Hanbo Yan 1305
voice whose first harmonic is close to the value of the nasal
pole; and Berkson (2013) showed in her study of breathiness
in Marathi that only males cued breathiness with H1*-H2*,
while females used CPP. This indicates that the confusion
between nasalization and breathiness can potentially be
avoided. Moreover, if the weakness of the phonation cues
for sonorants is entirely due to the confusability between
nasalization and breathiness, then the typological rarity of
phonation contrasts on sonorants, in general, remains unac-
counted for.
Although the results of our perception study by and large
match the results of the acoustic study in the aggregate, we
are not in a position to make generalizations about how the
production and perception of this laryngeal contrast in
Shanghai are related to each other on an individual speaker’s
level, as the subjects in the two experiments were distinct sets. It is possible that individual subjects tune their perception
to the aggregate input in their environment, but we do not
exclude the possibility that individual subjects’ perception is
disproportionately biased by their own production. It must also
be acknowledged that both our production and perception
studies were conducted with relatively young speakers of
Shanghai, and as previously mentioned, it is possible that the
voicing/voice quality contrast has undergone or is undergoing
restructuring (see Gao, 2015; Gao and Hallé, 2016, 2017); investigating this possibility requires a design that incorporates sociolinguistic factors, which ours does not.
Finally, if we consider this contrast to stem from a sin-
gle distinctive feature, then this set of data lays out clearly
the challenges for how the instantiation of this feature in a
particular language can be acquired, as the issue concerns
not only the weighting and integration of multiple cues in
potentially unsupervised learning, but also how this learning
can overcome the contextual dependency of cue weighting,
especially when phonological processes intervene. The current work does not offer a solution to this difficult problem, but it does suggest that the learning of phonological
contrast realization is likely guided by the morphophonolog-
ical alternation in the language as well as the distributional
properties of the acoustic dimensions along which the con-
trast manifests itself.
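One concrete form such distribution-driven learning can take (in the spirit of Toscano and McMurray, 2010, though not their model) is fitting a mixture of Gaussians along a single cue dimension: from unlabeled tokens alone, a learner recovers the two category means and, via the ratio of category separation to within-category spread, how reliable that cue is. A toy Python sketch of the EM fit:

```python
import numpy as np

def em_two_gaussians(x, iters=200):
    """Unsupervised two-category learning on one acoustic cue:
    EM for a two-component 1-D Gaussian mixture."""
    mu = np.array([x.min(), x.max()])      # spread initial means apart
    sd = np.array([x.std(), x.std()])
    w = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: posterior probability of each category for each token
        dens = w * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / sd
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate category means, spreads, and weights
        n = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / n
        sd = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / n)
        w = n / len(x)
    return mu, sd, w
```

On a bimodal f0-like distribution (e.g., tokens around 180 Hz and 250 Hz), the fit recovers two well-separated means; on a unimodal distribution, the two components largely overlap, signaling an unreliable cue for that context.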
V. CONCLUSION
This paper presents a case study on how a phonological
contrast is cued in multiple phonetic dimensions, both acousti-
cally and perceptually. What is of particular interest is that the
contrast in question—a laryngeal contrast in Shanghai Wu—
is cued differently when realized on different manners (stops,
fricatives, sonorants) and in different positions (non-sandhi,
sandhi). Acoustic results showed that, although this contrast
has been described as phonatory in earlier literature, its pri-
mary cue is in tone, at least in the younger speakers who were tested. In the non-sandhi position, phonation correlates appear only on fricative-onset syllables and sonorant consonants;
stops and fricatives have consonant duration cues, and frica-
tives also have a frication voicing cue. In the sandhi position,
tone sandhi neutralizes the f0 difference, but the contrast is
maintained in fricatives by both consonant and vowel
phonation cues, marginally maintained in stops by closure
voicing, and lost in sonorants. The perception results were
largely consistent with the aggregate acoustic results, indicat-
ing that speakers adjust the perceptual weights of individual
cues for a contrast according to manner and context. These findings sup-
port the position that phonological contrasts are formed by the
integration of multiple cues in a language-specific, context-
specific fashion and should be represented as such.
ACKNOWLEDGMENTS
We are grateful to Dan Yuan and Zhongmin Chen for
hosting us at Fudan University for data collection, Yifeng Li
and Zhenzhen Xu for serving as our Shanghai consultants,
Kelly Berkson, Christina Esposito, and Goun Lee for
helping us with VoiceSauce, Mingxing Li for helping us
with the linear discriminant analysis, and the University of
Kansas General Research Fund No. 2301618 for financial
support. We also thank the Associate Editor Megha Sundara
and four anonymous reviewers for their many insightful
comments, which helped improve both the content and the
presentation of the paper. All remaining errors are our own.
1We focus on the closure instead of the post-release portion of the stops
here as the previous literature on Shanghai has shown that the difference in
release duration between voiceless unaspirated and voiced stops in either
initial or medial position is minimal (Shen and Wang, 1995; Chen, 2011).
2The authors reported H2-H1 and F1-H1. These were converted to H1-H2
and H1-F1, and F1 was changed, notationally, to A1, to be consistent with
the rest of the paper.
3For spectral measures, we focus on H1*-H2* for two reasons. First,
although different spectral measures have been shown to be effective voice
quality measures in different languages, H1-H2 is the most consistently
used parameter in the literature and is found to be effective in the majority
of languages with phonation contrasts. Gao (2015) and Gao and Hallé (2017) also found that H1-H2 was the most consistently used acoustic
parameter for the laryngeal contrast in Shanghai by speakers of different
age groups and genders and in different tonal contexts. Second, H1*-A1*,
H1*-A2*, and H1*-A3* were also measured and analyzed for our study,
and they did not reveal additional differences for the contrast in question
not shown by H1*-H2*.
4An anonymous reviewer asked whether the word pairs with a nasal coda
behaved similarly to those that are open in the phonation measures due to
the potential confusion between breathiness and nasality reported in the
literature (Klatt and Klatt, 1990; Keyser and Stevens, 2006). We reran the
growth curve analyses for H1*-H2* and CPP on the vowels for the stimuli
without nasal codas, and the statistical patterns were identical to the ones
reported here except that for the CPP in sonorant onsets in the monosyl-
labic (no sandhi) context, the addition of the voicing intercept
[χ²(1) = 4.451, p = 0.035] and the interaction between voicing and the linear time term [χ²(1) = 4.522, p = 0.033] both significantly improved the
model, with the modal sonorants inducing a greater CPP and a slower CPP
decrease on the following vowel.
5See supplementary material at https://doi.org/10.1121/1.5052364 for the
acoustic files used in the perception experiment.
6An anonymous reviewer raised the issue of whether any prosodic effects
on the onset consonants (e.g., as documented in Chen, 2011) could have
influenced the perception results. In the production study, the speaker read all
items in the same carrier sentence, effectively putting the items in a focus
position. In the perception study, the listeners also listened to the same car-
rier sentence and therefore performed the identification in the same focus
position. Therefore, the entire study can be conceived as an investigation
of this laryngeal contrast in focus position.
7Additional analyses that included random slopes by subject for each factor
were also conducted for different manners in the two positions. Models
that included random slopes for all factors failed to converge.
Models that included a subset of the random slopes were attempted as
well, and there was no consistent random slope structure that converged
for all manners and positions. We therefore opted to report the models
with random intercept by subject only.
Abramson, A. S., and Lisker, L. (1985). “Relative power of cues: F0 shift
versus voice timing,” in Linguistic Phonetics, edited by V. Fromkin
(Academic, New York), pp. 25–33.
Abramson, A. S., Nye, P. W., and Luangthongkum, T. (2007). “Voice regis-
ter in Khmu’: Experiments in production and perception,” Phonetica 64,
80–104.
Andruski, J., and Ratliff, M. (2000). “Phonation types in production of pho-
nological tone: The case of Green Mong,” J. Int. Phon. Assoc. 30, 37–61.
Aoki, H. (1970). “A note on glottalized consonants,” Phonetica 21, 65–75.
Baayen, R. H. (2008). Analyzing Linguistic Data—A Practical Introduction to Statistics Using R (Cambridge University Press, Cambridge, UK).
Barr, D. J., Levy, R., Scheepers, C., and Tily, H. J. (2013). “Random effects
structure for confirmatory hypothesis testing: Keep it maximal,” J. Mem.
Lang. 68, 255–278.
Bates, D., Maechler, B., Bolker, B., and Walker, S. (2015). “Fitting linear
mixed-effects models using lme4,” J. Stat. Software 67, 1–48.
Berkson, K. (2013). “Phonation types in Marathi: An acoustic inves-
tigation,” Ph.D. dissertation, University of Kansas.
Berkson, K. (2016a). “Durational properties of Marathi obstruents,” Indian
Linguist. 76(3-4), 7–25.
Berkson, K. (2016b). “Production, perception, and distribution of breathy
sonorants in Marathi,” in Formal Approaches to South Asian Languages, Vol. 2, edited by M. Menon and S. Syed (Open Journal Systems 2.4.6.0,
University of Konstanz, Konstanz, Germany), pp. 4–14.
Blankenship, B. (2002). “The timing of nonmodal phonation in vowels,”
J. Phonetics 30, 163–191.
Boersma, P., and Weenink, D. (2012). “Praat: Doing phonetics by computer
(version 5.3.14) [computer program],” http://www.praat.org/ (Last viewed
5 July 2012).
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984).
Classification and Regression Trees (Wadsworth, Belmont, CA).
Brunelle, M. (2009). “Tone perception in Northern and Southern
Vietnamese,” J. Phonetics 37, 79–96.
Brunelle, M. (2012). “Dialect experience and perceptual integrality in pho-
nological registers: Fundamental frequency, voice quality and the first for-
mant in Cham,” J. Acoust. Soc. Am. 131, 3088–3102.
Cao, J.-F., and Maddieson, I. (1992). “An exploration of phonation types in
Wu dialects of Chinese,” J. Phonetics 20, 77–92.
Chao, Y.-R. (1967). “Contrastive aspects of the Wu dialects,” Language 43,
92–101.
Chen, M. Y. (1970). “Vowel length variation as a function of the voicing of
the consonant environment,” Phonetica 22, 129–159.
Chen, Y.-Y. (2011). “How does phonology guide phonetics in segment-f0 interaction?,” J. Phonetics 39, 612–625.
Chomsky, N., and Halle, M. (1968). The Sound Pattern of English (Harper
and Row, New York).
Clayards, M., Tanenhaus, M. K., Aslin, R. N., and Jacobs, R. A. (2008).
“Perception of speech reflects optimal use of probabilistic speech cues,”
Cognition 108, 804–809.
Clements, G. N. (2009). “The role of features in phonological inventories,”
in Contemporary Views on Architecture and Representations in Phonology, edited by E. Raimy and C. Cairns (MIT Press, Cambridge,
MA), pp. 19–68.
Davis, K. (1994). “Stop voicing in Hindi,” J. Phonetics 22, 177–193.
de Krom, G. (1993). “A cepstrum-based technique for determining a har-
monics-to-noise ratio in speech signals,” J. Speech Hear. Res. 36,
224–266.
DiCanio, C. (2014). “Cue weight in the perception of Trique glottal con-
sonants,” J. Acoust. Soc. Am. 135, 884–895.
Dinnsen, D. A., and Charles-Luce, J. (1984). “Phonological neutralization,
phonetic implementation, and individual differences,” J. Phonetics 12,
49–60.
Dmitrieva, O., Jongman, A., and Sereno, J. (2010). “Phonological neutrali-
zation by native and non-native speakers: The case of Russian final
devoicing,” J. Phonetics 38, 483–492.
Dupoux, E., Kakehi, K., Hirose, Y., Pallier, C., and Mehler, J. (1999).
“Epenthetic vowels in Japanese: A perceptual illusion?,” J. Exp. Psych.:
Human Percept. Perform. 25, 1568–1578.
Dutta, I. (2009). Acoustics of Stop Consonants in Hindi: Voicing, Fundamental Frequency and Spectral Intensity (Verlag Dr. Müller, Saarbrücken, Germany).
Esposito, C. M. (2010a). “Variation in contrastive phonation in Santa Ana
Del Valle Zapotec,” J. Int. Phonetic Assoc. 40, 181–198.
Esposito, C. M. (2010b). “The effects of linguistic experience on the percep-
tion of phonation,” J. Phonetics 38, 306–316.
Esposito, C. M. (2012). “An acoustic and electroglottographic study of
White Hmong tone and phonation,” J. Phonetics 40, 466–476.
Flege, J. E., and Wang, C. (1989). “Native-language phonotactic constraints
affect how well Chinese subjects perceive the word-final /t/-/d/ contrast,”
J. Phonetics 17, 299–315.
Francis, A. L., Kaganovich, N., and Driscoll-Huber, C. (2008). “Cue-specific
effects of categorization training on the relative weighting of acoustic cues
to consonant voicing in English,” J. Acoust. Soc. Am. 124, 1234–1251.
Gao, J.-Y. (2015). “Interdependence between tones, segments and phonation
types in Shanghai Chinese: Acoustics, articulation, perception and
evolution,” Ph.D. dissertation, Université Sorbonne Nouvelle–Paris III,
Paris, France.
Gao, J.-Y., and Hallé, P. (2013). “Duration as a secondary cue for perception
of voicing and tone in Shanghai Chinese,” in Proc. of Interspeech 14,
Lyon, France, pp. 3157–3162.
Gao, J.-Y., and Hallé, P. (2015). “The role of voice quality in Shanghai tone
perception,” in Proc. of ICPhS 18, Glasgow, Scotland, UK, paper no. 448.
Gao, J.-Y., and Hallé, P. (2016). “Sociolinguistic motivations in sound
change: On-going loss of low tone breathy voice in Shanghai Chinese,”
Papers Hist. Phonology 1, 166–186.
Gao, J.-Y., and Hallé, P. (2017). “Phonetic and phonological properties of
tones in Shanghai Chinese,” Cahiers de Linguistique Asie Orientale 46,
1–31.
Garellek, M., and Keating, P. (2011). “The acoustic consequences of phona-
tion and tone interactions in Jalapa Mazatec,” J. Int. Phonetic Assoc. 41,
185–205.
Garellek, M., Keating, P., Esposito, C. M., and Kreiman, J. (2013). “Voice
quality and tone identification in White Hmong,” J. Acoust. Soc. Am. 133,
1078–1089.
Gordon, M., and Ladefoged, P. (2001). “Phonation types: A crosslinguistic
overview,” J. Phonetics 29, 383–406.
Halle, M., and Stevens, K. (1971). “A note on laryngeal features,” Q.
Progress Rep. Res. Lab. Electron. (MIT) 101, 198–213.
Hallé, P., and Best, C. (2007). “Dental-to-velar perceptual assimilation: A
cross-linguistic study of the perception of dental stop+/l/ clusters,”
J. Acoust. Soc. Am. 121, 2899–2914.
Hanson, H. M., Stevens, K. N., Kuo, H.-K. J., Chen, M. Y., and Slifka, J.
(2001). “Towards models of phonation,” J. Phonetics 29, 451–480.
Hillenbrand, J. M., Cleveland, R. A., and Erickson, R. L. (1994). “Acoustic
correlates of breathy vocal quality,” J. Speech Hear. Res. 37, 769–778.
Holmberg, E., Hillman, R., Perkell, J., Guiod, P., and Goldman, S. (1995).
“Comparisons among aerodynamic, electroglottographic, and acoustic
spectral measures of female voice,” J. Speech Hear. Res. 38, 1212–1223.
Holt, L. L., Lotto, A. J., and Kluender, K. R. (2001). “Influence of funda-
mental frequency on stop-consonant voicing perception: A case of learned
covariation or auditory enhancement,” J. Acoust. Soc. Am. 109, 764–774.
Huffman, M. K. (1987). “Measures of phonation in Hmong,” J. Acoust. Soc.
Am. 81, 495–504.
Jakobson, R., Fant, G., and Halle, M. (1952). Preliminaries to Speech Analysis (MIT Press, Cambridge, MA).
Jongman, A., Wayland, R., and Wong, S. (2000). “Acoustic characteristics
of English fricatives,” J. Acoust. Soc. Am. 108, 1252–1263.
Keyser, S. J., and K. N. Stevens (2006). “Enhancement and overlap in the
speech chain,” Language 82, 33–63.
Khan, S. D. (2012). “The phonetics of contrastive phonation in Gujarati,”
J. Phonetics 40, 780–795.
Kim, H., and Jongman, A. (1996). “Acoustic and perceptual evidence for
complete neutralization of manner of articulation in Korean,” J. Phonetics
24, 295–312.
Kingston, J. (1992). “The phonetics and phonology of perceptually moti-
vated articulatory covariation,” Lang. Speech 35, 99–113.
Kingston, J., Diehl, R. L., Kirk, C. J., and Castleman, W. A. (2008). “On the
internal perceptual structure of distinctive features: The [voice] contrast,”
J. Phonetics 36, 28–54.
Klatt, D. H., and Klatt, L. C. (1990). “Analysis, synthesis and perception of
voice quality variations among male and female talkers,” J. Acoust. Soc.
Am. 87, 820–856.
Kuznetsova, A., Brockhoff, B., and Christensen, H. (2016). “Tests in linear
mixed effects models,” available at https://cran.r-project.org/web/packages/
lmerTest/index.html (Last viewed August 3, 2018).
Laver, J. (1980). The Phonetic Description of Voice Quality (Cambridge
University Press, Cambridge, UK).
Lemon, S. C., Roy, J., Clark, M. A., Friedmann, P. D., and Rakowski, W.
(2003). “Classification and Regression Tree analysis in public health:
Methodological review and comparison with logistic regression.” Ann.
Behav. Med. 36, 172–180.
Lisker, L. (1986). “‘Voicing’ in English: A catalogue of acoustic features
signalling /b/ versus /p/ in trochees,” Lang. Speech 29, 3–11.
Llanos, F., Dmitrieva, O., Shultz, A., and Francis, A. L. (2013). “Auditory
enhancement and second language experience in Spanish and English
weighting of secondary voicing cues,” J. Acoust. Soc. Am. 134,
2213–2224.
Massaro, D. W. (1987). “Psychophysics versus specialized processes in
speech perception: An alternative perspective,” in The Psychophysics of Speech Perception, edited by M. E. H. Schouten (Martinus Nijhoff,
Boston), pp. 46–65.
Massaro, D., and Cohen, M. (1983). “Phonological context in speech
perception,” Percept. Psychophys. 34, 338–348.
McMurray, B., Cole, J. S., and Munson, C. (2011). “Features as an emergent
product of computing perceptual cues relative to expectations,” in Where Do Phonological Features Come From?: Cognitive, Physical and Developmental Bases of Distinctive Speech Categories, edited by G. N.
Clements, and R. Ridouane (John Benjamins, Amsterdam/Philadelphia),
pp. 197–235.
Mikuteit, S., and Reetz, H. (2007). “Caught in the ACT: The timing of aspi-
ration and voicing in Bengali,” Lang. Speech 50, 247–277.
Miller, A. L. (2007). “Guttural vowels and guttural co-articulation in
Juǀ’hoansi,” J. Phonetics 35, 56–84.
Mirman, D. (2014). Growth Curve Analysis and Visualization Using R (CRC Press, Boca Raton, FL).
Newman, R. S. (2003). “Using links between speech perception and speech
production to evaluate different acoustic metrics: A preliminary report,”
J. Acoust. Soc. Am. 113, 2850–2860.
Parker, E. M., Diehl, R. L., and Kluender, K. R. (1986). “Trading relations
in speech and nonspeech,” Percept. Psychophys. 39, 129–142.
Port, R., and Crawford, P. (1989). “Incomplete neutralization and pragmat-
ics in German,” J. Phonetics 17, 257–282.
R Core Team (2014). “R: A language and environment for statistical com-
puting (version 3.1.0),” (R Foundation for Statistical Computing, Vienna),
available at http://www.R-project.org/ (Last viewed October 10, 2017).
Raphael, L. J. (1972). “Preceding vowel duration as a cue to the perception
of the voicing characteristic of word-final consonants in American
English,” J. Acoust. Soc. Am. 51, 1296–1303.
Ren, N.-Q. (1992). “Phonation types and stop consonant distinctions:
Shanghai Chinese,” Ph.D. dissertation, University of Connecticut, Storrs.
Repp, B. H. (1983). “Trading relations among acoustic cues in speech per-
ception are largely a result of phonetic categorization,” Speech Commun.
2, 341–361.
Shen, Z.-W., and Wang, W. S. (1995). “Wuyu zhuoseyin de yanjiu—Tongji
shang de fenxi he lilun shang de kaolü” (“A study of voiced stops in the
Wu dialects—Statistical analysis and theoretical considerations”), in
Wuyu Yanjiu (Studies of the Wu Dialects), edited by E. Zee (New Asia
Books, Hong Kong), pp. 219–238.
Shue, Y.-L., Keating, P., Vicenik, C., and Yu, K. (2011). “VoiceSauce: A
Program for Voice Analysis,” available at http://www.ee.ucla.edu/~spapl/
voicesauce/ (Last viewed November 1, 2015).
Shultz, A. A., Francis, A. L., and Llanos, F. (2012). “Differential cue
weighting in perception and production of consonant voicing,” J. Acoust.
Soc. Am. 132, EL95–EL101.
Sj€olander, K. (2004). “The Snack Sound Toolkit,” available at http://
www.speech.kth.se/snack/ (Last viewed March 2, 2018).
Steriade, D. (1997). “Phonetics in phonology: The case of laryngeal neu-
tralization,” UCLA Work. Pap. Phonetics 3, 25–146.
Steriade, D. (2008). “The phonology of perceptibility effects: The P-map
and its consequences for constraint organization,” in The Nature of the Word: Essays in Honor of Paul Kiparsky, edited by K. Hanson and S.
Inkelas (MIT Press, Cambridge, MA), pp. 151–180.
Stevens, K. N. (1977). “Physics of laryngeal behavior and larynx modes,”
Phonetica 34, 264–279.
Stevens, K. N. (2002). “Toward a model for lexical access based on acoustic
landmarks and distinctive features,” J. Acoust. Soc. Am. 111, 1872–1891.
Stevens, K. N., and Keyser, S. J. (2010). “Quantal theory, enhancement, and
overlap,” J. Phonetics 38, 10–19.
Toscano, J. C., and McMurray, B. (2010). “Cue integration with categories:
Weighting acoustic cues in speech using unsupervised learning and distri-
butional statistics,” Cogn. Sci. 34, 434–464.
Traill, A., and Jackson, M. (1988). “Speaker variation and phonation type in
Tsonga nasals,” J. Phonetics 16, 385–400.
Venables, W. N., and Ripley, B. D. (2002). Modern Applied Statistics with S, 4th ed. (Springer, New York).
Wang, Y.-Z. (2011). “Acoustic measurements and perceptual studies on ini-
tial stops in Wu dialects—Take Shanghainese for example,” Ph.D. disser-
tation, Zhejiang University, China.
Warner, N., Jongman, A., Sereno, J., and Kemper, R. (2004). “Incomplete
neutralization of sub-phonemic durational differences in production and
perception of Dutch,” J. Phonetics 32, 251–276.
Wayland, R., and Jongman, A. (2003). “Acoustic correlates of breathy and
clear vowels: The case of Khmer,” J. Phonetics 31, 181–201.
Weihs, C., Ligges, U., Luebke, K., and Raabe, N. (2005). “klaR analyzing
German business cycles,” in Data Analysis and Decision Support, edited by
D. Baier, R. Decker, and L. Schmidt-Thieme (Springer, Berlin), pp. 335–343.
Xu, B.-H., and Tang, Z.-Z. (1988). Shanghai Shiqu Fangyan Zhi (A Description of the Urban Shanghai Dialect) (Shanghai Educational Press,
Shanghai).
Xu, Y. (2005–2013). “ProsodyPro.praat,” available at http://www.phon.ucl.
ac.uk/home/yi/ProsodyPro/ (Last viewed November 1, 2015).
Zee, E., and Maddieson, I. (1980). “Tones and tone sandhi in Shanghai:
Phonetic evidence and phonological analysis,” Glossa 14, 45–88.
Zhu, X.-N. (1999). Shanghai Tonetics (Lincom Europa, München).
Zhu, X.-N. (2006). A Grammar of Shanghai Wu (Lincom Europa, München).