MUSICAL EVIDENCE FOR SYLLABIFICATION OF
HIGHLY MORAIC STRUCTURES IN ENGLISH
by
Jenica Jessen
A Senior Honors Thesis Submitted to the Faculty of The University of Utah
In Partial Fulfillment of the Requirements for the
Honors Degree in Bachelor of Arts
In
Linguistics
Approved:
______________________________ Abby Kaplan Thesis Faculty Supervisor
_____________________________ Patricia Hanna Acting Chair, Department of Linguistics
_______________________________ Aaron Kaplan Honors Faculty Advisor
_____________________________ Sylvia D. Torti Dean, Honors College
April 2017 Copyright © 2017
All Rights Reserved
ABSTRACT
This study uses musical data as evidence for syllabification patterns among native
English speakers. Our research seeks evidence from musical pitches in songs by
American singer-songwriters that syllables with a diphthong and a liquid in their rime
undergo bi-syllabification at a different rate from other syllables. The study concludes that
variations exist between individuals, some of whom have a contrast between extremely
heavy syllables and others of whom do not. Furthermore, the study addresses the
influence of part of speech on vowel production and thus syllabification, concluding that
certain diphthongs are reduced to monophthongs within function words but not content
words.
TABLE OF CONTENTS
ABSTRACT
INTRODUCTION
LITERATURE REVIEW
METHODS
ANALYSIS
CONCLUSIONS
REFERENCES
INTRODUCTION
In most cases, both linguists and speakers tend to demonstrate strong instincts
about how words should be syllabified—but not always. Words with a diphthong
followed by a liquid present a unique challenge. While many speakers judge words such
as “fire” or “smile” as having one syllable, many others judge them to have two, and even
more speakers have trouble making judgements in the first place. Additionally, not all
speakers produce the same number of syllables as they might judge an utterance to
contain. Further complicating the matter is the difficulty of determining articulatory
properties of syllables—there is strong evidence that they are a cognitive unit of
organization, but little indication that they can be identified in physical phenomena.
One exception, however, is in the production of music. Generally speaking, song
closely mimics the rhythms of speech, meaning that syllable count may be reflected in
musical patterns. Thus, in order to determine how many syllables exist in a word like
“fire”, our study investigates how many pitches are used to sing it when it occurs in
music created by American singer-songwriters. We present evidence that this
methodology can be used to accurately determine syllable counts.
Our analysis indicates that there is a strong variation from individual to
individual; while some people consistently produce “fire” with one syllable, others
consistently produce it with two. We also investigate the influence of part of speech on
the production of these words, concluding that in some cases, vowel pronunciation and
thus syllabification may be affected by a lexical item’s status as a function word (such as
a conjunction or pronoun) or a content word (such as a noun or verb).
LITERATURE REVIEW
English syllables consisting of diphthongs followed by liquids have been analyzed
in a variety of different ways. Lavoie and Cohn (1999) interpret these structures as
“sesquisyllables”—structures containing three moras. The diphthong contains two moras
(one for each vowel) and the liquid contains a third. They further argue that three-mora
syllables are dispreferred in English, making this configuration inherently unstable.
Thus these words are represented as either two separate syllables or one “superheavy”
syllable, depending on the situation.
Further research from Cohn and Tilsen (2015) indicates that speakers who tend to
produce these words with two syllables also perceive them as having two syllables, while
those who produce them with only one perceive them with only one. This suggests that
the variations in pronunciation are structurally conditioned; that is, they are due to the
underlying representation of the word. For some people, the underlying representation of
a word like “fire” is one syllable, while for other people it is two.
Our research attempted to gather more information on this phenomenon through
analysis of song lyrics, text setting, and musical pitches. A previous study by Palmer
and Kelly (1992) indicates that stress and prosodic structure of spoken words are tightly
correlated with rhythm and meter when the text is set to music, a conclusion supported by
the work of Dell and Halle (2005), Rodriguez-Vasquez (2010), and Sui (2013). Since
syllables are the basic units that carry stress in speech and notes are the basic units that
carry meter in music, a general assumption for text setting in many languages is that each
note carries one syllable.
Since by default one note equals one syllable, we might expect that a word like
“fire” will be sung across two notes if the speaker represents it as two syllables, and it will be
sung with only one note if the speaker represents it with a single syllable. However,
actual text setting is not quite so simple. As described in Schellenberg (2016), English does
not always obey the “one note per syllable” principle. While many languages (such as
German or Russian) strongly prefer to limit each syllable to a single note, English allows
melisma, or the spreading of a single syllable across multiple notes. Thus a word like
“all” might be drawn out across several pitches, even if it is underlyingly represented by
only a single syllable.
Another important phenomenon to take into account is that the natural rhythm of
speech may be distorted for artistic purposes when set to music. Schellenberg (2012) uses
evidence from Chinese to argue that language does not always determine musical
behavior. If a songwriter wishes to violate a property of the chosen words—for example,
by mapping a word containing a high tone onto a note with a low pitch—they are free to
do so. This suggests that when the text and the music come into conflict, the music will
often win, further complicating the problem of using musical works to study linguistic
properties.
Although these issues make it difficult to draw simple conclusions, they can be
mitigated, allowing for evidence to be extracted from musical data. One important
strategy is to gather large amounts of data in order to reduce the potential influence of
artistic license—although a singer with a one-syllable underlying representation of a
certain word may choose to draw that word out across multiple notes, it is highly unlikely
that such behavior will be repeated across every song created by the singer. Additionally,
statistical analysis must take into account the presence of repetitive words; a line that is
repeated over and over in the chorus may skew the data if the words it contains are
counted as an independent token each time. Finally, obvious distortions for artistic effect
should be ignored. For example, if the word “all” is spread out across ten different
pitches, this is almost certainly evidence for creative license rather than a ten-syllable
underlying representation.
METHODS
Data for this project was gathered from the works of 12 American singer-
songwriters, defined as people who compose the music for, write the lyrics of, and
perform their own songs. This group was chosen in order to ensure that the target words
would be the product of one person’s intuitions; we wanted to eliminate the possibility of
another person’s underlying representation influencing the outcome.
The chosen artists—listed in Figure 1—had a variety of backgrounds, with
birthdates ranging from 1941 to 1980, birthplaces located in ten different states, and work
produced in several different genres of music. However, the chosen artists were not
highly diverse, since the set of eligible artists with sufficiently sized bodies of work was
somewhat limited; all but two were male and all but one were white.
Figure 1: Artists Researched
Artist Name        Birth Year  Location        Gender  Race              Main Genres
Bob Dylan          1941        Minnesota       Male    White             Folk, Blues, Country
James Taylor       1948        North Carolina  Male    White             Folk Rock, Country
Bruce Springsteen  1949        New Jersey      Male    White             Rock
Billy Joel         1949        New York        Male    White             Soft Rock, Pop
Stevie Wonder      1950        Michigan        Male    African-American  Soul, Jazz, Funk, R&B
Suzanne Vega       1959        California      Female  White             Alternative Rock, Folk
Ben Folds          1966        North Carolina  Male    White             Alternative Rock, Pop
Beck Hansen        1970        California      Male    White             Alternative Rock
John Mayer         1977        Connecticut     Male    White             Pop, Rock, Blues, Folk
Ryan Tedder        1979        Oklahoma        Male    White             Alternative Rock, Pop
Ingrid Michaelson  1979        New York        Female  White             Folk, Indie
Conor Oberst       1980        Nebraska        Male    White             Folk, Indie, Pop
The data gathered consisted of words that occurred within the songs of these
artists. Although the total syllable count of the chosen word was not taken into account,
each target word had primary stress on the final syllable, with one of the following rimes:
• [aiɹ] (as in “fire”)
• [ail] (as in “file”)
• [ain] (as in “fine”)
• [aim] (as in “time”)
• [aɹ] (as in “far”)
• [iɹ] (as in “fear”)
• [al] (as in “fall”)
• [il] (as in “feel”)
The rimes [aiɹ] and [ail] were the primary targets of study, since they contained the
diphthong [ai] followed by a liquid. [ain] and [aim] were included as a control group
since nasals are also highly sonorant consonants, and [aɹ], [iɹ], [al], [il] were also
included as a control group because they each contain a component vowel of the target
diphthong and one of the target liquids. Despite the presence of sonorous consonants and
constituent vowels, however, we expect all words in the control group to be pronounced
with one syllable. (Although the researchers were all in agreement on this point, Lavoie
and Cohn (1999) argued that words such as “feel”, “fear”, and “fine” were also trimoraic
and thus should be as difficult to syllabify as “fire”. Our intuitions strongly contradict this
analysis. Additionally, we will see that the results suggest [aiɹ] may have multiple
syllables for some speakers, while [il], [iɹ], and [ain] do not demonstrate this effect.)
The lyrics of each song were searched for target words, and then the four
members of the research group listened to each word within the song in order to
determine how many pitches were sung by the artist. Each researcher coded the entire
works of three artists, as well as a randomly selected subset of the other nine artists. The
result was that each token was coded by two researchers (the one who specialized in the
artist and a randomly selected second researcher) in order to ensure accuracy. If their
judgements didn’t match, the entire research group listened to and discussed the token in
order to reach a consensus. Of our 6,498 total tokens, 1,373 tokens (or 21%) required
discussion.
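The double-coding step described above can be sketched in a few lines of Python. This is only an illustration, not the research group's actual tooling: the token identifiers, the dictionary layout, and the function name find_disagreements are all assumptions introduced here, and the example codings are made up. The last line simply re-derives the reported discussion rate from the counts given in the text.

```python
def find_disagreements(primary, secondary):
    """Return the ids of tokens whose two independent pitch counts differ.

    Both arguments are hypothetical dicts mapping token id -> pitch count,
    one dict per coder. Tokens coded by only one coder are ignored.
    """
    shared = primary.keys() & secondary.keys()
    return sorted(t for t in shared if primary[t] != secondary[t])


# Made-up codings for four tokens (not real data from the study).
primary = {"tok1": 1, "tok2": 2, "tok3": 1, "tok4": 3}
secondary = {"tok1": 1, "tok2": 1, "tok3": 1, "tok4": 3}

# Tokens that would go to the whole group for discussion.
needs_discussion = find_disagreements(primary, secondary)

# Share of tokens that required discussion, from the counts reported above.
discussion_rate = 1373 / 6498
```

In this toy example only tok2 would be sent to the full group, since it is the only token on which the two coders disagree.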
Polymorphemic words which ended in the target rime (such as “higher” or “I’ll”)
or any of the control rimes (such as “we’re” or “we’ll”) were analyzed independently and
placed in a separate dataset, in order to control for the potential influence of the
morpheme boundary. In order to investigate vowel distortion and part of speech, tokens
of the word “while” were also coded for pronunciation and lexical category. Additionally,
tokens of “I’ll” from the polymorphemic dataset were coded for pronunciation as well. Due
to time constraints, the tokens in this dataset were only coded by a single researcher (the
author).
When the data was analyzed, all tokens with more than four pitches were top-
coded—that is, coded as having four pitches. This was done for a number of reasons.
First of all, we assumed that tokens with many pitches were obvious products of artistic
license rather than bizarre underlying representations. Since these tokens would be
reflecting artistic choices rather than linguistic intuitions, we reasoned that there was
ultimately no meaningful difference between 6 pitches and 7, 7 pitches and 8, or even
higher numbers. We also realized that it was difficult for the research team to judge these
tokens accurately; reaching a perfect consensus on the number of notes contained in a
highly melismatic token could be quite challenging. Of the tokens in our dataset, 108 (or 1.7%)
were coded as having four or more pitches.
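Top-coding as described above amounts to capping each pitch count at a ceiling. The following is a minimal sketch under stated assumptions: the function name top_code and the sample list are hypothetical, and only the ceiling of four and the token counts in the last line come from the text.

```python
TOP_CODE = 4  # ceiling described in the text: more than four pitches is coded as four


def top_code(pitch_counts, ceiling=TOP_CODE):
    """Cap every pitch count at the ceiling, leaving smaller counts unchanged."""
    return [min(n, ceiling) for n in pitch_counts]


# Made-up pitch counts, including two highly melismatic tokens.
raw = [1, 2, 1, 7, 4, 10]
coded = top_code(raw)

# Share of tokens at the ceiling, from the counts reported in the text.
share_at_ceiling = 108 / 6498
```

After top-coding, a token sung on seven pitches and one sung on ten are treated identically, which matches the reasoning that differences among very high counts reflect artistic choice rather than syllable structure.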
In order to handle repetitions, we considered a variety of options. One possibility
was to include all repetitions of a certain token, treating each instance as an independent
production with equal validity to all other productions. This approach, however, would
severely skew the dataset if each token was produced with an unusual effect (such as
shortening or melisma); an example of the problem is Bruce Springsteen’s song “Streets
of Fire”, where the word “fire” is repeated eleven times in the chorus with approximately
three to four pitches for each token. Another option was to keep only the first instance of
a token, treating it as the original and assuming all others were merely imitations.
However, we noticed that in some cases a pitch change would appear or disappear for a
certain repetition over the course of a song, making it difficult to treat each token as a
perfect copy of the previous one. (In the “Streets of Fire” example, one instance of the
chorus had “fire” produced with only two pitches, while another had it produced with
ten.) Another option was to take the average number of pitches for all appearances of a
token within a song, which could possibly correct for both issues; in some cases,
however, that would lead to a token being represented by a fraction of a pitch, and could
lead to a single strangely produced token skewing the entire set.
Our eventual choice was to take a random token from each set of repetitions and
to discard the others. While this did not take into account the possibility for a token’s
production to shift over the course of a song, it did prevent the possibility of skewed data
due to averaging. Additionally, an analysis with randomly selected tokens did not greatly
differ from a preliminary analysis with averages across a set of repetitions, leading us to
conclude that either approach would be acceptable. Thus we chose a random token from
each set of repetitions and eliminated the others from the analysis.
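The random-selection strategy can be sketched as follows. This is an illustrative sketch rather than the procedure actually used: it assumes that a set of repetitions can be identified by the pair (song, word), which is a simplification, and the function name, the tuple layout, and the example tokens are all hypothetical.

```python
import random
from collections import defaultdict


def sample_one_per_repetition(tokens, seed=0):
    """Keep one randomly chosen token from each set of repetitions.

    tokens is a list of (song, word, pitches) tuples. Grouping by
    (song, word) is an assumption made for this sketch.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    groups = defaultdict(list)
    for song, word, pitches in tokens:
        groups[(song, word)].append((song, word, pitches))
    return [rng.choice(group) for group in groups.values()]


# Made-up tokens: "fire" repeated three times in one song, "time" once in another.
tokens = [
    ("Song A", "fire", 3),
    ("Song A", "fire", 4),
    ("Song A", "fire", 2),
    ("Song B", "time", 1),
]
sampled = sample_one_per_repetition(tokens, seed=1)
```

Each set of repetitions contributes exactly one whole-number pitch count, so no token is represented by a fractional average.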
ANALYSIS
Syllabification of [aiL]
Our analysis consistently supported the hypothesis that some people clearly
pronounce highly moraic words such as “fire” with one syllable, while others clearly
pronounce them with two. The graphs in Figure 1 outline the number of syllables for each
individual singer, for target syllables [aiɹ] and [ail], vowel-liquid combinations (VL), and
[ai] combined with nasals. Impressionistically, [aiɹ] seems to be behaving differently than
the other rimes. Some singers show a clear distinction; the average number of pitches
used to sing [aiɹ] by Stevie Wonder is approximately two, while words in the control
group use slightly more than one. (Many artists have averages slightly above one because
of melisma.) John Mayer sings an average of three pitches per [aiɹ] rime and slightly over
one pitch for control words, suggesting that his underlying representation has more
syllables for [aiɹ] than for related words. However, some singers (such as James Taylor)
use only a single pitch for every word studied. Others (such as Bob Dylan and Ben Folds)
have less clear patterns.
Additionally, these graphs do not take into account the potential for melisma or
higher than expected numbers for the control group—Ryan Tedder seems to sing every
word with an average of one and a half pitches. They also do not address reliability—for
example, Stevie Wonder has 53 tokens of [aiɹ] while Billy Joel only has 15, suggesting
that we can be far more confident drawing conclusions from this graph for Stevie Wonder
than we can for Billy Joel.
Figure 1: Average Number of Pitches Per Rime
Further analysis corrects for these influences by fitting a separate linear regression
model for each artist, which predicts the number of pitches in each token from fixed
effects of rime type and year of composition, as well as a random effect to control for
stylistic variation among songs. Figure 2 shows the relative baseline for each artist, set by
the number of syllables used for the [VL] and [aiN] control words. For instance, Billy
Joel (WMJ) consistently sings one syllable more for the target [aiɹ] words than he does
for the control words, while James Taylor (JVT) sings exactly the same number of
syllables for target words as he does for control words. Impressionistically, it appears that
seven artists have a clear distinction between [aiɹ] words and control words (arranged to
the left of the dotted line). Five artists did not appear to have a strong distinction
(arranged to the right of the dotted line). Interestingly, only one artist, Beck Hansen,
showed a similar distinction for [ail] words, suggesting that the two liquids are treated
differently. (This might be due to [ɹ] being more sonorous than [l].)
Figure 2: Baseline Analysis
*Ingrid Michaelson did not have enough [ail] tokens for this analysis.
The next graph reports odds ratios from a logistic regression model fitted for each
artist, which predicts whether a rime will be sung with a single pitch or multiple pitches.
A binary outcome is used because it is not fully clear whether it matters that an artist chooses to use four
pitches rather than three, or three pitches rather than two, when they decide to assign
more than one pitch to a syllable. Despite the use of a binary variable, this graph bears
several similarities to the previous one, strengthening the evidence for the analysis that
some singers have a clear distinction between [aiɹ] and other rimes.
Figure 3: Odds Ratio Correction
This analysis strongly suggests that seven artists (Ben Folds, Billy Joel, John
Mayer, Beck Hansen, Ryan Tedder, Suzanne Vega, and Ingrid Michaelson) have an
underlying representation of 2 syllables for [aiɹ] rimes, while the other five (Conor
Oberst, Stevie Wonder, Bruce Springsteen, James Taylor, and Bob Dylan) have an
underlying representation of 1 syllable.
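To make the single-versus-multiple-pitch comparison concrete, the sketch below computes an odds ratio directly from a two-by-two table of counts. This is a simplification, not the model fitted in this study: a logistic regression with one binary predictor and no other terms yields the same odds ratio as this cross-product calculation, but the per-artist models here may include further terms. The function name and all four counts are hypothetical.

```python
def odds_ratio(target_multi, target_single, control_multi, control_single):
    """Odds of a multi-pitch setting for target rimes relative to control rimes.

    Each argument is a token count. An odds ratio above 1 means target rimes
    are more likely than control rimes to be sung on more than one pitch.
    """
    target_odds = target_multi / target_single
    control_odds = control_multi / control_single
    return target_odds / control_odds


# Invented counts for one imaginary artist (not data from this study):
# 30 of 40 target tokens and 20 of 100 control tokens sung on multiple pitches.
example = odds_ratio(target_multi=30, target_single=10,
                     control_multi=20, control_single=80)
```

With these invented counts the target odds are 3 and the control odds are 0.25, giving an odds ratio of 12. An artist with no contrast would be expected to show a ratio near 1.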
Although the study only had twelve subjects, making it difficult to draw broad
conclusions about population-wide patterns, there are some impressionistic trends that
might be fruitful for future research. For example, age might make a difference in
whether these words are realized with one syllable or two; as visible in Figure 4, all seven
artists with a contrast between [aiɹ] and other words were born in 1949 or later, and four
of the five artists without a contrast for [aiɹ] were born in 1950 or earlier. It is possible that
language changes over time have caused a more widespread realization of “fire” as two
syllables. Additionally, there may be a dialectal component—the three artists from the
Midwest (Bob Dylan, Stevie Wonder, and Conor Oberst, born in Minnesota, Michigan,
and Nebraska respectively) did not have a contrast for [aiɹ], while both artists from
California (Suzanne Vega and Beck Hansen) and three artists from the Northeast (Ingrid
Michaelson, John Mayer, and Billy Joel, from New York, Connecticut, and New York)
did contrast [aiɹ] with the control group. (A possible counterexample, however, is Bruce
Springsteen. Born in New Jersey in 1949, he does not display a contrast.)
Figure 4: Artists, Birth Years, Birthplace, and [aiɹ] Contrast
Artist Name        Birth Year  Location           [aiɹ] Contrast
Bob Dylan          1941        Duluth, MN         None
James Taylor       1948        Chapel Hill, NC    None
Bruce Springsteen  1949        Long Branch, NJ    None
Billy Joel         1949        Hicksville, NY     [aiɹ] contrast
Stevie Wonder      1950        Saginaw, MI        None
Suzanne Vega       1959        Santa Monica, CA   [aiɹ] contrast
Ben Folds          1966        Winston-Salem, NC  [aiɹ] contrast
Beck Hansen        1970        Los Angeles, CA    [aiɹ] contrast
John Mayer         1977        Bridgeport, CT     [aiɹ] contrast
Ryan Tedder        1979        Tulsa, OK          [aiɹ] contrast
Ingrid Michaelson  1979        New York City, NY  [aiɹ] contrast
Conor Oberst       1980        Omaha, NE          None
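The birth-year pattern in Figure 4 can be checked with a short tabulation. The birth years and contrast labels below are copied from the figure. The variable names are introduced here for illustration, and the check describes only these twelve artists, with no claim about statistical significance.

```python
# (artist, birth year, shows an [aiɹ] contrast), as listed in Figure 4.
artists = [
    ("Bob Dylan", 1941, False),
    ("James Taylor", 1948, False),
    ("Bruce Springsteen", 1949, False),
    ("Billy Joel", 1949, True),
    ("Stevie Wonder", 1950, False),
    ("Suzanne Vega", 1959, True),
    ("Ben Folds", 1966, True),
    ("Beck Hansen", 1970, True),
    ("John Mayer", 1977, True),
    ("Ryan Tedder", 1979, True),
    ("Ingrid Michaelson", 1979, True),
    ("Conor Oberst", 1980, False),
]

with_contrast = [year for _, year, contrast in artists if contrast]
without_contrast = [year for _, year, contrast in artists if not contrast]

# Earliest birth year among artists with a contrast.
earliest_contrast_birth = min(with_contrast)

# Number of artists without a contrast who were born in 1950 or earlier.
no_contrast_born_by_1950 = sum(year <= 1950 for year in without_contrast)
```

The tabulation gives seven artists with a contrast, the earliest born in 1949, and four of the five artists without a contrast born in 1950 or earlier.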
“While”, Contractions, and Parts of Speech
Further investigation was conducted into the target word “while”. Researchers’
intuitions suggested that “while” varied in pronunciation based on part of speech; for
example, it might be realized as [wail] in a sentence such as “I haven’t seen you in a
while” but as [wæl] in a sentence such as “I saw you while I was at the store”. It was
hypothesized that the diphthong [ai] appears when “while” is used as a noun, but is
reduced to [æ] when “while” is used as a conjunction.
Each token of “while” in the dataset was coded for pronunciation and part of
speech, on top of the data previously gathered. Figure 5 indicates the number of tokens
retrieved for each pronunciation and part of speech. (A small handful of tokens that used
other pronunciations, such as [wɑl] or [wɛl], were excluded.) This table shows that
instances of “while” as a noun were overwhelmingly pronounced as [wail] (57 of 59 tokens,
or 97%), while instances of “while” as a conjunction were overwhelmingly pronounced as
[wæl] (66 of 74 tokens, or 89%).
Figure 5: Tokens of “While” by Pronunciation and Part of Speech
        Noun  Conjunction
[wail]  57    8
[wæl]   2     66
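Because the counts in Figure 5 can be read either by part of speech or by pronunciation, it helps to be explicit about the denominator when converting them to percentages. The sketch below computes both readings from the four counts in the figure. The dictionary layout and the function name share are assumptions made for illustration.

```python
# Token counts from Figure 5, keyed by (pronunciation, part of speech).
counts = {
    ("[wail]", "noun"): 57,
    ("[wail]", "conjunction"): 8,
    ("[wæl]", "noun"): 2,
    ("[wæl]", "conjunction"): 66,
}


def share(pron, pos, given):
    """Percentage for one cell, relative to its part-of-speech total
    (given="pos") or to its pronunciation total (given="pron")."""
    if given == "pos":
        total = sum(n for (p, s), n in counts.items() if s == pos)
    elif given == "pron":
        total = sum(n for (p, s), n in counts.items() if p == pron)
    else:
        raise ValueError("given must be 'pos' or 'pron'")
    return 100 * counts[(pron, pos)] / total


# Of all noun tokens, the share pronounced [wail] (57 of 59).
noun_as_wail = share("[wail]", "noun", given="pos")
# Of all conjunction tokens, the share pronounced [wæl] (66 of 74).
conj_as_wael = share("[wæl]", "conjunction", given="pos")
# Of all [wail] tokens, the share that are nouns (57 of 65).
wail_that_are_nouns = share("[wail]", "noun", given="pron")
# Of all [wæl] tokens, the share that are conjunctions (66 of 68).
wael_that_are_conj = share("[wæl]", "conjunction", given="pron")
```

Read by part of speech, about 96.6% of noun tokens are [wail] and about 89.2% of conjunction tokens are [wæl]. Read by pronunciation, about 87.7% of [wail] tokens are nouns and about 97.1% of [wæl] tokens are conjunctions. Either reading shows the same strong association between pronunciation and part of speech.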
Furthermore, the data supported the hypothesis that the noun version of “while” was
produced with the same number of pitches that a speaker used to produce target words
like “file”, and the conjunction version of “while” was produced with the same number of
pitches that a speaker used to produce control words like “fall”. Figure 6 shows a selected
number of artists with their productions for all [ail] rimes, [ail] pronunciations of “while”,
all [Vl] rimes, and [æl] pronunciations of “while”. (Not all artists are represented here,
since not all of them produced significant numbers of “while” tokens; Ryan Tedder had
none at all. This table includes all artists who produced ten or more tokens of “while”.)
Figure 6: Average Number of Syllables for Various Rime Types
In order to further investigate the phenomenon, a few artists were also analyzed
for their use of polymorphemic words with the target codas. The vast majority of these
were contractions such as “I’ll”, “I’m”, or “we’ll”, with a handful of suffixed words such
as “liar” or “higher”. The procedures followed to code these were the same as were
followed for the rest of the dataset, although due to time constraints each token was only
coded a single time by one researcher. It is thus somewhat more difficult to draw firm
conclusions from these than it is for the rest of the dataset.
Figure 7: Contractions and Polymorphemic Words
Artist                   Average number of pitches  Percentage of targets sung with more than one pitch
Ben Folds (BSF)          1.03                       1.8%
Bruce Springsteen (BJS)  1.16                       19.4%
Ryan Tedder (RTD)        1.05                       5.2%
Suzanne Vega (SNV)       1.02                       2.4%
Billy Joel (WMJ)         1.08                       9.2%
Interestingly, almost all of these tokens were sung with only a single pitch,
despite the morpheme boundary. This was even true for tokens of “I’ll” in cases where
the artist has a two-syllable representation for [aiɹ] words (as did Ben Folds, Billy Joel,
Ryan Tedder, and Suzanne Vega). Analysis of the pronunciation of “I’ll” revealed that
the vast majority of the time, it was pronounced as [æl] rather than as [ail], as shown in
Figure 8.
Figure 8: Number of “I’ll” Tokens by Pronunciation
Artist  [æl]  [ail]
WMJ     12    5
BJS     27    1
RTD     12    2
BF      14    3
SNV     36    1
This provides further support for the hypothesis that speakers are likely to reduce
complex rimes like [ail] when they occur in function words, such as conjunctions and
pronouns, but not when they occur in content words such as nouns and verbs. The fact
that English has almost no instances of function words that end in the rime [aiɹ] (with the
possible and highly arguable exception of “why’re”) might also suggest that this pattern
is dispreferred, particularly if it’s true that [ɹ] behaves differently (and is more sonorous)
than [l]. It’s possible that [aiɹ] function words are unacceptable, and [ail] is pushing the
boundaries of acceptability, leading to it being reduced by the speaker into a
monophthong represented by only a single syllable.
CONCLUSIONS
Although speaker judgements for words like “fire” and “smile” can be unreliable,
music can be used to investigate individual intuitions. While this type of analysis is
difficult to use for the investigation of broad patterns, since it can only be used with a
small set of subjects, it can provide strong evidence for the syllable judgements of certain
individuals.
We conclude that some speakers have a clear distinction between [aiɹ] rimes and
other syllables, while some speakers do not. For example, we determined that artists like
Billy Joel and John Mayer exhibit such a contrast, while artists like James Taylor and
Bob Dylan were clearly lacking one.
An open question is what factors are related to this contrast. While our data
suggests that birth year might be correlated with this contrast, firm conclusions are
impossible from such a small dataset. Other possible factors include dialectal or regional
differences. Further research is necessary to determine what factors might predict
whether a certain person has this contrast or not.
We have also determined that part of speech is strongly correlated with the
pronunciation of certain rimes; diphthongs in complex rimes are often reduced to
monophthongs when they occur in function words, but are preserved when they occur in
content words. This behavior appears to be widespread across speakers. Further research
in this area might include an investigation of other diphthongs and codas to determine
whether (and in what circumstances) they are reduced.
REFERENCES
Cohn, A. C. (2003). Phonological Structure and Phonetic Duration: The Role of the
Mora. Working Papers of the Cornell Phonetics Laboratory, v. 15, 69-100.
Cohn, A. C., & Tilsen, S. (2015). Relation between syllable count judgments and
durations of English liquid rimes. Cornell University.
Dell, F., & Halle, J. (2005). Comparing Musical Textsetting in French and in English
Songs. Typology of Poetic Forms. Paris.
Lavoie, L. M., & Cohn, A. C. (1999). Sesquisyllables of English: The Structure of
Vowel-Liquid Syllables. International Congress of Phonetic Sciences. San Francisco:
University of California.
Palmer, C., & Kelly, M. H. (1992). Linguistic Prosody and Musical Meter in Song.
Journal of Memory and Language 31, 525-542.
Rodriguez-Vasquez, R. (2010). Text-setting Constraints: A Comparative Perspective.
Australian Journal of Linguistics Vol. 30, No. 1, 19-34.
Schellenberg, M. (2012). Does Language Determine Music in Tone Languages?
Ethnomusicology, Vol. 56, No. 2.
Schellenberg, M. (2016). Influence of Syllable Structure on Musical Text Setting. Poster
session presented at the 15th Conference of Laboratory Phonology, Cornell
University.
Sui, Y. (2013). Phonological And Phonetic Evidence For Trochaic Metrical Structure In
Standard Chinese (Dissertation). University of Pennsylvania.
Name of Candidate: Jenica Jessen
Birth date: December 5th, 1994
Birth place: Burley, Idaho
Address: 2108 W. Marblewood Dr. Riverton, UT, 84065