
Computational Analysis of Melodic Contour and Body Movement

Tejaswinee Kelkar

Thesis submitted for the degree of Philosophiae Doctor

Department of Musicology
Faculty of Humanities
RITMO Center for Interdisciplinary Studies in Rhythm, Time and Motion

2019


© Tejaswinee Kelkar, 2019

Faculty of Humanities, University of Oslo

All rights reserved. No part of this publication may be reproduced or transmitted, in any form or by any means, without permission.

Cover: Hanne Baadsgaard Utigard.
Print production: 07 Media, Oslo.


to ajji


Acknowledgements

Writing this thesis has been an unbelievable challenge, and a great source of enjoyment for me. I would like to take this opportunity to thank everyone who enabled me to spend three years thinking about melodic contours and their relation to the body - this act itself is an immense privilege.

Thank you Alexander for letting me do this project, and for always showing the light in all of the different projects that I was able to get involved in. Reading your work is always revealing and clarifying, and it inspires a lot of traction and hope for me to engage with music in deeper and more meaningful ways.

Thank you so much to Rolf Inge for motivating this work philosophically, and for teaching me ways to approach the cores of research problems. Every conversation with you has been extremely motivating and calming, and full of parables and things to remember for life.

Thank you very much Anne for being the director of the center of dreams, and such an inspiration. Thank you also for all your kindness. Thank you Peter, Målfrid, and Ancha for all your positivity and for making all of this work possible. Thank you Stan for your encouragement. Thank you Bipin sir for teaching me some valuable lessons early on during my master's that have guided me throughout. Thank you Bruno for the encouragement throughout the projects.

I would like to thank all the participants of these experiments for their contributions, insights, and thoughtful reflections on the experiments. Last but not least, I would like to thank each and every one of my colleagues both at IMV and RITMO for building a research environment where we thrive on discussion, learning from each other, and leaning on each other.

Thank you to my parents. It is comical to thank you because I owe everything to the way you brought me up. Thank you especially Aai for talking to me every single day and always being there. Nachi and Kadambari, I get motivated for anything at all thinking about you. Thank you Udit for being such a pillar of strength and support. Rajvi, my love, you are always here; and thank you Malathi for being the biggest remote support. Thank you very much to Chitralekha for the excellent work with copy-editing, and Dayita for the detailed feedback.

Ragnhild, my darling, for your presence in my life. I don't remember you not being there in it at all. Victoria, for teaching me a way to live here, to speak; I am really thankful for being able to share so much with you! Tore, for your insight, your nuanced and calm way of being a friend. Aine and Derek, thank you for adopting me and always being there. Mari, my first ever friend here, and a mentor for so much else. Sanskriti, thank you for all of your music, and your friendship; it has been invaluable to me and will always be. Ulf, for your humor and presence and hjørnekontoret and chicken. Ingrid, thanks for being the stjerna. Thank you Kayla for babysitting me this year, and to Lucas. Thank you very much to Charles, Victor, and Olivier for guidance and help. Thanks to my compatriots and co-warriors of thesising: Emil, Gui, Stephane, Bjørnar, Kjell-Andreas, Marek, Benedikte, Agata, and Merve. Ingeborg, for your friendship, and my Marius. Thanks to dear friends from the music / kunst miljø: Andreas, Petrine, Karoline, Kjetil, Åsmund, and Martin. I would like to thank Deepak dada, Wadegaokar guruji, Manas Vishwaroop, Kamod Arbedwar, Achal Yadav, Suyash Medh, Kelcey Gavar, and James Bunch; I will always owe you a lot.

Tejaswinee Kelkar
Oslo, November 2019



List of Papers

Paper I

Kelkar, T., & Jensenius, A. R. (2017). Exploring melody and motion features in “sound-tracings”. In Proceedings of the 14th Sound and Music Computing Conference (pp. 98–103). Aalto University.

Paper II

Kelkar, T., & Jensenius, A. R. (2017, June). Representation strategies in two-handed melodic sound-tracing. In Proceedings of the 4th International Conference on Movement Computing (p. 11). ACM.

Paper III

Kelkar, T., & Jensenius, A. (2018). Analyzing free-hand sound-tracings of melodic phrases. Applied Sciences, 8(1), 135.

Paper IV

Kelkar, T., Roy, U., & Jensenius, A. R. (2018). Evaluating a collection of sound-tracing data of melodic phrases. In Proceedings of the 19th International Society for Music Information Retrieval Conference (pp. 74–81). Paris, France.



Contents

Acknowledgements

List of Papers

Contents

List of Figures

List of Tables

1 Introduction
    1.1 Introduction
    1.2 Motivation
    1.3 Research Objective and Research Questions
    1.4 Approach
        1.4.1 Open Science
    1.5 Thesis Outline

2 Melody
    2.1 Introduction
    2.2 Speech Melody
        2.2.1 Prosody
        2.2.2 Intonation
        2.2.3 Model for a Speech–Song Spectrum
    2.3 Musical Melody
        2.3.1 Pitch
        2.3.2 Organization
        2.3.3 Recognizability
        2.3.4 Succession
        2.3.5 Shape
    2.4 Other Aspects of Melody
        2.4.1 Voice and Melody
    2.5 Discussion
    2.6 Summary

3 Auditory and Motor Imagery
    3.1 Introduction
        3.1.1 Terminology
    3.2 Auditory Imagery
        3.2.1 Auditory Scene Analysis
        3.2.2 Ecological Pitch Perception
        3.2.3 Conceptual Shapes and Music Theory
    3.3 Gestalt and Shape
        3.3.1 Melodic Gestalt
        3.3.2 Contour Continuity and Gestalt
    3.4 Motor Imagery
        3.4.1 Phenomenology of Shape
        3.4.2 Neural Correlates of Phenomenology of Shape
        3.4.3 Lines and Movement
    3.5 Summary

4 Body Movement
    4.1 Introduction
    4.2 Sound-Tracing
    4.3 Embodied Music Cognition
        4.3.1 Early Research on Music Embodiment
        4.3.2 Recent Research on Music Embodiment
        4.3.3 Auditory Working Memory
        4.3.4 Multimodality
        4.3.5 Ecological Psychology
        4.3.6 Conceptual Metaphors
    4.4 Music Related Movement
        4.4.1 Gestural Imagery
        4.4.2 Gesture and Musical Gestures
    4.5 Action–Sound Perception
        4.5.1 Ideomotor Theory
        4.5.2 Motor Control
    4.6 Summary

5 Data Sets and Experiments
    5.1 Introduction
    5.2 Stimulus Set
        5.2.1 Descriptions of the Musical Styles
    5.3 Experiments
        5.3.1 Experiment 1: Melodic Tracings
        5.3.2 Experiment 2: Melodic Tracings, Tracing Repetitions, Singing Melodies
    5.4 Data Sets Used
    5.5 Summary

6 Methods
    6.1 Introduction
    6.2 Sound Analysis
        6.2.1 Computational Representation and Features of Melody
        6.2.2 Symbolic Approaches
        6.2.3 Music Information Retrieval
        6.2.4 Extraction of Pitch Contours and Contour Analysis
        6.2.5 Pitch Detection Algorithms
        6.2.6 Contour Analysis Methods
    6.3 Motion Analysis
        6.3.1 Tracking Data through Infrared Motion Capture Systems
        6.3.2 Details for the Experiments in the Thesis
        6.3.3 Post Processing
        6.3.4 File Formats for Motion Capture Files
        6.3.5 Feature Extraction
        6.3.6 Toolboxes for MoCap Data
    6.4 Motion–Sound Analysis
        6.4.1 Visual Inspection and Visualization
        6.4.2 Statistical Testing
        6.4.3 Machine Learning and Artificial Intelligence
        6.4.4 Time Series Distance Metrics
        6.4.5 Canonical Correlation Analysis (CCA)
        6.4.6 Deep CCA
        6.4.7 Template Matching
    6.5 Summary

7 Conclusions
    7.1 Research Summary
        7.1.1 Paper I
        7.1.2 Paper II
        7.1.3 Paper III
        7.1.4 Paper IV
    7.2 Discussion
    7.3 General Discussion
        7.3.1 Verticality
        7.3.2 Imagery
        7.3.3 Voice
        7.3.4 Body
        7.3.5 Cultural Considerations
    7.4 Impact and Future Work
        7.4.1 Research in Melody and Prosody Perception
        7.4.2 Search and Retrieval Systems
        7.4.3 Interactive Music Generation
        7.4.4 Future Work

Bibliography

Papers

I Exploring melody and motion features in “sound-tracings”

II Representation Strategies in Two-handed Melodic Sound-Tracing

III Analyzing free-hand sound-tracings of melodic phrases

IV Evaluating a collection of Sound-Tracing Data of Melodic Phrases

Appendices

A Details of the Experiments
    A.1 Marker Labeling
    A.2 List of Melodic Stimuli
    A.3 File Formats for Symbolic Music Notation

B List of implemented functions and features
    B.1 Dependencies
    B.2 Data Types
    B.3 Functions
        B.3.1 Motion Features

List of Figures

2.1 Speech–Song spectrum. In this figure, I have tried to represent the many forms between speech and song. Most categories and their place on this spectrum are not assigned without debate; this figure is only indicative.

2.2 A melodic entity might be stable, as discussed above, to several kinds of variation. Which kinds of variations are most relevant depends upon the musical culture within which a similarity judgment is to be made.

2.3 A melodic entity can be recognized as belonging to one of several melodic frameworks. A framework might mean different things according to the musical context, as illustrated.

2.4 Range of melodic and sound contours from previous studies: Seeger (1960); Schaeffer et al. (1967a); Śarma (2006); Hood (1982); Adams (1976).

2.5 On the left, Palestrina's "Iubilate deo universa terra" psalm verses in neumes, first published in 1593 in Offertoria totius anni, no. 14. It is argued that neumatic notation is derived from cheironomic hand gestures indicating changes in pitch. On the right is an illustration of how neumes evolved into mensural notation, and finally into modern notation.

2.6 Tibetan Yang notation from silkscreen prints, ca. nineteenth century. This form of notation goes back to the sixth century (Collection, 2019).

3.1 A and B represent the property of isotropy in this illusion; the triangle is perceived despite scaling and rotation. In C and D, with the change of an angle, we perceive the third edge of the triangle to be slightly curved. This alludes to the properties of smoothness and locality. The lines at the corners are still straight lines, but the effect is of a curved edge instead.

4.1 Solfa gestures are used to help memorize intervals in a major scale. This method has been used quite often to teach students of singing.

4.2 Similarities between the action–fidgeting model and the posture-based motion planning theory.

5.1 A stimulus set containing 16 melodies was used for two motion capture experiments, resulting in three data sets: one for melody–motion pairs, the second for repetitions in sound-tracings, and the third for singing back melodies after one hearing.

5.2 The 16 melodies used as the stimulus set for all the experiments in the thesis come from four different music cultures and contain no words. The X-axis represents time, and the Y-axis represents pitch height in MIDI notation.

5.3 Experimental flow for Experiment 1.

5.4 Flow of Experiment 2. The repetitions and sung sections are included in this experiment.

6.1 An illustration of different levels of data handling and analyses in the experiments. The three types of data each have their own signal

6.2 Pictures from the FourMs MoCap Lab where the experiments are conducted. On the left is the lab with the cameras and speakers mounted as shown; on the right is a participant wearing reflective markers.

6.3 An example of a post-processed motion capture stick figure. A detailed list of marker labels can be found in Appendix B.

6.4 An illustration of the flow of mocap data. The mocap files are exported from QTM and imported into Python. Normalized files are re-exported and analyzed in Python to obtain features.

6.5 Visualizations of quantities of motion. Visual inspection

6.6 A representation of the CCA algorithm used; fa and fb represent the two different data sets—melodic and motion features, respectively.

A.1 The 16 melodies used as the stimulus set for all the experiments in the thesis come from four different music cultures and contain no words.

List of Tables

6.1 The features extracted from the motion capture data to describe hand movements.

6.2 Quantitative motion capture features that match the qualitatively observed strategies. QoM refers to quantity of motion.

Chapter 1

Introduction

Melody, moving downstream.
A string of barges
Just lit against the blue evening
The fog - giving each light - a halo
Moving with the river but not the drift

a little faster perhaps,
or is it slower?

A singing
Sung if it is sung quietly

within the scored crashing and the
almost inaudible hum impinging

upon the river's
seawardness.

- Denise Levertov (Levertov, 1983)

1.1 Introduction

Many people think of melodies as having contours. The association between musical melodies and the visual representation of a contour—an arch, a rainbow, a zigzag line, a circle, and so on—appears to be united in our minds, and represents an essential quality of melodic identity. Lessons in songwriting often teach students how to diversify their thinking by creating contrasts with contours, although there are few formal methods to analyze contours, and they are mostly based on the analysis of sheet music.

'Contour' also has signification beyond the visual representation of music. Be it the evocative nature of the optimistic arch-like melodies of Disney princess theme songs, the use of large arch-like leaps of a sixth or above in these 'princess' songs, ranging from Somewhere Over the Rainbow in The Wizard of Oz to Mulan's Reflection; the linear contours in Alban Berg's compositions, as discussed by Perle (p. 86); or the descending contours of many lament song forms: melodic contours have signification for us beyond just being musical artifacts. Representations of contours on paper range from neumes and squiggly lines drawn above musical notation to remind the reader of the contours of a musical phrase, to representing these phrases with hand movements during improvisation, as is done, for example, in the Hindustani musical tradition.

In this thesis, I investigate why we think of melodies as contours, and how this differs from symbolic or score-based representations of melodies. A primary objective of this thesis is to understand how people move to melodies. I come to this from the perspective of a vocalist, which is my entry point to these research questions.

1.2 Motivation

I started thinking about melodic contours and the embodiment of music when I studied Hindustani vocal music as a child. While improvising, musicians would use elaborate hand gestures to accompany melodic phrases. When I was new to this musical genre, it seemed as though only experts were allowed to use gestural improvisation upon mastering the style. However, regardless of their expertise, different singers use various styles of gestural elaboration that communicate improvised melodic phrases to the audience, through their hands, heads, and facial expressions. These movements not only serve a communicative function, but they also assist the process of singing through the manipulation of resonance centers, and facilitating breath supply. During lessons, my teachers would illustrate nuances in the melody with hand movements, which helped me understand the phrasing faster than in their absence, an experience that is well documented by several learners of this style (Pearson, 2016). Over time, movement-based representation seemed the only way to understand the intricacies of melodic phrasing.

When I learned operatic music, however, this particular method of using visual shape to understand melody was completely absent. I also noticed that the body was used differently in this cultural context. The idea of cultural body use has been explained in research (Kimmel, 2008, p. 77). The vocabulary of gestures and visual metaphors for melodic shapes that is integral to Indian music is simply not the one used in operatic singing. Instead, melodies were visualized using shapes that related to the resonance centers in the body activated in different vocal transitions. Students like me, in trying to replicate the body movements of classical Indian music in operatic singing, ended up influencing the articulation of the voice in this new style. I wondered why this might be—what is the relationship between the singer's body and the melody? How is melodic imagery related to abstract shapes? What are the different ways in which melodic contour is enumerated, understood, and used? Furthermore, I was interested in the intersubjectivity of the experience of these shapes. How do people differ in their conceptualizations of melodic shape, and why?

Experiences like this with my training as a vocalist led me to read research on the use of hand movements in Hindustani music, and their semiotic and cognitive implications. While researching this topic for my master's thesis, I came across the semiotics of these gestures and their pedagogical function. I analyzed video recordings of performers singing the same set of ragas, to understand their specific use of hand gestures to represent rhythm, vowels, and phrase termination.

Although my master's research project on this subject was restricted to Indian music, the concept of melodic shape is not. Visual representations of melodies and melodic contours are found in places ranging from ancient notation forms, such as neumes, to modern music visualizers; melodic shape appears to be a robust concept that requires further investigation in parallel with research on how the body is used to express this motor imagery. I have investigated these shapes in this thesis, using a set of vocal melodic stimuli, and have asked people to respond to the music physically, using actions. Inevitably, this subject deals with how we remember and imagine melody, which in turn is also influenced by the affective content of melodic inflection. I have brought together the aspects of contour dealing with motor imagery, melodic memory, and melodic affect in this thesis by presenting articles that deal with each in different ways.

I have used real sound recordings as stimulus material. Much research in the area of melodic contour perception and analysis involves the use of isochronous melodies using MIDI and symbolic representation. However, by using isochronous and symbolic melodies, we lose out on a range of information that contributes to melodic contour perception. In order to avoid this, I have used signal processing methods and continuous motion capture to reflect the multimodal nature of melodic contours. Ultimately, the goal is to learn more about melodic, and by extension pitch, perception.

1.3 Research Objective and Research Questions

The primary research objective of this thesis is to:

Understand the role of embodiment in melodic contour perception using the sound-tracing experimental methodology.

The following postulates form the basis of the questions explored in this thesis:

1. Melodies have contours or lines.

2. Melodies and lines are thought of as having movement.

3. This movement is imagined and can be represented physically.

4. We learn about melodic perception by studying these movements.

From this primary objective, the following research questions emerge:

RQ 1: How do listeners represent melodic motion through body movement, and how can we analyze motion representations of melodic contours?

The primary objective of this thesis is to understand melody through motion. I explore the idea that the vocabulary for melodic undulation is tied to the language of describing melodies. When we ask people to represent these contours without the use of language, what do we find, and what do these findings suggest about contour perception?

To fulfil this objective, I asked people to “draw” these contours in the air, and recorded them using motion capture technology. Motion traces were then analyzed and cross-compared to reveal patterns in the movements.


In this thesis, I have referred to people's intentional body movements as movement, and to the data captured from these movements as motion, following their use in motion capture systems. I believe that the term movement, unlike motion, intrinsically references intentionality. As such, I have been consistent about this distinction between movement, as it is performed, and motion, as data gathered from body movements, which could pertain to motion with or without intentionality. The word trajectories is used for the motion traces in motion data, pertaining to body parts that are tracked using markers.

RQ 2: What are the characteristics and applications of motion representations of melodic contours?

Since contour representations rely on imagining contour movements as shapes, it is only possible to reflect upon phrasal shapes prospectively or retrospectively, trying to anticipate such a shape or to remember it. Melodic shapes may also be context dependent, and differences in vocal styles and genres can influence tracings. This information is typically lost when we transcribe melodies down to discrete pitches and play them back from, for example, a MIDI sequence, which is a common way of conducting contour-related experiments.

RQ 3: How can we test if motion related to melodic contours is consistent: a) within participants, and b) across participants?

By comparing several people's motion representations of contours to find commonalities, I analyzed whether each participant's motion representations can be modeled individually, representing a consistent mental model of melodic contour.

Another sub-objective is to develop technology for the analysis of paired sound and movement. This involves creating toolboxes and libraries that facilitate the analysis of sound and motion. The work done to achieve this objective includes the creation of a stand-alone library to analyze motion capture data from sound-tracings and to extract various features from those data.

RQ 4: Can we build a system to retrieve specific melodies based on sound-tracings?

Could our understanding of sound-tracings and melodic embodiment be used to build systems that retrieve specific or similar melodies? This problem might contribute both to the literature on interactive interfaces and to music information retrieval applications.

The answers to research questions 1–3 can provide data to help build a retrieval system that explores this question. Such a system would require learning across two different paired modalities. For this, it would be necessary to model tracing-based representations and their variations well.


1.4 Approach

Interdisciplinarity

I draw on research in music cognition to investigate the multimodal interactions between music and movement, and contribute to the computational methods for their analysis. The aim has been to combine these approaches, which are already interdisciplinary, to deal with music information using state-of-the-art algorithms and tools, and to investigate both their fit for human transliterations of data and the other way around. Melodic grammar and perception have been modelled algorithmically in a number of ways to perform tasks that are quite simple for humans, such as identifying melodic similarity.

Owing to its interdisciplinary approach, this thesis handles terminology from three domains: auditory perception, motor action, and abstract imagery and its geometry. The central idea is that melodic contour perception gives rise to shape imagery, which is realizable through motor action. In other words, there is something movement-like about melodies, and we can think of them as having a geometric structure.

There are three levels of sound and movement representations handled in this thesis: physical, digital, and perceptual. Physical representations of sound and movement are included in the data from direct recordings. Digital representations of sound include transcriptions of melodies and transformation of contour data into the symbolic domain. Perceptual descriptors of sound and motion include computational models that mimic perceptual qualities, such as 'smoothness' of movement and 'loudness' of sound.

More specifically, for auditory descriptors, I refer to three different levels of concepts: the acoustic, psychoacoustic, and musical. Acoustic analysis includes features that are calculated mathematically; for example, the energy of a sound signal or its spectral features. Psychoacoustic features can also be approximated through computational methods, such as the perceptual loudness of a sound signal. Musical features of a sound stimulus are different from these, and may be embedded within a musical culture; for instance, cadence or specific intervals. They may also be psychoacoustic approximations that are understood as having a 'musical' quality, such as melody or intervals.
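To make the distinction concrete, the following sketch computes two acoustic features (short-time energy and spectral centroid) and one psychoacoustic approximation (A-weighted power as a rough stand-in for perceptual loudness) from an audio file. It is a minimal illustration using the librosa library; the file name and parameter choices are placeholders rather than the thesis's actual analysis pipeline.

```python
import numpy as np
import librosa

# Load a mono audio signal (placeholder file name).
y, sr = librosa.load("melody.wav", sr=None, mono=True)

# Acoustic features: computed mathematically from the signal itself.
rms = librosa.feature.rms(y=y)[0]                            # short-time energy
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)[0]  # spectral feature

# Psychoacoustic approximation: A-weighted power spectrogram as a rough
# stand-in for perceptual loudness (a full loudness model would go further).
S = np.abs(librosa.stft(y)) ** 2
freqs = librosa.fft_frequencies(sr=sr)
loudness_db = librosa.perceptual_weighting(S, freqs)

print(rms.shape, centroid.shape, loudness_db.shape)
```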

This research deals with human body motion, or the 'response' material for the perceived qualities in melody, as data that could be purely physical or perceptual. Analysis of these data involves, for example, the notion of an effector, which refers to any body part that might be involved in carrying out a movement (hands, legs, the head), or an object held by or attached to the body, such as a wand. Whether body movement is measured using cameras, motion capture, and so on also dictates how the data are obtained, and what these data can show us. In this thesis, I mainly present motion capture data, which give precise 3D positions using infrared markers. Some measures calculated from these data are physical, such as 'Quantity of Motion', whereas others are based on modeling movement perception, like smoothness. As such, this is comparable to acoustic and psychoacoustic features. The details of motion features are explained in Chapter 6.
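As an illustration of the difference between a physical and a perception-motivated motion feature, the sketch below computes one common formulation of Quantity of Motion, the summed speed of all markers per frame, and a simple jerk-based smoothness measure from an array of 3D marker positions. The array shape, frame rate, and function names are illustrative assumptions, not the toolbox described in Appendix B.

```python
import numpy as np

def quantity_of_motion(pos, fps=240.0):
    """Physical feature: summed marker speed per frame.

    pos: array of shape (frames, markers, 3) with 3D marker positions.
    Returns a 1D array of length frames - 1.
    """
    vel = np.diff(pos, axis=0) * fps        # velocity per marker
    speed = np.linalg.norm(vel, axis=2)     # speed magnitude per marker
    return speed.sum(axis=1)                # summed over all markers

def mean_squared_jerk(pos, fps=240.0):
    """Perception-motivated feature: lower values suggest smoother movement."""
    jerk = np.diff(pos, n=3, axis=0) * fps**3  # third derivative of position
    return np.mean(np.linalg.norm(jerk, axis=2) ** 2)

# Random-walk data standing in for a recorded tracing (1000 frames, 21 markers).
pos = np.cumsum(np.random.randn(1000, 21, 3), axis=0)
print(quantity_of_motion(pos).shape, mean_squared_jerk(pos))
```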


Despite how far research on melodic modeling and pitch perception has progressed in recent decades, there is a lot about melodic contour perception, and its entanglements with speech, that we do not yet comprehend or are still trying to model (Schmuckler, 2004). The key tenets of embodied music cognition hold that we can understand nuances of sound perception by understanding how the body reacts to sound and music in the environment, as several studies of beat perception and bodily entrainment have shown.

Through this thesis, I have approached melody in a similar way. What can our knowledge of embodiment add to our understanding of melodic perception, and how can we use it to inform music information systems? How can we improve interactive interfaces for the creation of music using this knowledge?

Nymoen et al. (2013) have explored the idea of "active listening", where we control the listening experience of musical stimuli with our bodies, using devices that track body motion. This lets us actively control, for example, the speed of playback or the triggering of samples through movement. In research involving both movement and music, these elements have often been treated separately (Müller, 2007). Even if we understand music-related movement well, the question of how it informs music analysis can still be answered in ways that allow us to explore the embodied nature of melodic perception: for instance, by designing experiments that pay attention to our natural instinct for representing, with our bodies, music in the context in which it is heard. In other words, incorporating embodied listening into the practice of music analysis is a goal of this work.

Limitations and Scope

I have focused on analyzing melody–motion data pairs in ways that best illustrate spontaneous melody–motion associations. The broad results of this study imply that metaphoric thinking is natural to most participants, regardless of their explicit experience with movement and melodic motion. I explain the details of movement metaphors in Chapter 4.

Even though this thesis touches on theoretical perspectives in speech prosody and contour perception in speech, I am unable to get into the experimental analysis of melodies across speech and music due to time constraints. Still, I find it important to mention a range of studies in this domain, because the exaggerated contours of speech melody form an important part of our early experiences with melodic contour. The extent to which these experiences contribute to the cognition of 'musical' melodies is widely debated (Patel, 2010; Zatorre and Baum, 2012).

The melodic stimuli used in this thesis are from four different music cultures: classical vocalise, scat singing, Hindustani classical singing, and the Sami joik. However, this work does not involve cross-comparisons of melodic grammars within these music cultures. I do discuss implications of the participants having prior knowledge about the use of the body in some of these genres. But the experiments are not intended to highlight 'cross-cultural' differences; they are not designed to be able to comment on them, and socioeconomic and geocultural factors are outside the scope of this thesis.

Some findings of this thesis are also relevant to understanding gender differences in a movement analysis context. While it could have been interesting to comment on gender differences in music-related movement, it is outside the scope of this thesis to discuss these findings in light of gender theory. I will focus instead on the findings from the data analysis that are directly connected to the research questions.

1.4.1 Open Science

All the articles published in this thesis are open access. In addition, the data sets and code for running all experiments have been documented and released online. This makes the work comply with the principles of open access and open data. The papers, code, data, and descriptions have been released on my website at

http://tejaswineek.github.io

1.5 Thesis Outline

The thesis consists of two parts. The first part is an introduction to the theoretical frameworks, research motivations, methods, and experiments conducted. The second part is a collection of papers published in various peer-reviewed journals and conference proceedings. In the summary section of the thesis, I introduce the articles and elaborate on the key findings of the research.

The main problems posed in this thesis relate to body movement and melody. Chapters 2, 3, and 4 offer an overview of the theoretical motivations and key concepts surrounding the interaction of music and motion, specifically melodic contour. Chapter 5 describes the data sets and experiments conducted. Chapter 6 presents the main frameworks or disciplinary areas that inform the work in this thesis, including analysis methods and technologies used. Chapter 7 provides a summary of key results in each of the appended articles, and additional results that were not included in the articles for various reasons. In addition, Chapter 7 offers a discussion of how the research questions raised in this introduction have been answered, and presents applications of the research done, elaborating on future work.

All papers published in this thesis are openly available, not just on the university website; they are written as open access articles. The data sets are also publicly available and released at links mentioned in Chapter 5, and on my personal website. Appendix A describes details of the experiments, including the stimuli and details of motion capture. The code written for the analysis is also open source, accessible on my personal website, and described in Appendix B.


Chapter 2

Melody

Actually, almost any note
can be played if
there is a melodic shape to the line.

- Bob Mintzer (2004, p. 24)

2.1 Introduction

The above quotation summarizes how many musicians think about improvisation: music as melodic shapes rather than as notes. Jazz improvisation is often a combination of several practiced 'licks' and phrases in the repertoire of an improvising musician, which are played in different combinations and over different scales. I find this interesting because melodies are considered both a cluster of intervals and a contour unfolding over time, as if the contour properties overrule the effects of the intervals. But do they?

Think about a familiar melody, say, Twinkle Twinkle Little Star. When remembering a melody, many people recall it 'as a whole', at a faster pace than they would actually sing it. However, while speeding through the melody, they do not distort the durations of the notes (Andrews et al., 1998). People also recall a melody by imagining singing it under their breath, but the speeding-up aspect is most interesting to me. In mental recall of melody, people are less likely to have a clear image of the actual intervals, and may be more invested in the act of singing through the contour. To me, this demonstrates many properties of melodies in our imagination: they are compressible and expandable, and they are transposable to any key and octave. A melody is accessible to us as one holistic object, with its contour, form, and rhythm embedded in the melodic entity or phrase. This property of resilience is reflected in our ability to tolerate badly sung renditions of known melodies. We are capable of smoothing over details in a melody when we are listening to, for instance, children trying to repeat a melody that they do not know well. As such, Mintzer's quotation rings true: the melody is what it is, as long as the line holds its shape. So, either melodies themselves are robust, or we are forgiving of melodic distortions, or both.

Exaggerated contour explorations are particularly common in infant babble, a slow exploration of the apparatus of enunciation, from vocables and vowels to non-speech sounds. Children repeat melodic contours and melodic phrases over and over during play. These infant melodies are essential to the development of speech and hearing. Specifically, contour acquisition is as important for understanding the emotional nuances of speech as it is for remembering musical melodies. Whether our nuanced understanding of contours is developed for speech or for music is not a question this thesis explores; throughout history, though, researchers have wondered about the crossovers between speech melodies and song, song-like speech, and speech-like song in various musical cultures.

A classic example of the speech–song illusion is Diana Deutsch's Sometimes Behave So Strangely, where looped replay of these spoken words somehow makes the phrase sound like a tonal melody, to the point that if the phrase is encountered within a speech excerpt, it automatically sticks out as a 'song' (Deutsch et al., 2008, 2011).

The idea of speech–song is not new, especially to the internet generation, as speech and interview excerpts can easily be transformed into catchy songs by mobile applications. Every time a fragment (such as the 'So Strangely' one) that we have learned to hear as a tonal melody plays, we switch to the song mode of listening on hearing the first syllable, even before we hear the whole melody.

This property of recall for familiar melodies makes them a lot like objects or icons: identifiable from the onset and resilient to variability. This holistic perception of a melody is what I have chosen to work on as the theoretical goal of this thesis. Why is it that melodies are understood as wholes, and that we are able to think of them as stable shapes while simultaneously understanding how they unfold in time? These properties are similar to the phenomenological understanding of the geometry of shapes, which I discuss in detail in the next chapter.

In this chapter, I introduce the central theme of this thesis: melody. In Section 2.2, I show the interconnections between speech melody and melodic cognition, and how we sometimes hear speech melodies as having musical qualities. In Section 2.3, I explain five essential characteristics of melodies and how they relate to melodic contour. I also discuss how melodic entities are established in various music cultures, the peculiarity of vocal melodies, and the connection between verticality and melody. I detail what specifically about contour is of interest thereafter, in Section 2.5.

2.2 Speech Melody

Is there melody in speech? The ups and downs of speech intonation across different languages are studied in detail in linguistics as prosody and intonation. Prosody refers to the suprasegmental properties of speech, such as the modulation of voice pitch, the durations and stresses of syllables, and fluctuations of loudness; the pitch-curve-like properties are studied more frequently as intonation. The connection between speech melody and musical melody has always been at the forefront of discussions on the definitive aspects of melodies. To quote Bolinger, “Since intonation is synonymous with speech melody, and melody is a term borrowed from music, it is natural to wonder what connection there may be between music and intonation.” (Bolinger and Bolinger, 1986, p. 28)

Speech is often also accompanied by gestures: by movements of the hands, head, or body (McNeill, 1992). The relationship between speech melody and co-speech gestures will be discussed in detail in Chapter 4. Here, I will discuss contours of speech melodies in more detail.

The systematic study of speech accents and speech contours across different languages may not directly relate to the study of contours in musical melodies. However, intonation contours are studied using fundamental pitch extraction and by annotating high and low points in order to discuss contour families in different languages (Ladd, 2008; Bolinger and Bolinger, 1986; Wittmann, 1980). I would like to draw upon this approach for discussing musical melodies. Musical melodies, by contrast, are mostly studied in the form of pitch transcriptions and symbolic notation, and contour is traditionally not analyzed unless the research is explicitly about melodic contour.
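As a concrete example of the fundamental pitch extraction mentioned above, the sketch below estimates an F0 contour from a recording with the pYIN algorithm as implemented in librosa. The file name and frequency range are placeholder assumptions.

```python
import numpy as np
import librosa

# Load a speech or singing recording (placeholder file name).
y, sr = librosa.load("utterance.wav", sr=None, mono=True)

# pYIN estimates an F0 value per frame; unvoiced frames are returned as NaN.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
)

# High and low points of the voiced contour, in the spirit of
# intonation analysis.
voiced = f0[~np.isnan(f0)]
print("F0 range: %.1f-%.1f Hz" % (voiced.min(), voiced.max()))
```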

2.2.1 Prosody

The word 'prosody' can be traced back to ancient Greek. A combination of two words that meant 'towards song', it was used to mean 'song sung to music' (Nooteboom, 1997). Prosody generally refers more to the timing-related elements of speech than to pitch levels. For example, in a rhyming poem, the timing of a metrical foot is used to give rhythmic sense to spoken words, as seen in poetry from Shakespeare and Kabir to Matsuo Basho and Kendrick Lamar. In clever poems, such as Jabberwocky, Lewis Carroll plays with pseudo-words that make poetic sense so long as the prosodic context is maintained. Using event-related potentials (ERPs) in the brain, Pannekamp et al. (2005) found that prosodic, not segmental, cues are responsible for phrase-boundary detection in language.

Another interesting example is that of auctioneers' speech. Studies of auctioneers and their rapid speech reveal that they use a programmatic language subset that requires rehearsal of fast utterances in order to reproduce that speech quickly. Vowels and syllables fuse with each other in a phenomenon referred to as coarticulation (Kent, 1977), which gives rise to a new speech that sounds almost intelligible but requires familiarity and training to comprehend, even though its contents are rooted in the basic principles of the operating language (Kuiper and Haggo, 1984).

The above examples show how our perception of an utterance as 'melody' does not require that it belongs to a scale or tonality framework. We are capable of understanding and appreciating speech utterances as melodies, using contour profiles to guide us.

2.2.2 Intonation

In linguistics, intonation refers to three different levels of phonological organization (Ladd, 2008, pp. 1–6): suprasegmental, referring to pitch, stress, and quantity; post-lexical, which refers to pitches, whole phrases, or sentences; and linguistically structured, referring to how sentence- and phrase-level intonational features interact with the variable states of the speaker (for example, degree of arousal and so on). A four-level structure containing linguistic segmental, linguistic suprasegmental, paralinguistic, and kinesic features has been described in spoken communication (Wittmann, 1980). This is split into proximal and distal attributes. The study of 'distal' attributes of language, such as intonation, is usually considered to operate at the suprasegmental level and above. This means that intonation conveys meaning in a certain set of contexts, rather than the words themselves. For example, we understand that a statement is a question, even if we do not hear the words, because of the intonation contours codified in the spoken phrase, even in the absence of a question word.

There is usually a consensus on intonation curves across native speakers of a language. Despite this, large variations in intonation are understood as dialects of any given language across its geographical spread. We even perceive pseudo-language, or imitations of language, as passable based on intonation contours. Mehler and Dupoux (1992) found that babies as young as four days old were able to distinguish between intonation patterns in their mother tongue and a foreign language. Mora suggests that "discourse intonation, the ordering of pitched sounds made by a human voice, is the first thing we learn when we are acquiring a language" (Mora, 2000, p. 149). Intonation, however, is not explicitly musical, nor a speech melody with a musical purpose. In essence, three properties are said to separate speech melodies from musical melodies (Patel, 2010):

1. Declination: The presence of fixed sentence-level contour structures for spoken languages, which are different for different languages.

2. Tonality: We perceive tonal relationships in most musical melodies, but not in speech melodies.

3. Diversity of linguistic intonation: Speech melodies may contain a larger number of 'intervals' than musical melodies.

Micro-inflections in intonation are essential to understanding emotional affect. Picking up on a friend's mental state even before they have articulated it for themselves, questioning someone's enthusiasm based on the tone of their affirmation, and so on, are just some examples. This topic has gained much traction lately, especially in the era of 'smart' voice assistants such as Amazon Echo, Google Home, and others. Computational analysis of affect perception in speech melodies will probably become more important in the future, in applications such as robotic caregiving.

2.2.3 Model for a Speech–Song Spectrum

In Figure 2.1, I propose a model for analysing a range of genres and musical and speech forms, to understand the many forms of melodic and poetic utterance on a continuum between speech and melody. If we consider the extremes of the spectrum to be 'full speech' and 'full song', then several forms lie in between. I have tried to classify the forms more related to rhythmicity and prosody, such as poetic forms with a fixed number of syllables and several forms of chanting that contain rules for syllable pronunciation, and perhaps an overall contour framework.

Figure 2.1: Speech–Song spectrum. In this figure, I have tried to represent the many forms between speech and song. Most categories and their place on this spectrum are not assigned without debate; this figure is only indicative.

Somewhere in the middle are the operatic forms, such as recitative, which is written in musical scores but consists of the characters' dialogue, propelling the story, and Schönberg's Sprechstimme. Then there are the more song-like rap melodies, with contour directions that must be obeyed: rhythm alone does not propel rap music; the contours of the phrases are also important to the style. Laments in some traditions are based on singing through pitch contours over and above 'hitting pitches' (Tolbert, 1990). At the very right are musical melodies; melodies without text, which use the voice idiomatically as a musical instrument, are placed at the extreme. The proposed model is not perfect, but it is the start of what we could think of as a spectrum from speech to song, and it is interesting to discuss how some phrases, upon repetition, move to the extreme ends of this spectrum.

2.3 Musical Melody

Melody is often described as the 'salient' or 'hummable' musical line, often found in higher pitch registers than the 'rest' of the music. Despite this operational definition, we hear melody in many sounds around us, such as birdsong, poetry, and repeated fragments of speech. It is interesting how melodies, with their pitched and rhythmic identities, are often described as lines; in turn, these lines are described as having contours, which means that they have an outline, or contour, that unfolds over time. I would like to stress, in particular, that I am mainly interested in the holistic nature of melodic perception. In discussions about melodic typologies, especially contour typologies, musicologists often focus on pitch classes and pitch relationships. However, linguists generally use fewer categories of pitch levels to describe linguistic intonation. So, how is musical melody studied?

In a chapter titled 'Pitch and Pitch Structures' in the book Ecological Psychoacoustics, Schmuckler (2004) endeavors to provide a framework for pitch perception from an ecological perspective. This approach draws primarily from the work of J. J. Gibson from the 1950s, and has found multiple applications, particularly in explaining visual perception and movement. An ecological-psychoacoustics approach encourages us to study perceptual properties at the behavioral level. Schmuckler states that in order to adapt this approach into a meaningful study of pitch perception, we might focus our attention instead on the apprehension of pitch objects—perhaps studying melodic objects and how we hear them. In this thesis, I find it important to study melodies 'as they are heard' from different cultures, and as they are vocalized or sung, instead of trying to reinterpret them as isochronous sequences.

Additionally, it is important to locate the experiments in this thesis at the right level in the hierarchy of musical importance. Many experiments on pitch perception take a bottom–up approach, studying the perception and cognition of single tones and the organization of scales in experimental conditions. However, as Schmuckler points out, "The alphabet (an alphabet of pitch materials to which the rules for creating well-formed patterns are applied), however, is often defined in terms of tonal sets, implicitly building tonality into the serial patterns" (Schmuckler, 2004, p. 282). Although tonality determines the well-formedness of melodies, it certainly does not reflect the ways in which, for example, children play with melodies.

In general, melodies are understood as the salient, linear, and hummable monophonic abstractions of musical expression. In polyphonic and homophonic music, melody is often written as the topmost voice. The Concise Oxford Dictionary of Music defines melody as “A succession of notes, varying in pitch, which has an organized and recognizable shape.” (Kennedy et al., 2013). The key takeaways from this definition are the following words: 1. pitch, 2. organized, 3. successive, 4. recognizable, and 5. shape.

2.3.1 Pitch

Melody, in its everyday definition, is said to comprise pitches. Pitch has an everyday description—the "stuff music is made up of"—and is defined as that attribute of music which can be ordered on a scale from low to high (Fuchs, 2010, p. 71). Pitch and melody rely on each other circularly for their definitions. Using the words 'low' and 'high' to describe pitch sets the stage for considering pitch perception as directional, and moreover as having spatial orientation. Many languages describe pitch in terms of dullness and brightness, suggesting that this perception of brightness corresponds to the periodicity, and by extension the frequency, of sounds (Shayan et al., 2011). Modeling pitch successfully requires an understanding of how pitch is perceived, and how it behaves in different musical contexts.

Frequency and Pitch Models

Frequency—mathematically defined as the number of repeating cycles of a regular signal—and pitch are related but different. Pitch is a psychoacoustic component, while frequency is a physical measure. Psychoacoustic components rely on perception to be realized; this means that pitch does not exist without a listener. Studies have revealed much about pitch perception, including the auditory perceptual scale for pitch discrimination, or just-noticeable difference, which varies across the range of human hearing.
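The standard mapping between the physical and the musical description of pitch is logarithmic: with A4 = 440 Hz as reference, a frequency f corresponds to MIDI note number 69 + 12 log2(f/440). A minimal sketch of this well-known conversion:

```python
import math

def hz_to_midi(f, reference=440.0):
    """Map a frequency in Hz to a (fractional) MIDI note number."""
    return 69 + 12 * math.log2(f / reference)

def midi_to_hz(m, reference=440.0):
    """Inverse mapping: MIDI note number to frequency in Hz."""
    return reference * 2 ** ((m - 69) / 12)

print(hz_to_midi(440.0))  # 69.0 (A4)
print(midi_to_hz(60))     # ~261.6 Hz (middle C)
```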

In and of itself, a pitch model can be a mathematical abstraction of pitch processing, a physical model of the hearing apparatus, or a description of neural firing in response to pitch stimuli. What we get from a model depends on what it is built for, and what we wish to obtain from it. De Cheveigne describes models and what they are useful for:

A very broad definition [of a model] is: a thing that represents another thing in some way that is useful. This definition also fits other words such as theory, map, analogue, metaphor, law, etc. ... "Useful" implies that the model represents its object faithfully, and yet is somehow easier to handle and thus distinct from its object. Norbert Wiener is quoted as saying: "The best material model of a cat is another, or preferably the same, cat." I disagree: a cat is no easier to handle than itself, and thus not a useful model. Model and world must differ. (de Cheveigne, 2005, p. 3)

Since pitch is a psychoacoustic (not a physical) phenomenon, some have argued that it is impossible to model pitch without a mind. However, models that closely approximate how pitch is abstracted are used for various applications. Most models for pitch estimation annotate absolute pitch, but as humans, we seem much better at approximating relative pitch.

Computational models of pitch perception rely on an understanding of the apparatus of pitch perception, which is curious in the case of auditory perception. Pitch perception remains stable despite a missing fundamental, that is, even when the lowest partial is absent from the spectrum. Pitch is perceived as stable over varying factors, such as the amplitude, duration, and spectrum of the stimulus. Melodies are also stable over a range of factors. Transpositions do not, for example, throw us off—we can identify melodic phrases in a wide range of transpositions. Moreover, melodic phrases remain unchanged when played on a range of instruments, and sometimes, distortions of scale and intonation do not disrupt the identification of melody. Lastly, we are able to recognize a melody as ‘the same’ across a large range of time variations. This means that we are able to perceive structural embellishments as external to melodic identity. Thus, we are seemingly able to construct a skeletal schema for melodic identity that is extremely robust.
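The transposition invariance described above can be illustrated with a short sketch: the sequence of intervals, and hence the contour, is unchanged when a constant is added to every pitch. The pitch values are illustrative MIDI numbers.

```python
def intervals(pitches):
    """Successive semitone intervals; unchanged under transposition."""
    return [b - a for a, b in zip(pitches, pitches[1:])]

theme = [60, 62, 64, 65, 67]          # an ascending phrase in C major
transposed = [p + 5 for p in theme]   # the same phrase, a fourth higher

assert intervals(theme) == intervals(transposed)
print(intervals(theme))  # [2, 2, 1, 2]
```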

Pitch Perception

We understand speech intonation and melody through variations in pitch. The study of pitch perception, in trying to understand the local effects that melodic contexts have on pitch, incorporates a wide range of questions that encompass the breadth of our hearing spectrum. Being able to abstract a fundamental pitch, or an approximation of a fundamental frequency, from a wide range of timbres and spectral shapes seems to be a unique property of human hearing. Physiological models of pitch deal with the shape and biological properties of the cochlea, and the coding of tonotopy in the cortex. Algorithmic models try to compute pitch using time- or spectrum-based signal processing methods. Pitch perception and melodic contour are closely related, but some differences are significant.
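As a toy example of a time-domain algorithmic model, the sketch below estimates a fundamental frequency from the autocorrelation peak of a short frame. This is a deliberately minimal illustration, assuming NumPy and a clean, quasi-periodic input; production methods such as YIN add normalization and post-processing on top of the same idea.

```python
import numpy as np

def autocorrelation_pitch(frame: np.ndarray, sr: int,
                          fmin: float = 60.0, fmax: float = 800.0) -> float:
    """Estimate f0 (Hz) of a quasi-periodic frame from its autocorrelation.

    The strongest autocorrelation peak within the allowed lag range
    corresponds to the period of the waveform.
    """
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min, lag_max = int(sr / fmax), int(sr / fmin)
    best_lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
    return sr / best_lag

sr = 22050
t = np.arange(sr // 10) / sr
print(round(autocorrelation_pitch(np.sin(2 * np.pi * 220 * t), sr), 1))  # ~220
```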

Melodic contour is clearly present in both music and language perception, but it is hard to find an inclusive definition of melody that applies to both language and music. Broadly, Patel defines contour as “a melody’s pattern of ups and downs of pitch over time without regard to exact interval size” (Patel, 2010, p.99). Experimental research has suggested that contour is a lower-level perceptual feature, in that we acquire it in early childhood (Trehub et al., 1984). This research also shows that infants are sensitive to directional changes in melodies. In 1994, Dowling et al. experimented with the identification of unfamiliar or unknown melodies, to understand the role that contours play in their recognition. Dowling’s study used contour and intonation distractors: stimuli similar to the target melody. Contour-distractors were more often confused with the target melody than intonation-distractors. Early ethnographic research on melodic contour types focused on identifying contours in different musical cultures (Boer and Fischer, 2011), and mapping the frequency of contours in contour typologies.
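Patel’s definition translates almost directly into code. The sketch below reduces a pitch sequence to a Parsons-style contour string of ups, downs, and repeats, discarding interval size; the pitch values are illustrative MIDI numbers.

```python
def contour_string(pitches):
    """Reduce a pitch sequence to its ups/downs/repeats (Parsons code)."""
    step = {1: "u", 0: "r", -1: "d"}
    return "*" + "".join(step[(b > a) - (b < a)]
                         for a, b in zip(pitches, pitches[1:]))

print(contour_string([60, 62, 64, 60, 60]))  # *uudr
```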

Some mysteries of melodic contours are relevant to this discussion. That contours are a coarse-level feature acquired very early in childhood is well known. Infant-directed (ID) speech contains highly exaggerated contour profiles compared to adult-directed (AD) speech. Experiments show that while ID and AD speech do not differ in prosodic shape, the contour profiles themselves contain a high level of emotional exaggeration in ID speech (Trainor et al., 2000). Acquiring coarse categories for prosodic meaning is important for the verbal, and by extension melodic, development of children. Melodic contours also play an important role in emotion detection in speech. In his research, Ross studied a case with clinical difficulty in processing affective speech prosody (Bell et al., 1990; Ross, 1993). Huron also points, on the one hand, to the co-occurrence of musical acuity and social development in genetic disorders such as Williams syndrome; and, on the other, to the connection between autism, a proclivity for absolute pitch perception, and difficulty in ‘getting into’ music. Melodic perception is thus essential to understanding affect in speech and music.

2.3.2 Organization

The organization of pitch and pitch structures is a key factor in determining the ‘musicality’ of pitched material. The organization of pitch structures includes, broadly, the following components: key, tonality, temperament, scale, and grammar.


Key and Tonality

A musical key refers to the perception of adherence of a piece of music to a single tone, around which the scale and other notes in the music seem to revolve. In tonal music, this single tone is the tonic, and the property of adherence is described as tonality. The tonic, or key center, also appears to be the most stable pitch level: the one at which there is a sense of ‘resolution’ in the music.

Tonality refers to the arrangement of pitches in a hierarchical order of perceived relationships and stabilities. It also refers to an understanding of a stable ‘key’. When a melody is constructed in a tonal framework, tonality dictates how and where a melody rests, and how its constituent notes are interrelated. A large amount of research in this area concerns the framework that phrasal grammars make up. Many experiments on octave perception have tried to understand the perceptual organization of pitch classes.

Temperament

The intonational relationships between different tones in a melodic scale are referred to as temperament. Intonation deals with the ratios between different notes in an octave. In equal temperament, for example, the octave is divided into 12 equal parts, so that every note is more or less equally out of tune. Cuddy (1982) designed experiments to identify the interference of logarithmic and linear temperaments on contour perception in absolute-pitch and non-absolute-pitch listeners. She found that contour and temperament affected the recognition of melodies, but that listeners apprehend contour even when they encounter unexpected intervals.
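The arithmetic behind ‘equally out of tune’ is easy to show. Under the stated assumptions (12-tone equal temperament, A4 = 440 Hz), every semitone has the same frequency ratio, so the equal-tempered fifth lands slightly below the just ratio of 3/2:

```python
SEMITONE = 2 ** (1 / 12)   # the single ratio that generates 12-TET

def twelve_tet(f0: float, steps: int) -> float:
    """Frequency `steps` equal-tempered semitones above (or below) f0."""
    return f0 * SEMITONE ** steps

fifth_ratio = twelve_tet(440.0, 7) / 440.0
print(fifth_ratio)     # 1.4983..., vs the just fifth 3/2 = 1.5
print(1200 * 7 / 12)   # 700 cents, about 2 cents flat of the just 701.96
```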

Scale

Scale refers to the arrangement of intervals in tonal melodies. In western music, major and minor scales are most commonly used; however, this is not representative of most music in the world, which features a large diversity of intonations and scales.

If the mode of presentation of a melody is changed—for example, Happy Birthday is played in a minor key—we can discern it as the same melodic sequence but with a different ‘flavor’. An iconic melody changes its identity more if the contour is dramatically different than if the scale or mode of presentation is altered; in the latter case, the melody retains its contours. Studies by Dowling (1978, 1972) show that contour is the principal factor in identifying melodies. Dowling also shows that inversions, retrogrades, and other melodic operations make the same melody harder to recognize, indicating that contour, or the time-unfolding properties of pitch, takes precedence over scale in the recognition of melodies.
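The operations Dowling tested are simple symbolic transformations. Here is a minimal sketch, with pitches as MIDI numbers, of retrograde and inversion, both of which alter the contour, unlike transposition:

```python
def retrograde(pitches):
    """Reverse the order of the notes."""
    return pitches[::-1]

def inversion(pitches):
    """Mirror every interval around the first pitch."""
    p0 = pitches[0]
    return [2 * p0 - p for p in pitches]

theme = [60, 64, 67, 72]
print(retrograde(theme))  # [72, 67, 64, 60] -- contour reversed
print(inversion(theme))   # [60, 56, 53, 48] -- contour flipped
```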


Grammar

Expectations of tonality rely on our exposure to certain musical grammars. ‘Probe tone’ experiments are often used to test tonal expectations; for instance, an incomplete melody is completed using a probe or question tone, and participant ratings help us understand how the question tone is perceived in the context of the melody (Tillmann et al., 2000; Tillmann and Bigand, 2010). Cross-cultural probe-tone experiments have helped reveal the extent of cultural learning for phrase completion (Curtis and Bharucha, 2009), while neurological activation studies involving probe tone experiments have also identified brain areas tracking tonality (Janata et al., 2002). Some studies have also found that the tonic is not uniquely stable in major-mode melodies (Curtis and Bharucha, 2009; West and Fryer, 1990); the dominant and subdominant were also rated equally high in some cases. Eerola et al. (2002) tested two sets of melodies from two sources, 40 melodies in total, of which 27 were isochronous sequences. They selected factors from probe tone experiments that are known to influence melodic expectations. Isochronous melodies created using a generative model based on ‘typical transition probabilities’ were tested in experiments asking participants for continuous ratings of the predictability of melodies.
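To give a flavor of how such a generative model can work, here is a minimal first-order Markov sketch. The transition table below is invented for illustration; it is not the table Eerola et al. used.

```python
import random

# Hypothetical transition probabilities over a few scale degrees;
# the numbers are made up for illustration only.
TRANSITIONS = {
    "do":  {"re": 0.5, "mi": 0.3, "sol": 0.2},
    "re":  {"do": 0.4, "mi": 0.6},
    "mi":  {"re": 0.5, "sol": 0.5},
    "sol": {"do": 0.7, "mi": 0.3},
}

def generate_melody(start: str = "do", length: int = 8, seed: int = 1):
    """Sample a degree sequence from first-order transition probabilities."""
    rng = random.Random(seed)
    melody = [start]
    for _ in range(length - 1):
        options = TRANSITIONS[melody[-1]]
        melody.append(rng.choices(list(options), weights=options.values())[0])
    return melody

print(generate_melody())
```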

To model the phrasal grammars of western tonal music, several important ideas have been proposed, but most of these models target a particular musical style, culture, or time period. For example, the Generative Theory of Tonal Music (Lerdahl and Jackendoff, 1987) takes a linguistic analysis approach to western tonal composition. Melodic grammars have also been proposed for understanding specific composers or their work, such as for Bach chorales (Baroni and Jacoboni, 1978). Schenkerian analysis, originally developed to analyze tonal music, has been modeled computationally as a context-free grammar (Temperley, 2011). An influential model of melodic grammar that also incorporates ideas from music cognition as a whole is Narmour’s model of melodic expectancy (Narmour, 1992), also called the Implication–Realization (IR) model. The core fundamentals of this model are that any melodic interval that is not perceived as closed is an implicative interval (Schellenberg et al., 2000, p.296), while the interval between the last tone of the implicative interval and the tone that follows is the realized interval. The theory claims that these implications result from five perceptual predispositions that we learn from exposure to music: registral direction, intervallic difference, registral return, proximity, and closure. As such, Narmour’s (1992) IR model is the most generalized model of melodic grammar, and has been used to investigate cross-cultural melodic expectancy (Krumhansl et al., 2000; Pearce and Wiggins, 2006).

The aforementioned features of key, tonality, temperament, scale, and grammar are organizational features of melodies. The recall and recognizability of melodies is often studied through experiments in music psychology.


2.3.3 Recognizability

We recognize melodies that have a wide range of variabilities, but what is the basis of our recognition? Research in psychology has shown that scale and contour influence our memory of melodies. Explicitly tonal melodies are generally easier to remember than atonal melodies (Dowling, 1978; Vuvan and Schmuckler, 2011; Bod, 2002). Research also supports the enhancement of melodic memory when the stimuli are vocal (Weiss et al., 2012).

Dowling (1978) proposed a model to understand how melodies are stored in long- and short-term memory, using stimuli with scale and contour variations. The first component is the perceptual–motor schema of the musical scale. The second component, melodic contour, is shown to function independently of pitch-interval sequences in memory. Dowling also underlines the importance of contour when repeating melodies from unfamiliar scales.

Melodic Identity

What do we call a recognizable melody? In the preceding sections, I have presented research studies on melodic contour that have primarily been conducted in the West, with western classical music as the main source material. While discussing these studies, I have explained how contour identity helps us recognize melodies despite variations. But how much do melodic properties have to vary before the melody becomes unrecognizable as the original? It turns out that this depends upon musical style or musical culture. Cambouropoulos (2001) discusses this in relation to Quine’s observation about the identity of any object in a discourse. Quine states that objects that are indistinguishable from each other in a given discourse are identical for that discourse (Quine, 1950). Cambouropoulos, extending this discussion to melody, states that a melodic phrase might be identical only to itself if pitch is most important in the musical context, if we imagine a theoretical context where no variation is accepted in melodic identity. But if instrumentation is most important, then the same pitches played on different instruments might not be recognized as the same melody.

I would like to define a melodic entity here as a melodic phrase that we can identify in repeated hearings, and recognize as belonging to a larger melodic framework; for instance, a song. I define a melodic framework as a collection of melodies, including composition rules in some cases, that represent an identifiable style. A symphonic piece that features a thematic melodic entity might represent a framework. Other examples of melodic frameworks include grammatical arrangements of melodies, such as in a raga or makam; a tune family, such as those found in Irish folk music; or a style of improvisation, such as in an era of jazz. Elements of personal style can also be recognized as melodic frameworks, while other symbolic references may be attributed to melodic frameworks; for example, a general descending pattern might represent sadness in some contexts. A melodic entity could belong to one of many melodic frameworks, as I have illustrated in Figure 2.3.

Figure 2.2: A melodic entity might be stable, as discussed above, to several kinds of variation (duration variation, transposition, ornament smoothing, ornament addition, scale and intonation variation, rhythmic variation, and text variation). Which kinds of variations are most relevant depends upon the musical culture within which a similarity judgment is to be made.

I make a distinction here between melodic phrase and melodic entity. A melodic phrase can be defined as a melodic unit that has a self-contained quality. That is, a melodic phrase can be understood as independent. Every melodic phrase may or may not be a melodic entity, as I have described above, but this is purely related to the use of a phrase in a particular context, and not its structural properties.

In Figure 2.2, I elaborated upon variations that are tolerable in most musical cultures across repeated hearings of the same melodic entity. Most often, variations in the transposition of melodies and in the simplification of ornaments do not affect the recognition of a melodic entity. In some cases, melodies with lyrical variations might be treated as essentially the same. Melodic entities with changes in rhythm and duration may be treated as versions of the same. Finally, melodic entities may be treated as the same even with the addition of ornaments and embellishments.

Based on the assumption that melodic entities have acceptable variations, as described in Figure 2.2, we can ask which cultural or functional contexts tolerate which of these variations. I present some examples below. Even though practitioners often study the melodic contours of music through annotated notes on paper, the majority of musical cultures in the world rely on learning ‘tunes’ as a key part of the tradition. This means that melodic entities are subject to a large number of variations, and their integrity is largely determined by what practicing musicians decide is or is not the same melody. James Cowdery, writing about Irish folk tunes, puts it succinctly:

“How should we characterize this entity? The problem is academic and not practical: a folk musician is content to call “The Blackbird” a tune. The scholar, however, realizing that all musicians play it slightly differently and that many never even play it identically twice, eventually comes to the same impasse that Gavin Greig described. Like him, we must ask, “Then where is the tune?”” (Cowdery, 1990, p.44)

Figure 2.3: A melodic entity can be recognized as belonging to one of several melodic frameworks (a tune family, a musical piece, a scale grammar family, a musical genre, a composer’s style, or some other codification). A framework might mean different things according to the musical context, as illustrated.

While discussing tunes that are orally transmitted, Bronson (1951) notes variations in American folk music traditions. Other, more computational models have also tried to describe variations in melodic entity: Bohak and Marolt (2009) calculate folk song variations, and Volk et al. (2007) use rhythms to identify various folk songs. Savage and Atkinson (2015) also analyze tune families using sequence alignment methods.

While a tune in orally transmitted folk melodies can admit these types of variations on melodic entity, what happens if the melodic variation is restricted by the lyrical style? An example of this can be found in Cantonese opera. Cantonese is a tonal language, which means that all vowels have pitch levels, or contours, that distinguish them from each other. Cantonese has six different tone shapes, which are categorized into two main families: ‘light’ and ‘dark’. Within these categories, there are three shapes for rising, falling, and flat tones. This affects the melodic formation of sung poetry in Cantonese, and has a direct effect on melodic contour (Yung, 1991).

Compositionally, Cantonese opera uses a limited number of ‘tunes’ to accompany different texts. The structural identities of these tunes are discussed in terms of melodic contour in books on Cantonese opera (Yung, 1989); Yung compares melodic and vowel contours, remarking that melodic contours in these arias often follow the contours of the vowels.

Contours are an important part of learning improvisation in the two largest classical music traditions in India—Hindustani and Carnatic music. A framework in these traditions, called the raga, provides scale material as well as melodic grammar to an improvising musician. There are hundreds of ragas that have individual scale and grammar combinations. Establishing a raga’s identity relies on the use of scales, which often have different notes in ascending and descending contours. Rules and ornamental features, which can be loosely described as a compositional grammar, must be remembered while presenting each individual raga. There are many instances of different ragas accommodated in the same scale, where the distinguishing factor is the unique melodic grammar.

Within the Hindustani tradition, there are four recognizable contour types mentioned in theory. Together, they are called varnas. The four contour types are ascending, descending, flat, and varying. Although descriptions of these contour types exist, they are usually included in the study of ragas as general guidelines rather than as formal analytical methods.

The accompanying hand gestures in improvised singing have been studied using different approaches, from anthropological to cognitive (Clayton and Leante, 2013; Pearson, 2016; Paschalidou et al., 2016; Kelkar, 2015). Studies reveal that contour properties and contour metaphors play a role in our cognition of melodic nuance, and in communicating this nuance during improvisation.

Melodic identity can thus be defined in various ways, and computational applications in which people evaluate melodic similarity usually operate within one of the above contexts. For example, one task might look for variations of folk tunes, where the ground-truth data is a tune label, while another might identify a raga based on pitch contours.

Melodic Similarity

In their 2005 paper, Hoffman et al. review a large number of music information retrieval (MIR) papers on melodic similarity. They classify the approaches into: 1. contrast models, 2. distance models, 3. dynamic programming, and 4. transition matrices.

A more recent paper by Cheng et al. (2018) evaluates parameters in a recurrent neural network (RNN) for melodic similarity analysis. The authors compare string edit distance, n-gram based measures, alignment-based methods, and RNN features using cosine distances over RNN parameters. The authors also note that changes in scale and duration do not impact melodic similarity; similarity measures appear to be invariant to musical transformations such as changes in pitch and tempo (Müllensiefen et al., 2004; Urbano et al., 2012). For this, the authors state that using symbolic music, as opposed to the audio signal, might be simpler.
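As a concrete reference point for the string-based measures mentioned above, here is a standard Levenshtein edit distance, applied here to Parsons-style contour strings; a minimal sketch, not the specific implementation used in any of the cited papers.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions/deletions/substitutions."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return dp[m][n]

# Two near-identical contours differ by one inserted "down" step:
print(edit_distance("uudr", "uuddr"))  # 1
```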

Dalla Bella et al. (2003) found that people are generally good at recognizing familiar melodies on hearing only a few notes. This happens regularly while listening to the radio or music from any unfamiliar source—we do not wait for the whole melody to unfold to identify it. The authors found that with just five to seven notes, both musicians and non-musicians are generally able to identify the correct melody.


2.3.4 Succession

How are we able to separate melodies into different phrases, and what defines a phrase? Several models have been proposed to automatically segment long melodic sequences into phrases. The idea of melodic succession relies on the separation of a stream of continuous melodies into constituent phrases. However, all these phrases are not remembered equally, nor are they equally important, and we have several concepts to define and determine the phrases that matter more to us, such as a theme, a motif, an idea, a leitmotif, and so on.

White (1960) applied 12 types of distortions to familiar melodies to find out which of these distortions was most effective in obscuring melodic identity. Many of the distortions are common compositional tools used to expand and grow thematic material. White found that linear transformations were least likely to disrupt identification, while nonlinear transformations and temporal reversal were most disruptive. This indicates that succession is a core element of melodic cognition. He also found that melodies were just as easily identified from the first six notes as they were from the first 24 notes.

Frankland et al. (2004) present experiments that compare the empirical parsing of melodies with predictions derived from the four grouping preference rules of GTTM, the generative theory of tonal music (Lerdahl and Jackendoff, 1987). Participants were asked to annotate melodic segments, and the authors found within-subject consistency of boundary placements across three repetitions.

Pearce et al. (2010) present a model for a phrase segmentation task, summarize the musicological and psychological background to the task, and review existing computational models. They propose a new model for phrase segmentation called IDyOM, based on statistical learning and information dynamics analysis. In another paper, Pearce et al. (2008) introduce a melodic segmentation model based on an information dynamics analysis of melodic structure. The performance of the model is compared to several existing algorithms that predict annotated phrase boundaries in large corpora of folk music.

Other phrase segmentation models include: GTTM (Lerdahl and Jackendoff, 1987), the local boundary detection model (LBDM) (Cambouropoulos, 2001), Temperley’s grouper model (Temperley, 1999), phrase structure preference rules (PSPRs) (Temperley, 2011), Tenney and Polansky’s temporal Gestalt model (Tenney and Polansky, 1980), the melodic density model by Ferrand et al. (2003), and Bod’s supervised learning approach to analyzing grouping in melodic boundary detection (Bod, 2002). Cambouropoulos (2006) also accounts for musical segmentation from a given melodic ‘surface’: the beginning and ending points of prominent repeating musical patterns influence the segmentation of a musical surface, and the discovered patterns are used to determine probable segmentation points in the melody.
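As an illustration of the kind of local cues these models exploit, the sketch below computes a toy boundary-strength profile from pitch leaps and inter-onset intervals. It is loosely inspired by the LBDM’s use of local change, but it is not Cambouropoulos’s published rule set.

```python
def boundary_strengths(pitches, iois):
    """Toy boundary-strength profile: larger pitch leaps and longer
    inter-onset intervals (IOIs) hint at phrase boundaries.

    `iois[i]` is the time between the onsets of notes i and i+1.
    """
    leaps = [abs(b - a) for a, b in zip(pitches, pitches[1:])]

    def norm(xs):                 # scale each cue to [0, 1]
        top = max(xs) or 1
        return [x / top for x in xs]

    return [(l + d) / 2 for l, d in zip(norm(leaps), norm(iois))]

pitches = [60, 62, 64, 72, 71, 69]
iois = [0.5, 0.5, 1.0, 0.5, 0.5]            # seconds between onsets
print(boundary_strengths(pitches, iois))    # peaks at the leap + long IOI
```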


2.3.5 Shape

The fifth defining attribute is melodic shape. I will discuss the phenomenological understanding of a shape later in Chapter 3. Melodic shape is often initially understood in the context of pitch direction. Henceforth, I will use shape to mean melodic contour. Melodic contour is described as the general shape of a melody or a melodic phrase (Kennedy et al., 2013). Some words that are used interchangeably with contour are shape, configuration, outline, or simply up–down movement. For instance, Adams (1976) uses melodic configuration and outline to mean melodic contour. In this thesis, I will refer to the shape properties of melodies as melodic shape and melodic contour. Some contour typologies are compiled in Figure 2.4, based on previous research (Hood, 1982; Adams, 1976; Seeger, 1960; Schaeffer et al., 1967b; Vasant, 1998). It is evident that all these contour shapes are represented as lines that move primarily from left to right, with the direction of pitch represented vertically.

Figure 2.4: Range of melodic and sound contours from previous studies: Seeger’s (1960) types (xx, xy, xyy, xyx); Schaeffer’s types (impulsive, iterative, sustained; Schaeffer et al., 1967a); the varnas (ascending, descending, stationary, varying; Śarma, 2006); Hood’s (1982) types (arch, bow, tooth, diagonal); and Adams’s (1976) repetition and recurrence.

Pitch and Verticality

Pitch is often discussed as rising and falling. However, the up–down metaphor for pitch is not present in every language. In fact, several languages use other binaries to represent pitch—bright–dark, thin–thick, small–large, light–heavy, and so on—as described by Shayan et al. (2011). Some studies have investigated cross-modal pitch correspondences by studying pre-verbal infants, who tended to look longer at visual stimuli in which an object moving up corresponded with a sound that rose in frequency (Walker et al., 2014).


Rusconi et al. (2006) proposed the SMARC effect (Spatial–Musical Association of Response Codes) to describe our tendency to associate high pitch with the upward direction. This was investigated in a cross-comparison between English and Catalan speakers; in the experiment, the words for spatial elevation and pitch description were either congruent or incongruent (Fernandez-Prieto et al., 2017). Results showed longer response times when the word and the direction of spatial elevation were incongruent. Even though infants prefer congruent stimuli—that is, a rising pitch with a rising motion, and a falling pitch with a falling motion, are more engaging—it takes longer developmentally to be able to discriminate between ascending and descending pitch stimuli. In work by Stalinski et al. (2008), people in five age groups (children aged 5, 6, 8, and 11, and adults) were given three-note sequences and asked to identify a difference where only the middle tone was higher or lower. Performance improved from the five- to the eight-year-olds, reaching the adult level at eight years. For all age groups, noting whether two pitch profiles were the same or different was easier than direction identification. An earlier study similarly reports that “Twenty-two percent of the children gave verbal or gestural responses to the examples that were considered correct. However, only 5 percent of the children used the traditional music terms ‘up and down’ or ‘high and low’ correctly” (Hair, 1977).

These studies seem to show that although the association of pitch and vertical height might precede language acquisition, the embedding of pitch descriptors as vertical motion establishes this further. However, we must note that most of these studies were carried out using synthetic ‘pitch’, usually sine tones or MIDI notes played on a piano. As such, matters can get complicated once we enter the domain of melodic phrases, especially those that are sung.

Phrases themselves are categorized in many languages and cultures as having inherent contour properties, which means that their internal pitch relationships are interpreted as shapes. As such, we may ignore some notes in favor of preserving structure. For example, a phrase with a long appoggiatura may simply be interpreted as bow-shaped due to the weight of the last tonic landing, even if it is small in duration. On the other hand, a vibrato with a constantly changing pitch parameter of even up to a tone might be interpreted simply as a single note—a stable, horizontal, line-shaped contour. This means that in addition to pitch properties, some kind of weighting of pitch values based on context is essential to understand the shapes of melodies.
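One simple way to operationalize such weighting is to let each pitch contribute in proportion to its duration, so that brief ornaments barely move the perceived shape. The sketch below is speculative: an illustration of the idea, not an established algorithm.

```python
def duration_weighted_pitch(notes):
    """Average pitch of a list of (pitch, duration) pairs, weighted by
    duration. A short appoggiatura barely shifts the value, mirroring
    the intuition that brief ornaments carry little contour weight."""
    total = sum(dur for _, dur in notes)
    return sum(pitch * dur for pitch, dur in notes) / total

# A 0.1 s upper ornament before a 0.9 s main note:
print(duration_weighted_pitch([(72, 0.1), (67, 0.9)]))  # 67.5, close to 67
```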

Notation

In the earliest notation systems in western classical music, tiny squiggly lines above texts indicated the melodic contours that were to be sung to the text, in order to remember the melodies of the songs. Gradually, this is said to have evolved into an indicator line, which set the stage for modern notation, as shown in Figure 2.5. This indicator line later developed into the several lines of the staff—each position on or between the lines relates to a fixed pitch. Ornamental symbols are still notated using shapes; for example, the trill, mordent, and glissando all have symbols that look like representations of the contour shape.


Figure 2.5: On the left, we see Palestrina’s “Iubilate deo universa terra” psalm verses in neumes, first published in 1593 in Offertoria totius anni, no. 14. It is argued that neumatic notation is derived from cheironomic hand gestures indicating changes in pitch. On the right is an illustration of how neumes evolved into mensural notation, and finally into modern notation.

Modern notation with staff lines, and by extension MIDI-based notation, forms the majority of data sets for musical content retrieval and analysis. Contour analysis typologies are also based on notation, with discrete pitch and duration values. However, not all music can be transcribed into staff notation. Moreover, modern composers within the classical tradition have relied heavily on graphical notation to communicate musical ideas. My favorite example of this is John Cage’s vocal piece Aria, where the notation is a collection of flowing, colorful, squiggly lines that are to be interpreted by a singer. We also find the use of contour-based notations in Tibetan chants, as shown in Figure 2.6 (Collection, 2019). Material in these chants is often rhythmically stretched.

2.4 Other Aspects of Melody

So far, I have explored the idea of melody by its defining aspects of pitch, organization, recognizability, succession, and shape. There are other aspects that influence melodic perception that I would also like to highlight. Primarily, this includes the special place of vocal melodies, and of vocal learning, in the cognition of melodies.

2.4.1 Voice and Melody

The definition of melody centers on our ability to sing—in essence, the ability to collapse a large quantity of acoustic and musical information into one hummable line. When the melodic line is clearly differentiated from other instruments, this task is trivial. But there are several musical cases where the melodic line is shared among many instruments, two or more melodic lines exist, or the timbre of the melody is unclear, and so on.


Figure 2.6: Tibetan Yang notation from silkscreen prints, ca. nineteenth century. This form of notation goes back to the sixth century (Collection, 2019).

An early study observing spontaneous singing by children was conducted by Moorhead and Pond (1941, p.2). Some of their main findings are interesting to begin this discussion with:

“(a) Experimentation with vocalizations, and with songs, was the norm, and children do not organize their music in conventional tonalities as adults do, (b) children’s songs were plaintive compared to the songs adults would have them sing, (c) chant appeared under conditions of freedom—alone or in a group, (d) physical activity was directly related to rhythmic chant, (e) solo chants were like heightened speech in which the music conformed to the words, group chants were in duple meter and the words conformed to the rhythmic structure, (f) children explored instruments first before melodic and rhythmic patterns emerged, (g) instrumental improvisation was characterized by asymmetrical meter, followed by duple and triple meters, then steady beat.”

Comparison studies of spontaneous singing in children are hard to carry out because of various differences between children’s musical and verbal abilities, regardless of age. Spontaneous songs are also harder to keep track of and record.


However, human beings need exposure, play, and experimentation to develop their vocal apparatus. This is explained by the vocal learning hypothesis.

Vocal Learning Hypothesis

The main claim of the vocal learning hypothesis is that human beings (as well as songbirds) learn from a process of vocal imitation, relying immensely on auditory feedback. This is in contrast with contextual learning in vocal communication, where we learn to identify or react to sounds as a result of our experiences with them. Early vocalizations have little to do with mature song or speech, but form the building blocks of language cognition and musicality (Hansen, 1979).

Research has also demonstrated that we are better at remembering vocal melodies than instrumental melodies (Trehub et al., 1984; Weiss et al., 2012, 2015). This finding has been tested time and again with speakers of different languages, and it has been replicated across age groups, levels of expertise, and so on. This implies that the voice is functionally and biologically significant. Studies also suggest that there is greater pupil dilation in response to vocal stimuli, indicating increased attention or arousal (Weiss et al., 2016).

Yet another interesting aspect of the voice is the margin of error we tolerate in the precision of vocal intonation. Dubbed the ‘vocal generosity’ effect, it describes our greater tolerance of intonation differences in the voice compared with other instruments (Hutchins et al., 2012). Perhaps the note positions in exaggerated vibratos are a by-product of such an effect.

Subvocal Rehearsal

Subvocal rehearsal, or the covert rehearsal of speech and sound, is an important aspect of auditory working memory. Studies have shown that suppressing subvocal rehearsal directly affects our capacity to remember sequences (Rizzolatti and Craighero, 2004; Jacquemot and Scott, 2006; Kohler et al., 2002). Recent developments in electromyography have also enabled us to build systems that recognize subvocal speech (Jorgensen and Binsted, 2005). I offer more on this in the section on auditory working memory in Chapter 4.

Birdsong

The study of birdsong changed in the 1950s, after World War II, when the sound spectrograph became available (Marler and Slabbekoorn, 2004, p.1). Since then, it has become possible to represent the fundamental frequencies in birdsong using ‘sonograms’. One reason that birdsong is relevant to music and melodic cognition is that both birds and humans are ‘vocal learners’. This means that songbirds require auditory input to learn ‘normal’ birdsong.
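For readers who want to see what a ‘sonogram’ amounts to computationally, the sketch below (assuming NumPy and SciPy) computes a spectrogram of a synthetic rising sweep and reads off the frequency ridge, which traces the pitch contour much as birdsong sonograms do.

```python
import numpy as np
from scipy.signal import spectrogram

sr = 22050
t = np.arange(sr) / sr                                  # one second of audio
sweep = np.sin(2 * np.pi * (1000 * t + 250 * t ** 2))   # 1000 -> 1500 Hz chirp

freqs, times, power = spectrogram(sweep, fs=sr, nperseg=1024)
ridge = freqs[power.argmax(axis=0)]   # strongest frequency bin per frame
print(ridge[:3].round(), ridge[-3:].round())  # the ridge rises over time
```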

There is, however, a critical difference. In a 2016 paper, Bregman et al. measured the ability of starlings to recognize pitch sequences with the spectral structure removed (Bregman et al., 2016). Humans are generally good at recognizing transpositions of melodies as instances of the same melodies. However, this paper finds that spectral shapes, rather than melodic contours, predicted whether starlings recognized melodies. Shannon (2016) argues that this indicates that birdsong is more analogous to speech perception than to music, although it is clear that humans and birds may have evolved to develop an acuity for melodic perception, albeit in different ways.

2.5 Discussion

Several studies have remarked on the nature of signification of melodic contour in a number of song styles. Mangaoang et al. (2018) write, in their analysis of Philippine disco-opera, how the contours of the heroine’s song are constructed to remind one of Disney princess songs. Tolbert (1990) writes about Karelian lament songs, describing descending, terraced contours as a contour-motivic device to communicate grief in these songs. Researchers have proposed and demonstrated the links between music perception and the body time and again (Leman, 2008; Godøy, 2003; Gritten and King, 2006, 2011; Godøy, 2010; Jensenius et al., 2006), as I will discuss in the following chapters.

In the sections above, I have established that contour is a feature unique to melodies in the following ways:

• it is important for remembering and indexing melodies

• it is a low-level feature acquired early in childhood

• it is closely related to prosody and speech–affect comprehension, and

• contour stability is closely related to the perception of melodic stability.

Perception of pitch and melody are highly interrelated, and in this thesis I want to understand melodic contour and its perception through the body. I think a systematic study of this can shed light on cross-modal mechanisms, and help us understand pitch perception in the context of melody. The objective is, thus, to focus on the perception of dynamic pitch. In this chapter, I hope to have established that melodic perception is the ecological site for studying dynamic pitch perception. As such, I have demonstrated some peculiarities of melodic cognition to suggest that melodies are understood as holistic entities.

2.6 Summary

In this chapter, I have presented a review of melodies in speech, and then of musical melodies. I have described their essential properties, and how melodic contour plays an important role in melodic recognition and identification in both speech and music, although the study of speech contours and of musical contours typically takes different approaches. Identifiable melodies in speech and music, laid out on a spectrum, demonstrate the various levels between the binary of speech and musical melodies. Melodic entities, and their belonging to various melodic ‘families’, are culturally determined, as a result of which their similarity and categorization also have various flavors. Vocal melodies are special, and melodic character is contingent on the ‘hummability’ of melodic lines. Finally, I have highlighted how acknowledging the cross-modal imagery of contour as shape can contribute to our understanding of melodic perception. In the next chapter, I will discuss auditory and motor imagery, going further into this cross-modal connection.


Chapter 3

Auditory and Motor Imagery

Shape without form, shade without colour,
Paralysed force, gesture without motion;
...
Between the motion
And the act
Falls the Shadow

- The Hollow Men, T. S. Eliot, 1925

3.1 Introduction

In The Hollow Men, one of the most famous poems of the twentieth century, T. S. Eliot emphasizes the gaps between the ideas of shape and form, gesture and motion, idea and reality, and motion and action, saying that between these ideas lies uncertainty. Having shape and gesture without form and motion, lacking intentionality, becomes a characteristic of the ‘hollow’ men. According to the poem, it is the act and quality of movement that defines one’s “human-ness”, or how much of a person one is.

The central aim of this thesis is to explore the multimodal relationship between melody and motor imagery, by discussing the cognition of melodic contour as a trace. In order to discuss the various aspects of this relationship, I would like to first discuss how humans perceive shapes within the visual and motor modalities, and why the notion of shape is essential to cognition.

Imagery, or the reproduction of multimodal sensations in the mind in the absence of stimuli, forms the basis for the theoretical framework used in this chapter. Psychologists have been studying mental imagery from the time of the Ancient Greeks to the Modernists (Kosslyn et al., 2006, p.4), although much of the research in this area has dealt with visual imagery rather than other modalities. In this chapter, I will focus my attention on auditory and motor imagery, and establish the context for auditory and motor cognition within the phenomenology of imagery.

3.1.1 Terminology

The following terms occur frequently while discussing visual and motor imagery in relation to contours:

1. Line: Contrary to the notion of a line in Cartesian geometry, where ‘line’ refers only to straight lines, line here simply refers to a continuous path of points. In this sense, a melodic line refers to the continuous arrangement of notes in succession.


2. Shape: Shape generally refers to a geometric arrangement—an autonomous entity accompanied by a mathematical definition. However, when referring to melodic shape, I allude more to the shape-like properties of individual melodies, as invoked in a geometrical sense.

3. Trajectory: In the context of motion capture studies, trajectory in this thesis refers to the path that a particular marker traces in space. As such, it is different from a line because it refers to the actual motion history of a moving agent, while a line refers more to the intended trajectory.

4. Contour: The contour properties of melodies include the visual and gestural images invoked by directional features.

3.2 Auditory Imagery

It is a very common experience to be able to hear a tune, or somebody’s voice, in one’s head. Just as the ‘mind’s eye’ is said to represent visual imagery—what we see when we are imagining something visually—the ‘mind’s ear’ is a term used for auditory imagery and, when applicable, musical imagery (Godøy and Leman, 2010). Many studies have demonstrated the activation of the auditory cortex when the mind is engaged in auditory imagery (Zatorre and Halpern, 2005). Recalling melodic excerpts engages the auditory cortex in a similar way, as was demonstrated earliest through PET scans of the brain (Halpern and Zatorre, 1999).

Contour representations and auditory imagery have been explored in experimental studies on the recall of melody (Reisberg, 2014, p.2). The debates on mental imagery revolve around the nature of internal representations of melodies, and how these representations are accessible to us.

3.2.1 Auditory Scene Analysis

It is often said that unlike the eyes, the ears have no lids. As such, we are constantly hearing what is happening in our environment; we can locate ourselves in space; and we understand the density of moving objects around us, the sounds they create, and their material properties. We can focus auditory attention on a specific stream within the spectrum of sounds in our hearing. In his book Auditory Scene Analysis, Bregman (1994) attempts to explain how we segregate an ‘auditory scene’ (the unfiltered auditory information that we hear at any time) into separate ‘auditory streams’. For example, we are able to segregate the sound of someone speaking from the noise of a fan in the room. Further, we can distinguish a known voice from a cacophony of many simultaneous speakers. At the same instant, we can also identify the speaker, and perhaps even their temporary physiological state; for example, we know from their voice if they are joyful or tired. Bregman’s analysis is about the auditory stream, and not explicitly about musicality, but it illuminates the analysis of musical hearing as an auditory stream.


3.2.2 Ecological Pitch Perception

To address pitch perception from an ecological perspective, we have to ask to what end contour memory might have been selected. What advantages does it give us? The way we perceive a fundamental pitch in a complex sound appears related to how the partials in a sound are distributed in terms of loudness. When we hear strong or loud harmonics, we tend to recognize the lowest loud partial as the fundamental. A musical illusion called ‘the missing fundamental’ shows how we perceive a fundamental when the harmonics are strong, even if the fundamental partial is removed from the spectrum—our cognition compensates for this missing element.
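The missing-fundamental illusion can be reproduced numerically: a tone built only from harmonics 2–5 of 200 Hz still has a 200 Hz periodicity, which an autocorrelation analysis like the one sketched in Chapter 2 recovers. A minimal demonstration, assuming NumPy:

```python
import numpy as np

sr = 22050
t = np.arange(sr // 5) / sr
# Harmonics 2-5 of a 200 Hz tone; the 200 Hz fundamental itself is absent.
tone = sum(np.sin(2 * np.pi * 200 * k * t) for k in range(2, 6))

ac = np.correlate(tone, tone, mode="full")[len(tone) - 1:]
lag = 50 + int(np.argmax(ac[50:400]))  # search lags for ~55-440 Hz
print(round(sr / lag))                 # 200 -- the "missing" fundamental
```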

Isolated tone perception and melodic perception are to be understood as separate because of many temporal and perceptual factors. First, upon rapid presentation, individual tones seem to fuse together to form iconic melodic motifs that we remember much better than series of tones whose succession is artificially broken or disturbed. As de Cheveigne writes, absolute pitch in humans is rare, although it can be acquired with a lot of training. The computationally harder and more abstract task of analyzing relative pitch, however, is actually a natural ability: we are naturally good at relative pitch and interval perception. However, computational models of pitch detection based on relative pitch remain to be developed (de Cheveigne, 2005).

Melody relies on pitch as its main variational material: changes in pitch over time are primarily what make a melody. Even so, we perceive a plethora of features from melodic stimuli that influence pitch perception. For example, our perception of pitch differs between sounds in which pitch and loudness increase simultaneously and sounds in which this relationship is inverted. It is noteworthy that in debrief interviews of experiments asking participants to trace melodies, participants claim to be tracing pitch, and do not mention timbre, dynamic envelopes, or other sound features. This, however, borders on a circular argument: if we accept pitch as the aspect of melodic music that moves, then we have to rethink how pitch extraction algorithms tend to model the fundamental frequency as pitch.

3.2.3 Conceptual Shapes and Music Theory

Godøy (1999) suggests that thinking of musical sounds as shapes is a valuable way to develop a holistic approach to the analysis of musical sounds. He stresses the importance of holistic approaches to musical sound and musical cognition, which is more than traditional music theory offers. Godøy defines conceptual shapes as the spectral distribution and temporal evolution of various aspects of a musical object.

In ‘Knowledge in music theory by shapes of musical objects and sound-producing actions’, Godøy (1997, p.90) discusses the notion of shapes in musical objects, and how this idea of shape could be used to understand the holistic perception of musical objects. Here, shape is invoked as in common parlance; the drawing of dynamic envelopes is one example, as are the wave forms used to represent different sounds. Godøy suggests that thinking in shapes is already the everyday working method of accessing useful categories in various domains; within sound and music, for example, there are sound envelopes, topological shapes such as helices, and graphical analyses of compositional structures. Melodic shapes, as explained in the previous chapter, are ubiquitous too, and could be explored through analysis-by-synthesis (Godøy and Leman, 2010, p.243).

This thesis examines melody as the site of execution of our cognition of contour. Speech intonation is another area where our cognition of contour is useful to us—it aids our understanding of languages, dialects, affect, and so on. As such, melodic music and melodies are not primarily used in research on the cognition of linguistic contours.

3.3 Gestalt and Shape

Mathematician and phenomenologist René Thom writes, “the first objective is to characterize a phenomenon as a form, as a ‘spatial’ form. To understand means then, first of all, to geometrize” (Thom, 1983, p.6). Thus, we first present most models and architectures geometrically, to give them spatial meaning. To arrange them in relation to one another is to clarify the conceptualization of the model itself, whether it relates to abstract concept visualization or to more concrete and data-driven methods. This is perhaps why we draw figures and conceptual maps to elaborate on concepts in conversation, and why most analytical models have a visual or geometric component.

3.3.1 Melodic Gestalt

Christian von Ehrenfels, one of the founders of Gestalt theory, uses melodies as a starting point to illustrate how Gestalt perception might function, as Leman and Schneider (1997, p.46) have described. Ehrenfels writes:

Let us suppose, on the one hand, that the series of tones t1, t2, t3, ... tn, on being sounded, is apprehended by a conscious subject S as a tonal Gestalt (so that the memory-images of all the tones are simultaneously present to him); and let us suppose also that the sum of these n tones, each with its particular temporal determination, is brought to presentation by n unities of consciousness in such a way that each of these n individuals has in his consciousness only one single tone-presentation. Then the question arises whether the consciousness S, in apprehending the melody, brings more to his presentation than the n distinct individuals taken together. (Von Ehrenfels, 1988, p.85)

Gibson (1960, p.698), while elaborating upon ecological psychology, also refers to the use of melodic stimuli in psychology experiments:

“The Gestalt psychologists pointed out that a melody is perceived, but they never suggested that a melody was a stimulus. The notes of the melody were taken to be the stimuli. But what about the transitions between notes, or the ‘transients’ of acoustical engineering? Are they stimuli? The investigators of speech sounds seem to think so, but the auditory literature of sensation is vague on this question. And if a short transition is a stimulus, why not a long transition or temporal pattern?”

In the Gestalt framework, continuity plays a big role in perception. This means that we are likely to perceive as salient those sensory events or objects that have the semblance of continuous motion, rather than those that jump haphazardly from state to state.

Bod (2002) argues with empirical evidence that Gestalt principles might not be sufficient for a comprehensive theory of melodic grouping, and that several of these principles are flouted when tested on segmentation, or grouping boundaries. For example, he finds that segmentation boundaries can exist even when there is no note change. Further, boundaries can often appear before or after large pitch intervals rather than right at those intervals, suggesting that large pitch intervals do not necessarily indicate phrasal shifts.

3.3.2 Contour Continuity and Gestalt

When our visual system encounters gaps along the edges of incomplete shapes, it is able to fill in the gaps, perceiving subjective contours of objects that are absent in the details of the stimulus.

Mathematical models of ‘mental gap-filling’, or the contour perception of incomplete shape patterns, are built on the following operative principles:

1. Isotropy: The contours that are filled in are insensitive to rotation, translation, and scaling of the figures.

2. Smoothness: Except for corners, the contour apprehension is smooth, or differentiable at least once.

3. Minimum curvature: The filled-in contours have minimum curvature.

4. Locality: The filling-in operates only on local edge shapes, and not on entire edges, in cases of slight deviations.

An illustration of some of these concepts is given in Figure 3.1. Models like this can help us understand how contour filling in the mind’s eye validates Gestalt principles. The models explained here clearly refer to actual, not metaphorical, shapes. However, similar principles should hold true for melodic contour as well.
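These principles can be made concrete with a small interpolation sketch. A natural cubic spline is smooth (twice differentiable) and approximately minimizes bending, so it serves as a stand-in for the smoothness and minimum-curvature principles above; this is an illustration assuming NumPy and SciPy, not a model from the contour-perception literature.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Contour samples with a gap between x = 2 and x = 5.
x_known = np.array([0.0, 1.0, 2.0, 5.0, 6.0])
y_known = np.array([0.0, 1.0, 1.5, 0.5, 0.0])

# 'natural' boundary conditions give zero curvature at the ends; the
# interpolant is C2-smooth and bends as little as possible between knots.
fill = CubicSpline(x_known, y_known, bc_type="natural")
gap_x = np.linspace(2.0, 5.0, 7)
print(np.round(fill(gap_x), 2))   # plausible filled-in contour values
```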

Figure 3.1: A and B represent the property of isotropy in this illusion; the triangle is perceived despite scaling and rotation. In C and D, with the change of an angle, we perceive the third edge of the triangle to be slightly curved. This alludes to the properties of smoothness and locality. The lines at the corners are still straight lines, but the effect is of a curved edge instead.

I wish to make a distinction between the geometric and motion aspects of melodic memory and imagery, on the basis of the extent and types of distortions that we are able to tolerate when it comes to melodic objects. For example, so long as the relative inter-event durations within a melody remain the same, we are able to retain the stability of a melodic object even when the tempo is dramatically different. Further, so long as the relative contour remains constant, we are able to recognize a melody even if it is transposed heavily. Contour properties are thus resilient to temporal compression, expansion, and spectral change in our minds.

In his early writings on shape and morphe, Aristotle discusses grouping the same ‘types’ of objects by their apparent shapes. In Aristotle’s phenomenology of shape, these groupings are used as a measure of salience, and as a way to discriminate between different categories by appearance (Long, 2007). Aristotle distinguishes between the shapes of objects and their formative material, or essence, as two distinct ideas. ‘Dogness’, ‘catness’, and other properties of living beings are also invoked primarily by their external shapes.

3.4 Motor Imagery

Mental imagery is largely described as the recreation of mental or sensory experiences in the mind, without external stimuli; for example, imagining someone’s face, or being able to ‘hear’ a piece of music you like in your ‘mind’s ear’. There have been debates in the cognitive sciences for many years about whether mental imagery is representational or purely experiential. However, neural imaging studies have successfully demonstrated that we engage the same parts of the brain when perceiving a stimulus and when imagining that same stimulus. In other words, when we hear a song in the mind’s ear, we engage the auditory cortex. Enactive theories of cognition explain perception as something an organism is actively ‘doing’, instead of something merely passive (Clarke, 2001). Motor imagery, referring to the mental representation of movement, could be considered necessary to the planning and execution of actions (Jeannerod, 1995).

Visual and motor imagery are thus related. On registering visual information across time, we get a sense of the process of creation of objects, and of their affordances. According to this view, our ability to respond to visual stimuli draws, in part, on this capacity.

3.4.1 Phenomenology of Shape

In common parlance, when referring to shapes, we mean circles, triangles, rectangles, and other common geometric figures that we learn about as children, as the building blocks of our first experiments with geometric abstraction. In geometry, shapes have definite forms that are describable by mathematical rules. However, when we say something is circular, for instance, we do not usually mean a perfect circle. Thus, our own perceptions of mathematically strict shapes can be adaptable to varying contexts. For example, in everyday parlance, every instance of drawing an imperfect circle is understood as being the same shape.

The stability of shapes in our minds is explained through the study of morphogenesis. Bourgine and Lesne (2010, p.2) emphasize that understanding a shape requires an understanding of its formation, or morphogenesis. In this view, such things as fractals, sand dunes, and waves are also shapes, in that they have well-understood geometric forms and morphogeneses. In other words, if we understand how something is formed, we can tolerate variations in that shape. Our perception of human and nonhuman shapes might also be different. Natural formations, such as sand dunes or waves in the ocean, might be perceived more as ‘living shapes’ than man-made formations, such as cubes. Gestalt psychologists use melody as an example to explain what we think of as shape and contour. I explain this more in Chapter 4.

An accessible example to illustrate this point is the use of diagrams in mathematics and computer science. The virtual or unseen is often represented, especially in mathematical education, through movement metaphors and abstract representations of mathematical objects. In studies involving mathematics and gesture, authors have pointed to aspects of virtual shape apprehension that would be impossible without understanding the geometric relationships between objects, and movements that describe the morphogenesis of abstract visual geometry.

The connections between spatial representations, illustrated by gesturing while teaching mathematics, have been written about in detail by Châtelet, who remarks, 'The virtual requires the gesturing hand' (Châtelet, 1993, p.15).


Châtelet insists that the gesture and the diagram are incomplete without each other, and that they each participate in the provisional ontology of the other (De Freitas and Sinclair, 2012). Châtelet suggests that diagrams are 'physico-mathematical' beings.

In this discussion, this notional, phenomenological shape is of interest in the context of the experimental work conducted in this thesis. The idea of musical shape is multimodal, but quite accessible, as I have demonstrated in the chapter on melody. However, understanding how this shape is executed as an action merits a discussion of the phenomenology of shape abstraction and the perception of movement within static, abstract visuo-gestural objects.

3.4.2 Neural Correlates of Phenomenology of Shape

Human beings live in dynamic environments and are in constant motion. To receive and process visual information, our eyes make tiny saccadic movements several times per second, and our heads turn. Despite these jerky eye movements, we successfully integrate images of objects into stable entities, and apprehend the motion properties of objects. Another property of shape perception and imagery is our ability to scale, rotate, and transform objects. According to Biederman (2013), object invariance, also referred to as shape constancy, is the most striking feature of shape perception.

Shape constancy is modeled using shape percepts, or geons, that act as proto-shapes or shape-images; we are able to break down objects that we visually perceive into a finite number of simpler shapes (Biederman, 1987). In his original theory, Biederman tries to understand object categorization through a process of geometrical deconstruction in our visuo-motor system. He proposes a set of volumetric geons (geometric ions) that serve as non-accidental primitives for shape perception, and by extension, object perception.

Shape-based object representation, and its relation to visual perception, has been the focus of several benchmark studies (Kobatake and Tanaka, 1994). These findings have been modeled using neural networks in other studies (Hummel and Biederman, 1992). In a review paper, Biederman (2013) describes theories of shape-based object representation in the inferior temporal cortex and the lateral occipital complex. Lesion and functional Magnetic Resonance Imaging (fMRI) studies of the brain have determined that these two areas exhibit a high degree of invariance to geon representation.

3.4.3 Lines and Movement

How do perceived shapes in musical objects relate to the phenomenology of shape perception itself? Lines that we observe in static images also seem to have a direction of movement. They seem to emerge from somewhere, and go in another direction. Even when this is not explicitly specified through, for example, arrowheads, we tend to think of lines as traversing different visual spaces. Wallach (1935) noted that, without context, the motion direction of a line itself is ambiguous.


It is also important to note that 'lines' in this context do not refer to lines in Euclidean geometry, but to lines as we recognize them in visual perception, that is, lines that we perceive to pass between objects. In the presence of context, the motion direction of a line is determined by the shape of the aperture that it traverses. A simple example of this is the barberpole illusion, in which the rotation of the barberpole makes its diagonal lines appear to move perpendicularly, either upwards or downwards. This effect was later experimentally analyzed by Wuerger et al. (1996).

If we perceive melodies as lines, then the apertures, or canvases, upon which we imagine them have a bearing on our perception of the direction of motion, and would therefore change the way in which we trace the perceived melodic lines themselves. This is why understanding shape perception and shape phenomenology is critical to understanding sound-tracings. Little of the existing literature deals with this explicitly in the context of sound-tracings.

The experiments conducted as a part of this thesis rely on cross-modal conversions between melodic lines and movement. This movement is different from rhythmic musical entrainment in that the synchrony of gestural imagery is less obvious. In the papers that comprise this thesis, we see that people have different associations and strategies to respond to sound with movement.

3.5 Summary

In this chapter, I have elaborated on the principles of auditory and motor imagery, that is, our ability to see and hear 'in our minds'. Imagery is one of the foundational concepts behind picturing melodic contours as lines, and being able to execute a line as movement. Our experience in the world, with overlapping stimuli and intertwined visual and auditory scenes, is discussed in ecological psychology, which is an important theoretical model for this thesis. I outlined how melody as shape is a foundational concept in gestalt psychology, and discussed recent work on motor-mimesis and gestural sensations in audition. Several theories of shape perception were outlined, and the cognitive dimensions of shape perception were explained from a phenomenological perspective. The experiments in the thesis lean on the methodology of 'sound-tracing', which opens the next chapter, proceeding to a discussion of body movement and how embodiment informs music cognition.


Chapter 4

Body Movement

... that mind–body split, you know? The head is good, body bad. Head is ego, body id. When we say “I,” – as when René Descartes said, “I think therefore I am,” – we mean the head. And as David Lee Roth sang ... “I ain’t got no body.”

– Levine (2002)

4.1 Introduction

The mind–body split is a problem that has long engaged philosophers and researchers interested in consciousness. In the traditionalist view of cognitive science, cognition is seen as something that happens in the mind, while action happens in the body. In this thesis, I take embodied cognition as the point of departure, which takes the body, with its active and perceptual capacities, as the starting point. In such a perspective, music listening is also grounded in behavior.

Drawing upon this perspective, Clarke's early work on embodied cognition in music demonstrates the motivation to consider motion as an important aspect of music perception (Clarke, 2001). In recent years, several important works on embodied music cognition, both empirical and theoretical, have been published, as we will see in this chapter.

This thesis combines embodied cognition and music perception with analysis techniques borrowed from music information retrieval, biomechanical control, and interactive music interfaces. In this chapter, I discuss the key frameworks in my study of human body motion and its fundamental role in cognition in general, and in speech and music cognition in particular. These include key frameworks in ecological psychology, auditory scene analysis, embodied music cognition, and gestural imagery related to musical movement. The core methodology used in the experiments for my work is rooted in sound-tracing, and so I will begin this chapter by discussing it.

4.2 Sound-Tracing

Sound-tracing is an experimental paradigm that is used to investigate spatial representations of sound. In a way, our perceptual system acts as a transducer that 'translates' between two modalities. Sound-tracing studies analyze participants' spontaneous renderings of melodies as movement, capturing their instantaneous multimodal associations. Typically, participants are asked to "draw" (or "trace") a short musical excerpt in the air as they listen to it. Several studies have been carried out using digital tablets as the transducer, or medium for recording the data (Godøy et al., 2006; Glette et al., 2010; Godøy et al., 2005; Jensenius, 2007; Küssner, 2013; Roy et al., 2014; Kelkar, 2015).


One restriction associated with using tablets is that the size of the rendering space is limited. Furthermore, as the dimensionality does not evolve over time, tablets capture only a narrow bandwidth of possible movements.

Melodic sound-tracings, or manual renderings of sound-as-shape, are not a task that is taught or rehearsed. A theoretically robust model of sound-tracing must consider the cognition of a melodic entity as a whole. By applying metaphors of motion, on a purely linguistic level, we often ascribe physical object properties to sonic features. Roughness or grain, for example, is the textural property of a rigid body, available to us from our experiences of touching rough objects. As an existing metaphor, it is readily transferable to our perception of texture in a sound object.

An alternative to tablet-based sound-tracing is full-body motion capture. This may be seen as a variation of 'air performance' studies, in which participants try to imitate the sound-producing actions of the music they hear (Godøy et al., 2005). Nymoen et al. carried out a series of sound-tracing studies focusing on hand movements (Nymoen et al., 2011, 2013), elaborating several feature extraction methods to be used within the larger sound-tracing methodology.

A substantial amount of work on music perception and embodiment concentrates on rhythmic entrainment, tapping, and other time-synchronous responses. Approaches to melodic shape cognition that take an explicitly embodied perspective are fewer. Küssner (2014) investigates many properties of sound-tracings alongside their musical attributes. A significant takeaway from this work is the difference in representation between musicians' and non-musicians' tracings, identified through modeling hyperparameters. Non-linearity of mapping was also reported in the tracings: participants mapped an attribute to a physical axis, but they also ended up representing feelings or other characteristics that the music invoked.

The experiments conducted for this thesis deal with motion capture data from participants who trace the shapes of short melodies while listening to them for the second time. This experimental paradigm of sound-tracing is a development based on several theoretical constructs regarding body movement and the relationships between music and action.

In my view, the interactions here are between audition, movement, and geometry. That pitched melodies can be thought of as lines is an interaction of audition and geometry; however, their execution through movements of the hands and body represents an interaction of geometry and movement with time—it involves manual control and action planning. In essence, sound-tracing involves translating the perception of auditory stimuli into an action that has its own geometry, and sometimes even a recognizable shape.

An action or a series of actions performed as a spontaneous rendering of sound can be described as a tracing. We suppose that these movements reflect music-related gestural sonic imagery, and we try to understand how the dynamics of control can influence motor action in sound-tracings.


By definition, sound-tracing is a many-to-one mapping methodology; mapping parameters can vary and are regulated or defined by several occurrences in the music.

4.3 Embodied Music Cognition

Embodied music cognition is a theoretical paradigm grounded in the understanding that the body is critically important to our perception of music. The theoretical framework draws from embodied cognition, in which perception is explained as a feature of embodiment, and in which bodily interaction is seen as central to perception. Embodiment in musical contexts refers not only to active motor responses, such as tapping to the beat, but also to motor constraints and actions that actively frame how we perceive musical stimuli (Lesaffre et al., 2017). In this model, action and perception form a feedback–feedforward loop in which they simultaneously inform and frame each other.

The fundamental claim of the embodied music cognition paradigm is that bodily involvement is crucial in human interactions with music, and therefore also in our understanding of that interaction (Leman et al., 2018). Leman points out that there is no explicit theory in direct opposition to embodied music cognition, that is, one claiming that music and the body are unrelated in music cognition. The embodied music cognition framework nevertheless allows us to make explicit the role of the body in perceiving music. Music perception, in this paradigm, is considered a reconstruction of our bodies' interactions with the world. The notion that sound is closely related to physical materials and actions is central to understanding this concept. We learn early on that sounds have sources. These sources are present in our physical environment, and this physical world informs our acoustic expectations to a great degree.

4.3.1 Early Research on Music Embodiment

Historical work on music embodiment can be traced back to music accompanying exercise (Truslit, 1938), but empirical investigations of embodiment are much newer. One of the first methods to systematically study kinesthetic representation in relation to music and literary works was called Schallanalyse; it was developed by Eduard Sievers. He distinguished two classes of curves—general or 'Becking' curves and specific or filler curves (Taktfüllcurven)—which described types of musical movement based on dynamic expression (related to loudness) and voice type (Shove and Repp, 1995, p.67). Becking and Truslit distinguish between three basic types of movement curves: 'open', 'closed', and 'winding' (Shove and Repp, 1995, p.71). These curves are not conducting movements—they are supposed to be executed with outstretched arms and are a means of portraying dynamics in space; the speed of the movement and the musical tension affect the curvature of the motion path.

In 1977, Clynes developed the notion of essentic forms, which are dynamic shapes that characterize basic emotions (Clynes, 1977). He developed an apparatus called a sentograph, which captures finger pressure in two directions while listening to music.


The study showed that subjects produced different 'sentographs' while listening to music that portrayed different feelings. Clynes also developed a computer program that enabled him to play music with different agogic (related to note-stretching) and dynamic patterns. Later, in 1992, Todd began studying the motion of the whole body rather than just the limbs or fingers. He found evidence that motion and music are interrelated, based on how the ventromedial and lateral systems that control posture and motion respond to music. His study of motoric expression in the form of 'expressive body sway' is significant.

4.3.2 Recent Research on Music Embodiment

Movement that accompanies music is now understood as an important phenomenon in music perception and embodied cognition (Leman, 2008). Research on the close relationship between sound and movement has shed light on the understanding of action as sound (Godøy, 2003) and sound as action (Jensenius, 2007; Jensenius et al., 2006). Interaction with music is a bodily activity, influencing perception and even contributing to it. Cross-modal correspondence is a phenomenon with a tight interactive loop—the body is a mediator that engages in both perceptual and performative roles (Gritten and King, 2006, 2011). Some of these interactions cause motor cortex activation even while simply listening to music (Molnar-Szakacs and Overy, 2006). This has led to empirical studies on how music and movements of the body share a common structure that affords universal emotional expression (Sievers et al., 2013). Mazzola and Andreatta (2007) have also worked on a topological understanding of musical space and the topological dynamics of musical gesture. Buteau and Mazzola (2000) analyzed contour similarity models as motivic topologies for specific excerpts, proposing a motivic evolution tree (MET) that models changes in contour dynamics. These models are implemented in the Rubato software (Mazzola and Zahorka, 1994).

Studies on Hindustani music show that singers use a wide variety of movements and gestures during spontaneous improvisation (Clayton and Leante, 2013; Clayton et al., 2005; Rahaim, 2012). These movements are culturally codified; they are used in the performance space to aid improvisation and musical thought, and they convey musical information to listeners. Performers also use a variety of imaginary 'objects' with differing physical properties to illustrate their musical thought. Other examples of research on body movements and melody include Huron's studies showing that the height to which the eyebrows are raised while singing is a cued response to melodic height (Huron and Shanahan, 2013). There are also studies suggesting that arch-shaped melodies in particular have biological origins related to motor constraints (Tierney et al., 2011). According to the motor constraint hypothesis, arch-shaped melodic contours are the most energetically efficient for the vocal apparatus to produce, due to the increase and decrease in subglottal pressure during vocalization (Savage et al., 2017, p.332).


Auditory memory plays a key role in the discussion of melodic perception. The interplay between the time scales of auditory 'units' and how we remember them is discussed below.

4.3.3 Auditory Working Memory

Working memory (WM) is the system responsible for the temporary storage and simultaneous manipulation of information in the brain (Schulze et al., 2018), whereas short-term memory (STM) is used primarily for temporary storage. Experiments have shown that information processed in WM is important for higher-level functional planning, and for long-term memory (LTM). WM functioning with respect to auditory cognition is studied in a number of ways, using verbal, tonal, and speech stimuli.

Baddeley and Hitch's model of working memory, first proposed in 1974 (Baddeley and Hitch, 1974), is typically used in connection with music and language. The following concepts from Baddeley's model of working memory are useful in this discussion:

1. Visuo-spatial sketchpad: A sort of whiteboard for the mind that plays a role in physical action planning and control.

2. Episodic buffer: This was added as a final explanatory component, as late as 2000 (Baddeley, 2000), to explain the transfer of temporal objects from WM to LTM.

3. Phonological loop: The phonological loop is the STM store and articulatory rehearsal location in WM. It enables phonological, as opposed to visual, storage; the loop is also engaged during subvocal rehearsal, which involves motor activation, and even micro-movements of the vocal apparatus, during memory storage or recall. When subvocal rehearsal is not possible, for example under articulatory suppression conditions, verbal material is not remembered as well.

Mora (2000, p.150) explains the importance of singing to memorize something, where melody acts as a path towards remembering. This is also related to the phenomenon of having a song stuck in your head. She writes: "Murphey (1990) defines the 'song-stuck-in-my-head' phenomenon as a melodic Din, as an (in)voluntary musical and verbal rehearsal. Murphey also hypothesizes that the Din could be initiated by subvocal rehearsal. So, for example, we are able to rehear mentally the voice and words of a person with whom we have had an argument. Similarly, while reading the notes taken in a lecture, we will probably rehearse the lecturer's voice, while at the same time we can mentally visualize the place from which s/he was talking and even her/his gestures or body movements."


4.3.4 Multimodality

That we perceive the world using more than one sensory modality simultaneously seems to be the norm rather than the exception. This idea is described as 'multimodality': we perceive the world through several stimulus modes at the same time. When we see an action being performed, we expect to receive information about it through many different sensory modalities—we see something, we expect a sound to accompany it, we can infer its touch, and so on. Particularly with sound, a range of experiments have shown that significant sensory illusions can occur as a result of our expectations. An example is the McGurk effect, an illusion involving the auditory and visual modalities, in which different, aligned visual and auditory speech stimuli give rise to an 'overriding' effect of either the visual or the auditory stimulus. For example, if we hear just the audio stream of a voice saying 'Ga', we apprehend the syllable accurately; however, when it is played back with a video of someone speaking 'Da', our visual system dominates (Rosenblum et al., 1997). A related phenomenon, dubbed the 'cocktail party effect', arises in a situation with several guests speaking simultaneously: we are able to understand someone's speech better if we are looking at them directly (Arons, 1992). This seemingly simple effect is highly complex from the perspective of auditory cognition.

For this thesis, audition and motor action are the areas of neural processing in special focus. For a long time, the motor and sensory cortices were thought to function separately from each other. However, a number of studies have demonstrated the role of the sensory and motor cortices in perception tasks (Cheung et al., 2016). Robust neural activity in the motor cortex is observed while listening to speech sounds.

The motor theory of speech perception, proposed in the 1960s, suggests that "objects of speech perception are the intended phonetic gestures of the speaker" (Liberman and Mattingly, 1985, p.2). Studies have since confirmed these findings using various methods. A review article compiling various studies on speech perception (Pulvermüller and Fadiga, 2010) shows that, in action studies, patients with lesions affecting the inferior frontal regions of the brain have difficulty comprehending phonemes and semantic categories.

4.3.5 Ecological Psychology

James Gibson's work on visual perception in the 1970s laid the foundations for the ecological approach in psychology. This approach explores the connections between the ecological context, body movements, and the action-relevant information available to the perceptual framework. In this model, affordances—the possibilities for action, intervention, or use—are key to guiding perception. A drum is a well-defined musical instrument, but in principle, tables, chairs, and other rigid surfaces around us can become 'drums': a rigid surface 'affords' serving the function of a drum because of the possibilities for action it presents.


This model has action embedded at the core of our interaction with the world. Clark (1999) argues that this model explains our interaction with the world parsimoniously. For example, in the visual domain, a good model for catching a ball would include understanding the velocity, trajectory, and direction of the incoming ball, and looking at the action with which the ball was thrown. If our perception were based on complicated mathematics, our response times to motor stimuli would be much longer than they are in reality.

4.3.6 Conceptual Metaphors

In their work, Lakoff and Johnson (1980) propose a theory of conceptual metaphors, suggesting that the words we use to describe phenomena are drawn from a thematic vocabulary rather than being employed at random. For example, our word choices in certain phrases, such as to 'win' an argument and to look 'ahead' into the future, represent how we conceptualize arguments as war, and time as space. This theory is the foundation for the notion of conceptual metaphors, not only in language, but also in paralinguistic phenomena. These metaphors of space become apparent in gestures that accompany speech; for instance, someone might indicate 'yesterday' behind their back, and 'tomorrow' in front of the body.

Conceptual metaphor theory (Lakoff and Johnson, 1980) first suggested that such linguistic metaphors show that many abstract ideas are expressed using metaphors for spatial concepts. The relationship between linguistic metaphor and spatial orientation might also extend to the link between language and gesture. David McNeill (1992) explores this relationship in his book Hand and Mind: What Gestures Reveal About Thought, in which he discusses the nature of co-speech gestures and the idiosyncratic movements of the body that are associated with speech. He suggests that the microevolution of an utterance may, in turn, epitomize the macroevolution of the linguistic system from primitive gestures.

Metaphorics in gesture studies deals with gestures that serve as metaphors for abstractions in language. A good example of this is the 'cup-of-meaning' hand shape, where the speaker refers to an abstract idea with a cup-shaped gesture, which changes shape as the idea is refuted or changed. Data from several studies related to metaphorics in gestures suggest that speakers in demanding communicative situations—especially ones in which they have to express abstract ideas—conceptualize those ideas in space (Enfield, 2005; Nunez, 2004; Sweetser, 1998). Such metaphorical gestures may help in the conceptualization of abstract concepts by grounding them in space. These co-speech gestures are observable 'referents', even if the ideas they communicate are metaphorical or abstract. Music, however, is semantically void and cannot by itself make concrete references. The abstract nature of music makes it challenging to analyze gestures within any of these possible typologies. It also makes it hard to analyze the meanings of gestural associations. Do they represent cognitive schemas and back-end multimodal structures, or are they an epiphenomenon, reflecting habits formed during training?


In his book, Understanding Music: Philosophy and Interpretation, Scruton (2016, p.46) says:

“Of course the melody doesn’t literally move, it isn’t literally there. But we hear it all the same, by virtue of our capacity to hear metaphorically—in other words, to organize our experience in terms of concepts that we do not literally apply.”

This quote rings true in the qualitative observations of the participants' movement in the experiments conducted for this thesis. Something in the melody is perceived to be moving metaphorically, and can be represented through hand movements, using a variety of action metaphors.

4.4 Music Related Movement

Action produces sound, and action and sound are related to each other in a loop (Jensenius, 2007). Sound also invokes images of action. When we hear a loud drum stroke, we can imagine the force with which it might have been produced. In the following sections, I discuss gestural imagery in the service of musical imagery, and how musical gestures evoke sonic imagery.

4.4.1 Gestural Imagery

Godøy argues that the typology of sonorous objects can be extended to what he calls gestural–sonorous objects (Godøy, 2010). Theoretical approaches to this might include analyzing our conceptual apparatuses and our invocation of gestural metaphors in descriptions of sonorous objects; observation studies of music-producing and music-imitating actions; and sound-tracing studies.

In Schaeffer's work, sound types are presented with the following categories (Schaeffer et al., 1967b); a rough computational sketch follows the list:

1. Impulsive: The overall energy envelope is based on a sharp attack with a decaying resonance.

2. Sustained: The energy envelope is based on a continuous energy transfer that results in a continuously changing sound.

3. Iterative: The excitation pattern is built on a series of rapid and discontinuous energy transfers, or a series of attacks that tend to fuse into one another.
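The sketch below is a rough, hypothetical operationalization of these three categories (my own, not Schaeffer's or this thesis's): it derives a frame-wise RMS energy envelope from an audio signal and applies illustrative thresholds. All parameter values are assumptions for the sake of the example.

    # Hypothetical sketch: classifying a sound into Schaeffer's envelope categories
    # from its RMS energy envelope. Thresholds and peak-picking parameters are
    # illustrative assumptions, not values from the thesis.
    import numpy as np
    import librosa

    def envelope_category(y):
        rms = librosa.feature.rms(y=y)[0]          # frame-wise energy envelope
        peak_position = np.argmax(rms) / len(rms)  # where in the sound the peak lies
        # Count local energy maxima as a crude proxy for repeated re-excitation.
        peaks = librosa.util.peak_pick(rms, pre_max=3, post_max=3, pre_avg=3,
                                       post_avg=3, delta=0.05, wait=5)
        if len(peaks) > 3:
            return "iterative"   # rapid, discontinuous energy transfers that fuse
        if peak_position < 0.1:
            return "impulsive"   # sharp attack followed by decaying resonance
        return "sustained"       # continuous energy transfer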

Godøy (2018) proposed a three-part model consisting of gesture sensations that are influenced by, and also influence, continuous sound and multimodal gesture sensations. Godøy argues that Schaeffer's categories apply to music-related movement, and not just to the analysis of sound, leading to phrase transitions when playing between sounds of any category in the aforementioned typology. For example, an impulsive drum stroke turning into an iterative drum trill has these features present both in the sound and in the action producing the sound.


For music-related movement, he proposed the following categories:

1. Sound producing: These movements are necessary in producing sound; for instance, directly striking a drum.

2. Sound accompanying: Movements such as dance, walking, and sound-tracings, which are representational or imitating movements.

3. Ancillary: Movements facilitating sound production.

4. Communicative: These movements communicate specific musical moments to the audience.

In Godøy's model of gestural–sonorous objects, our mental images of musical sound are based on an incessant process of mentally following, or tracing, the features and qualities of a sound. In this model, gestural sensations have visual and motor components based on the biomechanical affordances of the sound objects and the physical constraints of our bodies.

It might seem to some that this model represents a listening scenario that is inorganic, or not representative of our real-world music listening behaviors. Removed from their electro-acoustic contexts, these images of movements and their tracings would not be accurate representations of the music. However, in research on the improvised gestures of North Indian classical music, Godøy's model applied even though that context was not part of the original propositions of motor-mimesis.

Ornaments as Multimodal Melodic–Gestural Objects

We are never completely still until we die. Body motion is studied from many angles and within different disciplines, including medicine, robotics, sports sciences, phenomenology, and psychology. Embodied and enactive cognition claims that much of human cognition is based on our ability to move and act in our environments.

The production of sound is intrinsically related to the performance of an action. We have to do something in order to produce a sound: hitting something produces a sound that is intrinsically related to the properties of the materials that contact each other, and the action involved in producing that sound influences it. For example, hitting a drum with a mallet with great force, from a distance, or gently, will produce different sonic possibilities. The coupling of action and sound is obvious in this simplest example of sound production—a collision of objects—and holds true for more complex actions that require the coordination of several muscle groups and body parts; for example, singing a tune or playing the piano.

Direct sound-producing actions are not, however, all that we do to produce music. There are many subtler, postural, and communicative dimensions of music-related motion; for example, a pianist's body swaying while they play a particularly delicate passage. When we listen to an energetic drum sequence, for example, we mentally simulate the movements of the percussionist.


In this chapter, I unpack the different ways in which our perception of sound is multimodal, and how we are able to demonstrate sound–shape associations using our bodies.

While we listen to music, the act of tapping our feet to the rhythm—so-called rhythmic entrainment—has been researched in great detail. The timing precision and limb control in our responses to different sound envelopes, and with different tapping effectors, have been studied far more than the embodiment of melody.

In order to become proficient at playing an instrument, it is typical to practise, over and over again, the ornaments that constitute the building blocks of the style. To play these ornaments intertwined in longer phrases, our execution of them has to be 'without thinking'. Ornamental symbols notated using shapes (for example, the trill, mordent, and glissando) resemble the contour shapes of the sound produced.

4.4.2 Gesture and Musical Gestures

A gesture can be defined as a movement of a part of the body, especially a hand or the head, to express an idea or meaning. Not all body movements are gestures; there is an implicit signification associated with gestures that is not generally associated with the terms actions or movements. Signification may include specific hand shapes and their associated movement qualities. Figure 4.1 illustrates hand shapes that are used as mnemonic devices while teaching children, representing the seven notes of the diatonic scale.

Shrugging is a great example of a gesture that has a clear, universal meaning. It is a short movement—the shoulders are raised quickly, and the palms face upwards, as if to show that you are empty-handed. If our hands are otherwise occupied, this gesture can be transposed into a shrug that incorporates only the shoulders and a tilt of the head. A shrug has cultural connotations; notionally, it indicates that the person does not have anything (in their hands). We can also execute this gesture on several time scales, using various effectors. Thus, communicating and understanding the gesture encompasses both its culturally understood meaning and its smooth execution.

We talk about 'muscle memory' in reference to tasks that we perform as automatisms, without explicit conscious awareness. Of course, our muscles do not actually hold memories; procedural memory in the brain is what enables the performance of these tasks. While gesturing, and especially while making gestures that have communicative intent, we are able to perform seemingly difficult-to-describe actions—such as 'wiggling your fingers'—easily, which is perhaps a result of much practice. Rosenbaum points out that an innocuous gesture such as 'wiggling your fingers' might seem simple to do, but once we start to describe how to do it, the description seems complicated (Rosenbaum, 2017, p.65).

The study of gestures involves multiple disciplines: their semiotic meanings and associations are studied as language, while their physiology and neurology are studied to understand manual control.


Figure 4.1: Solfa gestures are used to help memorize intervals in a major scale. This method has been used quite often to teach students of singing.

There may or may not be anything biological about the 'okay' gesture, in which the thumb and index finger form a ring and the other three fingers are splayed. But this type of gesture is understood by, for example, the speakers of a language, or those belonging to a particular culture. Tasks such as grasping objects can be scientifically studied in the context of biologically important fine motor skills.

Co-Speech Gesture

Speech is often accompanied by co-speech gestures: movements of the body, particularly of the hands. Co-speech gestures are considered paralinguistic, and have only recently been studied in the realm of formal linguistics. Gesturing usually begins before speech, and plays a central role in the development of language. The simplest example is understanding references from pointing.


Gesture studies have proposed several sub-types: iconic gestures indicate the shapes and sizes of the referenced objects; metaphoric gestures indicate metaphorical qualities; and beat gestures emphasize particular words in speech. Deictic or pointing gestures can indicate an actual object in space, an abstract reference, or an idea. Many people think of emblems as co-speech gestures; however, these are simply gestural icons in the shared vocabulary of the speakers of a language or the members of a cultural group—for example, the 'thumbs up' gesture.

Kendon (2004), in Gesture: Visible Action as Utterance, poses the following question at the outset:

“How can a person, in creating an utterance, at one and the same time, use both a language system, and depictive pantomimic actions? As a close examination of the coordination with gesture and speech suggests, these two forms of expressions are integrated, produced together under the guidance of a single aim.” (Kendon, 2004, p.2)

Kendon defines gesture as 'visible action as utterance' (Kendon, 2004), and describes the long-neglected body in linguistics as a central component in the affective processing of speech. Why do we make co-speech gestures, what are their typology and function, and how can we study them better? The 'noise' is the data, as it were. Researchers in the early twentieth century were preoccupied with studying the origins of language through gesture. In the middle of the century, however, this interest was deemed 'unscientific' (Kendon, 2004, p.64). The more recent evolution of interest in co-speech gestures and embodiment has enabled us to further our understanding of music and acoustic perception.

What types of gestures do we use to accompany speech? In the study of co-speech gestures, the following typology is proposed by McNeill (1992):

1. Beats: Rhythmic gestures that mark words or phrases as significant to the discourse/pragmatic content.

2. Deictics: These gestures point at concrete entities or particular spaces.

3. Iconics: They depict the form or movement of physical entities, or the physical relationship between them.

4. Metaphorics: They represent an abstract idea as if it could be held or was located around the speaker—for example, a small object you hold in your hands to mean 'this idea'.

In studying co-musical gestures in improvised, and especially vocal, music, we find that this typology of co-speech gestures can prove useful for understanding and describing gestures accompanying music (Pearson, 2016).

4.5 Action–Sound Perception

Jensenius proposes a model of action–sound couplings, and of the chain of production of action–sound for sounds that we perform intentionally.


Figure 4.2: Similarities between the action–fidgeting model and the posture-based motion planning theory.

In this model, a sound-producing action has an excitation phase that is preceded by a prefix and followed by a suffix. Godøy (2001) provides a broader background on sound-producing actions.

In Jensenius (2007)'s model of music-related movement, when controlling a musical instrument, movement phases, actions, and fidgeting might come together to form a chain of several movement units, action units, and fidgeting units. An action is distinguished from fidgeting in that the former is specific and goal-oriented. In this model, the term gesture is avoided entirely, in order not to conflate its multiple meanings. Instead, a four-part model is provided:

1. Movement: The act of changing the physical position of a body part or object.

2. Action: A movement unit or chunk; a goal-directed movement with a particular outcome.

3. Fidgeting: Non-goal-directed movements, perhaps unintentional or subconscious, classified as fidgeting, or 'movement noise'.

4. Interaction: The reciprocal influence of the moving parts in an action stream.

In this type of model, the goal-directed nature of certain, but not all, movements is key. Jensenius represents this as a stream of action–fidgeting–action. This feature of the model is similar to a different theory of bimanual control, called posture-based motion planning theory, which I explain in the following subsections.

4.5.1 Ideomotor Theory

Ideomotor theory proposes that whenever an action is executed, the mental representation of the movement itself is linked to the representation of its effects in the mind (Herbort and Butz, 2012).


This means that once this relationship is established, merely anticipating the effect allows us to create the appropriate movement. A review of the current state of work on ideomotor theory can be found in Shin et al. (2010).

Laboratory experiments have shown that holding an object, such as a gun, in one's hand greatly increases the chances of spotting the same object in videos. Another experiment showed that visual targets are perceived as being nearer when they can be touched with a handheld tool (Rosenbaum, 2017, p.97).

An experimental analysis of sound-tracings performed during simultaneous listening must include a discussion of ideomotor theory and of action-first approaches to perception. The central idea in this theory is that perception is greatly influenced by our intention to interact with objects in the real world.

4.5.2 Motor Control

David Rosenbaum's theory of posture-based motion planning (PBT) describes the motion planning phases in the execution of manual control (Rosenbaum, 2017). The main claim of this theory is that voluntary movements, instead of being produced directly in 'one fell swoop', are generated first by our nervous system as a series of 'goal postures'. Movements then translate the body from one goal posture to the next. In this discourse, a posture is a musculo-skeletal state.

Rosenbaum raises the question of whether there is a point below which we cannot control our motor responses. For example, to what extent do we control our hands when we clap? We might have conscious control over executing a clap, and may even choose how to position our hands, or adjust the intended loudness. Once we initiate the clap, however, it seems to perform itself. This image of control can differ depending upon the task at hand, the extent of practice, and a number of other factors. Rosenbaum calls mental representations of action states 'images of achievement', likening them to the reference conditions of feedback control theory. Without images of achievement, motor imagery does not have ideal states to which to aspire.

For a sequence of motor actions to precede purely perceptual states 'in the mind', many well-rehearsed action sequences must take place. In order to perform a task such as grasping, the following mental apprehensions are needed: distance calculation, a mental image of achievement or goal state, and a rehearsed aim with hand–eye coordination. The chunking or stacking of multiple goal states in skilled tasks, such as, say, arpeggio playing, will influence not just the performance of the action, but the perception of the objects in relation to our bodies.

In the execution of an action such as a sound-tracing, goal motor-states could be pre-planned and contour trajectories 'filled in', particularly if the sound stimuli are heard multiple times.


4.6 Summary

In this chapter, I have looked at the different ways in which music-related motion is executed, and at how action-aware models of musical perception can help us understand music as an embodied phenomenon. I believe that sound-tracing as an experimental paradigm brings together the multimodal mappings of pitched sound, the gestural imagery evoked by these sounds, and the defining geometries of these contours. I have also discussed gesturing during speech, and how this field can contribute to thinking about musical motion, and I have outlined some important mechanisms for action–sound associations, as well as some theories of motor planning and control. In the next chapter, I explain the experiments performed for this thesis, with details of the stimuli and the data sets created as a result.


Chapter 5

Data Sets and Experiments

5.1 Introduction

For the experiments in this thesis, I used a single stimulus set consisting of 16 melodies. Using these stimuli, I conducted two experiments, as detailed in this chapter. The resulting data sets are described below; they have been released as mentioned in the references.

One stimulus set was used for two experiments, resulting in three data sets, as shown in Figure 5.1.

5.2 Stimulus Set

The stimulus set consists of 16 melodies from four music traditions that involve singing without words. For this set, we picked the genres of classical vocalise, jazz scat, North Indian classical music, and Sámi joik. These musical traditions also represent the author's interests and experiences. We picked a short melodic fragment from a performance in each of the four musical traditions. As explained in the previous chapter, vocal melodies hold a closer relationship to the body, and they are often described in terms of our ability to hum them.


Figure 5.1: A stimulus set containing 16 melodies was used for two motion capture experiments, resulting in three data sets: one for melody–motion pairs, the second for repetitions in sound-tracings, and the third for singing back melodies after one hearing.


We specifically chose vocals without lyrics to avoid participants' perception of contours being influenced by words and word meanings. In most studies dealing with melodic contours, the role of text and lyrics has not been explicitly explored. However, the inter-dependency of prosody and text has been studied in compositional contexts.

We also selected stimuli with little to no accompaniment, in order to maintain the focus of the experiment on melodic contour and the monodic line only.


Figure 5.2: The 16 melodies used as the stimulus set for all the experiments in the thesis come from four different music cultures and contain no words. The X-axis represents time, and the Y-axis represents pitch height in MIDI notation.

5.2.1 Descriptions of the Musical Styles

Here I explain the cultural contexts of the musical styles chosen. The four musical systems were chosen because each has a tradition of singing without words. This was to make sure that melody occupies the central position in the composition.

North Indian Classical Music At the foundation of North Indian classical music lies the concept of a 'raga', which includes a scale and a grammar to guide and aid improvisation. In the modern system of 'khyal' singing, musicians improvise with the raga by establishing the grammatical form, and then singing over it on a vowel.


Nonsense syllables such as 're-ne-na-tom' are also used in some schools within the tradition, and in forms related to khyal singing, such as dhrupad, which relies on long improvisations.

The phrases chosen for the stimulus set come from lecture demonstrations by Pt. K. G. Ginde, available in the University of Washington Ethnomusicology Archives (University of Washington Ethnomusicology Archives, 1991). The raga of the melodic stimuli is Bhoopali.

Joik The joik is a song style of the Sámi people from northern Norway, Sweden, Finland, and parts of Russia. The joik is a sung, melodic form that is meant to evoke an element in the landscape, a person, an animal, or any other kind of entity. A joik might be composed based on any number of characteristics of the entity being evoked. Usually, a joik is sung without words, using syllables to aid and propel the melody.

The joiks chosen for this data set come from the Smithsonian Folkways collection, Lappish Joik Songs from Northern Norway, produced and recorded by Wolfgang Laade and Dieter Christensen (1956). The three pieces chosen are by Per Henderek Haetta (Quarja), Inga Susanne Haetta (Markel Joavna Piera), and Nils N. Eira (Track 49) (Laade and Christensen, 1956).

Jazz Scat Scatting is a technique in jazz singing in which meaningless syllables are sung to improvised melodies. Some trace the origins of scatting to people attempting to recreate percussion patterns using the voice. The syllables used, and their composition within improvisations, depend on individual performance techniques, time periods in scat singing, and other factors.

The scat stimuli used for this thesis come from Ella Fitzgerald's recording of "One Note Samba", composed by Antonio Carlos Jobim, from a 1969 performance at the Montreux Jazz Festival. The scat sections in this performance have little to no accompaniment.

Vocalise A vocalise is a piece, or a set of exercises, in the western classical singing tradition that is explicitly sung without words. Western classical music is, more often than not, composed with words. However, some improvised sections, such as the coda—a passage that functions as an extended cadence—use this technique of singing a vowel without text.

For this stimulus set, we used excerpts from the piece "Si, Ferite Il Chieggo", from Rossini's 1820 opera Maometto II. The chosen recording is sung by the soprano June Anderson, in a 1983 performance with the Ambrosian Philharmonic Orchestra.


Figure 5.3: Experimental flow for Experiment 1.

5.3 Experiments

5.3.1 Experiment 1: Melodic Tracings

A total of 32 subjects (17 female and 15 male) were recruited, with a mean age of 31 years (SD = 9 years). The participants were mainly university students and employees, with and without musical training. Their musical experience was quantified using the Ollen Musical Sophistication Index (OMSI) questionnaire (Ollen, 2006); they were also asked about their familiarity with the musical genres, and about their dancing experience. The mean OMSI score was 694 (SD = 292), indicating that the general musical proficiency of the subjects in this data set was quite high. The average familiarity with western classical music was 4.03 out of a possible 5 points, 3.25 for jazz music, 1.87 for joik, and 1.71 for Indian classical music. Thus, two genres (vocalise and scat) were more familiar than the two others (North Indian music and joik). All participants provided written consent before they participated in the study, and they were free to withdraw at any point during the experiment. The study obtained ethical approval (on 22 August 2016; project code 49258) from the Norwegian Centre for Research Data (NSD).

Procedure Each subject participated in the experiment alone; the total duration was around 10 minutes. Participants were instructed to move their hands as if to create the melody with their movements. The use of the term 'creating', instead of 'representing', was purposeful, as in earlier studies (Nymoen et al., 2012, 2013), to avoid the terms playing or singing. Subjects could stand freely, anywhere in the room, and face whichever direction they liked; nearly all of them faced the speakers and chose to stand in the center of the lab. The room lighting was dimmed to help the subjects feel comfortable and to encourage them to move as they pleased. The stimuli were played at a comfortable listening level on two Genelec 8020 speakers, placed 3 m in front of the subjects at a height of approximately 1.5 m. Each session consisted of an introduction, two example sequences, 32 trials, and a conclusion, as shown in Figure 5.3. Each melody was played twice, with a two-second pause between instances. During the first presentation, participants were asked to listen to the stimulus; during the second presentation, they were asked to trace the melody. A long beep preceding the melody indicated the first presentation of the stimulus; a short beep indicated the repetition of the stimulus. All the instructions and required guidelines were recorded and played back through the speakers, so as not to interrupt the flow of the experiment.



Figure 5.4: Flow of Experiment 2. The repetitions and sung sections are included in this experiment.

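The trial structure described in the procedure above could be scripted roughly as follows. This is a hypothetical sketch of the logic, not the actual experiment code; the `play` callback and the stimulus IDs are placeholder assumptions.

    # Hypothetical sketch of one trial in the sound-tracing experiment: a long
    # beep cues the listening pass; after a two-second pause, a short beep cues
    # the replay during which the participant traces the melody. `play` is an
    # assumed playback callback, not part of the real setup.
    import random
    import time

    def run_trial(stimulus, play):
        play("long_beep")
        play(stimulus)        # first presentation: listen only
        time.sleep(2)         # two-second pause between the two instances
        play("short_beep")
        play(stimulus)        # second presentation: trace while listening

    stimuli = [f"melody_{i:02d}" for i in range(1, 17)]   # placeholder stimulus IDs
    random.shuffle(stimuli)                               # randomized trial order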

5.3.2 Experiment 2: Melodic Tracings, Tracing Repetitions, Singing Melodies

For this data set, 20 subjects participated twice in the experiment described above, with eight stimuli chosen randomly each time. As such, there were two responses to each tracing by each participant in this data set. There were 9 male and 11 female participants in this study; the mean age was 28 years (SD = 7.4 years). They were mainly university students and employees, with and without musical training. Their musical experience was quantified using the OMSI questionnaire (Ollen, 2006), and they were asked about their familiarity with the four musical genres and their dancing experience. The mean OMSI score was 540.89 (SD = 342.92). The average familiarity with the four genres was as follows: 3.62, 2, 1.87, 2.93. All participants provided written consent before they participated in the study, and they were free to withdraw at any point during the experiment. The study obtained ethical approval (on 26 June 2017; project code 54653) from the Norwegian Centre for Research Data (NSD).

The flow of the experiment is shown in Figure 5.4. The details of the laboratory environment and the playback are the same as explained for Experiment 1 in Section 5.3.1. This second experiment included two sub-parts that were not included in the first experiment:

1. All 20 participants traced eight of the melodies in their stimulus set twice during the experiment. Although there have been several studies on sound-tracings in general, very few have looked at whether people repeat their own tracings.

2. Ten participants were chosen at random to perform another task, in which they repeated eight melodies in the experiment by singing them back after one hearing. The melodies are complex, and hard for even trained singers to remember in their entirety after hearing them just once. The singing was recorded using a microphone in the lab. Motion capture was used to determine whether participants naturally tended to use body movements to remember and reproduce the melodies. One participant had to be excluded due to technical problems with recording quality.


5.4 Data Sets Used

Data Set 1: Melody–Motion Correspondences through Sound-Tracings

This data set was collected in two stages, during which participants traced 16 melodies freely with their hands in two conditions. Participants had 21 markers on their bodies, and were recorded using eight infrared motion capture cameras. The data set contains a total of 32 subjects (17 female and 15 male), with a mean age of 31 years (SD = 9 years). They were mainly university students and employees, with and without musical training. Their musical experience was quantified using the OMSI questionnaire (Ollen, 2006), and they were asked about their familiarity with the musical genres and their dancing experience. The mean OMSI score was 694 (SD = 292), indicating that the general musical proficiency in this data set was on the higher side. The average familiarity with western classical music was 4.03 out of a possible 5 points, 3.25 for jazz music, 1.87 for joik, and 1.71 for North Indian music. All participants provided written consent before they participated in the study; they were free to withdraw at any point during the experiment.

After post-processing, this data set contained a total of 794 tracings; it has been released as supplementary material with the article 'Analyzing Free-Hand Tracings of Melodic Phrases' (Kelkar and Jensenius, 2018).

Data Set 2: Repetitions of Sound-Tracings

This data set consists of 20 participants tracing eight melodies in two iterations. The eight melodies were randomly selected from the stimulus set. The repetitions occurred at the end, after all 32 stimuli had been traced first.

Data Set 3: Singing Complex Melodies Back After One Hearing

For this data set, nine participants sang into a microphone eight randomly selected melodies from the 32 melodic fragments in the main stimulus set. Only eight melodies were selected in order to limit the duration of the experiment. A total of 72 melodic fragments were obtained as a result. Of the nine participants, 6 were male and 3 female. The mean age was 27 years (SD = 5.8 years). They were mainly university students and employees, with and without musical training. Their musical experience was quantified using the OMSI questionnaire (Ollen, 2006), and they were asked about their familiarity with the four musical genres and their dancing experience. The mean OMSI score was 540.89 (SD = 342.92). The average familiarity with the four genres was as follows: 3.78, 2.33, 1.89, and 3.00. All participants provided written consent before participation, and they were free to withdraw at any time during the experiment.

The post-processed data set contained the pitch extracted from each of the recorded phrases.


5.5 Summary

In this section, I described the stimulus set, the experiments, and the data collected. The same stimulus set was used to perform two experiments, which resulted in three data collections. The experiments involved participants listening to melodies twice and, in the second iteration, moving to the melody. Examining these data involves analyzing melodic material, motion capture data, and the comparisons and correlations between melody–motion pairs. This is described in detail in Chapter 6.


Chapter 6

Methods

“...if an algorithm performs well on a certain class of problems then it necessarily pays for that with degraded performance on the set of all remaining problems.”

– (Wolpert et al., 1997, p. 1)

6.1 Introduction

The data types and analysis methods used in the papers included in this thesis are elaborated on in this chapter. The work handles two types of data—music data as sound signals and movement data gathered from an infrared motion capture system—as explained in Section 6.2 and Section 6.3. The analysis methods for the two data types are separate, but there are several ways to analyze them together, as explained in Section 6.4. Figure 6.1 summarizes the signals, representations, analysis methods, and outputs that are explained in this chapter.

6.2 Sound Analysis

The stimulus material and the analysis of melodies in this thesis depend upon the analysis of sound and music in various formats of representation—primarily audio files and symbolic data. The stimuli are presented as audio and recorded material, which can be stored in any of several audio file formats.

For a deeper analysis of melody and melodic contour, we extract the fundamental melody from each of these audio files using the YIN algorithm (De Cheveigné and Kawahara, 2002). It has been shown that melodic contours are an abstraction of pitch and pitch patterns in music. There is little reason to suggest that sheet music should be the ideal source of comparison for contour models, as the perception of contour may or may not correspond to the algorithmic representation of pitch height, and melody may or may not be understood only as discrete pitches.
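As a minimal sketch of this step (assuming librosa's implementation of YIN; the file name is hypothetical), extracting a pitch contour from a monophonic stimulus could look like this:

```python
import librosa

# Load a monophonic melody stimulus (the file name is hypothetical).
y, sr = librosa.load("stimulus_01.wav", sr=None)

# YIN fundamental-frequency estimation (De Cheveigné and Kawahara, 2002);
# fmin and fmax bracket a plausible melodic range.
f0 = librosa.yin(y, fmin=librosa.note_to_hz("C2"),
                 fmax=librosa.note_to_hz("C6"), sr=sr)

# Frame times, useful for plotting the extracted pitch contour.
times = librosa.times_like(f0, sr=sr)
```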

6.2.1 Computational Representation and Features of Melody

Computational work in music hinges on the representation of music as data, and the way in which this is done determines and limits how research can be conducted. That which is often considered 'noise' in one data set may have a lot of meaning, relevance, and applicability in another analysis or interpretation. For example, the insights we gain from score-based representations of musical traditions that do not use scores to encode or perform music may be limited.


Figure 6.1: An illustration of the different levels of data handling and analysis in the experiments: melody and motion signals are processed with toolboxes (Librosa and pitch-detection algorithms for melody; the MoCap Toolbox and MoCapPy for motion) to extract features (pitch and dynamic curves; quantity of motion, acceleration, etc.), and the resulting melody–motion pairs are analyzed with methods such as CCA, time-series alignment, correlation, and classification to produce the results.

That said, the kind of representation is usually chosen to address and answer specific questions.

Aucouturier and Bigand (2012) have put forth the dilemma of human versus machine listening, weighing the pitfalls and merits of each in their paper. This question becomes increasingly important to answer as the fields of music information retrieval and music psychology progress.

To analyze melodies from musical material, one of two approaches is typically used to represent the material: symbolic or signal-based methods. Symbolic methods rely on transcribing melodies into their constituent pitch values using one of many notation types. This may include transcribing pitch into western notation, describing pitch classes, transcribing relative interval distances in semitones, and so on. Signal-based analysis of melodic material involves analyzing music or melodic features from recorded material using signal processing. Both these approaches have strengths and weaknesses. While a symbolic model allows for generalizations by pointing out exactly what we want to see in the data, the signal-based approach allows us to see the musical content as more than just the notes that we have deemed important. These two approaches were developed with completely different applications in mind. Symbolic data-based methods are popular in the analysis of western classical music because the practice of creating symbolic notation is common and accessible for that type of music. However, signal-based approaches allow us to truly expand our data sets to include music from anywhere in the world, and for non-musical material to be analyzed as music. The development of many symbolic notation methods confines the region of interest considerably, as most music in the world does not use notation as the main pedagogical or representational method. In this section, I will describe both of these approaches in detail, focusing on how they facilitate melodic contour analysis.

A melody can be represented in many ways: a musical excerpt sung, played, recorded, or recorded and played back; its structural features (such as notes); extracted contours; approximations or deviations of other typological shapes; a structural operation like transposition; or as belonging to a family of melodic groups like modes, ragas, and so on. Each of these melodic families excludes something that could be important for another melodic family. For example, a mode and a raga both have scale properties, but do not share the same concept of melodic grammar. What forms a melodic family is both culturally and musically determined. All of these features together describe the concept of a melody, but how we choose to represent melody in research is a deliberate choice.

Deciding on the level of representation is the first and most crucial choice in any empirical work related to music analysis. In Languages of Art, Goodman (1968) emphasizes that representing artistic or musical material is an act of curation, which includes breaking down the features, picking the ones that we desire in our system, and leaving behind undesirable ones. In western classical music, for example, it is commonplace to think of a score as a faithful representation of that which is most important to music—in a way, the skeletal or formative representation of that which is music. This idea translates poorly to other musical cultures; for example, in North Indian music, to confine the pitch continuum to the chosen discrete representations on a musical staff would be to take away from the essence of that music. Therefore, the act of choosing a method of representation also determines what is important or worthy of analysis, and by extension, might disregard the cultural values of another musical system.

How does the choice of representation of music relate to our perceptual apparatus? Research in pitch and pitch organization has focused on pitch with and without musical context. In his chapter 'Pitch and Pitch Structures', Schmuckler (2004) argues for ecologically valid perceptual contexts in pitch perception studies, since studying pitch outside its musical context presents only half the picture. In line with this contextual argument, the studies in this thesis use melodies in many kinds of contexts in order to formulate an ecologically valid picture. Therefore, I focused on analyzing material directly from musical recordings, as opposed to generating isotonic melodic material specifically for testing purposes. One consequence of this approach is that without reliable tools to describe this musical material, its analysis and comparison are impossible. Music information retrieval and signal processing have come a long way in understanding musical content; we use that body of knowledge as the basis for understanding all of our musical stimuli.

6.2.2 Symbolic Approaches

The history of musicology in the West relies on the study and analysis of scores that are part of a symbolic system. Notes are transcribed into discrete time, timbre, and pitch units. Rhythmic notation separates into discrete time units related to beat length; pitch units are separated into diatonic note positions, defined by a one-to-one relation with an absolute frequency; and timbre is typically written for certain instruments or instrument families. The evolution of this notation led first to the representation of ornaments over certain notes in the music, at which point only specific ornaments could be represented; later, more symbols were added in response to the need to notate more complex intonation and freer rhythmic and dynamic qualities. The expansion of western notation in this way does not mean that it stops being symbolic, but that the types of symbols and the relations between them expand. While the opposite of a symbolic approach would be an accurate signal-based representation, the latter cannot really be classified as notation. A notational system lies in the representational space somewhere between actualization and abstraction, allowing for enough abstraction to represent the intentions of the composer, but not to the extent that different versions of a piece are impossible.

This history is important because western notation informs, to a large extent, the conceptions we have about musical abstraction—for instance, how information is stored and manipulated for experimentation in musicology, psychology, and computer science. A discussion of symbolic musical systems matters for this thesis because a large amount of research on melodic contour analysis and melodic grammars relies on symbolic music.

Using symbolic data for melodic analysis involves feeding notes into a computer. This process divides musical time into discrete units using one of many available options. Using MIDI scores is a very common way of inputting symbolic musical data; in its early days, MIDI made it possible for electronic instruments to communicate with each other. In a review, Wiggins et al. (1993) presented a framework for analyzing the symbolic notation systems popular at the time (many of them still survive) along two axes: structured generality and expressive completeness. In any notation system, there is a trade-off between expressive completeness and structured generality.

Annotation and Transcription

Symbolic methods mainly rely on annotation and transcription. When corpora contain music that was written down first, such as classical music, this is easier than dealing with, for example, improvisation and other musical forms. In trying to analyze folk music, even European folk music, symbolic approaches rely on annotations of folk melodies (Van Kranenburg et al., 2007). However, in the living tradition, these melodies are not necessarily played exactly as they are notated, and are taught more often by ear than by reading a score.

Symbolic Music and Contour Analysis

Several studies focusing on melodic contour analysis begin by analyzing symbolic data in the form of western notation, pitch classes, or interval numbers. Contour analysis methods developed with symbolic music in mind often use up and down, or + and − signs, to represent the direction each note takes in relation to the previous one. Although this method is primarily used for western music analysis, it is often applied to other types of music as well (Eerola et al., 2006; Eerola and Bregman, 2007).

6.2.3 Music Information Retrieval

Music information retrieval (MIR) is an interdisciplinary domain that draws from signal processing, electrical engineering, machine learning, and music cognition, among other fields, to build a computational understanding of musical content. As a domain, MIR represents a scope of problems that can together help us solve a magnitude of cross-comparison issues in finding, searching for, and indexing music and music metadata. The MIR domain of problems is most widely used by music libraries and digital music service platforms that connect millions of listeners to millions of songs. The most common use case for several applications is building recommendation systems, but the impact of these systems on music psychology studies cannot be overstated.

Some examples of the use of MIR techniques include studies on emotional affect using mood detection algorithms (Juslin et al., 2014), where the MIR toolbox (Lartillot and Toiviainen, 2007) is used to enumerate the acoustic characteristics of the stimulus set, which are evaluated against listeners' ratings of affect. Knox et al. (2011) analyzed pain-relieving music using MIR techniques to study its acoustic characteristics. Malandrakis et al. (2011) used MIR techniques for emotion tracking in film music. MIR-based methods and tools are used on both symbolic and signal-based data sets.

Downie (2003) explains the facets of MIR systems and their challenges. The simplest and first challenge involves the transcription of pitch from audio recordings. This includes extracting melodies from polyphonic recordings, generating accurate transcriptions, and matching or comparing melodic contours with one another. The temporal challenges include the detection and identification of pitch durations and beat durations. A level above, there are challenges with tempo detection, and with detecting non-standard tempo behavior such as rubato or accelerando. The harmonic facet includes problems of detecting harmony from sound recordings or from scores; the difficulties in this domain lie mainly in the diversity of harmonic content and in the window of interpretation of harmonic analysis. The timbral facet deals with the ability to distinguish between several instruments, but can also be thought of as important for detecting instrumental expressivity. The editorial challenges include retrieving information from details of the sound signal: for example, markings on a score that could be inferred from hearing, such as dynamic changes, annotations of song sections, and the identification and tagging of structure. The textual facet includes the detection of lyrics, the synchronization of lyrics with sound recordings or scores, and the matching of melodies to text translations. Finally, the bibliographic challenges include the curation of metadata and information about the ontologies and tagging of sound files.

These challenges are further complicated by a range of problems associated with using MIR algorithms in several different scenarios. Downie (2003) underlines some of them:

1. Multi-representational challenge: Music data come in many formats, including sound recordings, scores, metadata, and motion data. Within each data type, there are several encoding standards and several physical formats, such as tapes, CDs, and digital formats. Making these data types accessible to different types of analyses presents a challenge of representation.

2. Multicultural challenge: Although the majority of early MIR algorithms and interests serve the western classical canon, different musical cultures, with their own epistemologies and data, pose a range of challenges under the umbrella of MIR.

3. Multi-experiential challenge: Creative users of music, both musicians and users of a range of applications developed for musical interaction, can use MIR-like methods in a range of generative or creative applications. This invites comparisons of music data with a large quantity of other kinds of data, such as physiological data, meta-annotations, and so on.

4. Multi-disciplinarity challenge: Inherently, MIR represents an area of interest in music information, rather than a range of techniques and methods. This means that multi-disciplinarity is an ever-present challenge, even though a range of MIR tasks and procedures have been standardized and accepted by the MIR community.

It is most important to remember that no single algorithm or group of algorithms can 'solve' these tasks in the way that we, with our ears and brains, can enumerate a range of information from an auditory stream. However, it is promising how far music information tasks have come, and how much they have diversified by integrating techniques such as AI.

Motion Capture and MIR

In their 2009 paper, Godøy and Jensenius make a case for a body-movement-based model for MIR tasks. The authors suggest that bodily sensations are a key feature of understanding musical style, and that in order to harness this model of music listening for music retrieval, we need movement-inducing cues and taxonomies of multimodal responses to music. This thesis takes a step towards this goal by modeling body movement to melody.

6.2.4 Extraction of Pitch Contours and Contour Analysis

A pitch detection algorithm is used to estimate the fundamental pitch or frequency of a sound signal or musical recording. Generally speaking, pitch detection algorithms operate in the time domain, the frequency domain, or both. They vary considerably depending on the task at hand: extracting pitch from a monophonic signal is a relatively simpler task than extracting melodic contours from recordings of many instruments playing together, for which a variety of other strategies are used (Bittner et al., 2017; Salamon et al., 2012).


6.2.5 Pitch Detection Algorithms

Generally, two approaches can be taken towards pitch detection: time-domain and frequency-domain analysis. The following is an overview of some commonly used algorithms and their descriptions:

1. Zero crossing: The simplest idea for pitch detection of a quasi-periodic signal in the time domain is to low-pass filter the signal and then detect peaks, or zero crossings. Linear predictive coding is often used to calculate f0 with this method.

2. Auto-correlation: This method implements the auto-correlation method of Boersma (1993) and is available in Praat. Essentially, the algorithm involves computing the correlation of a signal with a delayed copy of itself. Here, $r$ is the autocorrelation function of the signal $\chi(t)$ at lag $\tau$:

$$r(\tau) = \int_{-\infty}^{\infty} \chi(t)\,\chi(t+\tau)\,dt$$

3. Cepstrum: The cepstrum is obtained by computing the inverse Fourier transform of the logarithm of the estimated spectrum of a signal, as defined below:

$$C(\tau) = \left|\mathcal{F}^{-1}\!\left(\log\left|\mathcal{F}(x(t))\right|^{2}\right)\right|$$

4. Average Magnitude Difference Function: The AMDF pitch detector forms a function that is the complement of the autocorrelation function, in that it measures the difference between the waveform and a lagged version of itself:

$$D(\tau) = \int_{-\infty}^{\infty} \left|\chi(t) - \chi(t+\tau)\right|^{b}\,dt, \quad b = 1$$

5. Spectral subharmonic summation: This method, also available in Praat, performs a spectral subharmonic summation according to an algorithm by Hermes (1988). The idea of this algorithm is to arrive at a summation of each of the subharmonic components in the spectrum.

6. eSRPD: The enhanced super-resolution pitch detector algorithm by Bagshaw et al. (1993).

7. YIN: The YIN pitch estimator improves on the autocorrelation method with the additional step of a cumulative mean normalized difference function.

For the stimuli in the experiments, we use the YIN algorithm. The methods described above arrive at a representation of the pitch in a melody. Methods to model the contour, however, require a more general understanding of pitch trajectories.
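As a hedged illustration of the autocorrelation idea in method 2 above (a textbook sketch, not the implementation used for the stimuli; the sample rate and search range are assumptions), a time-domain f0 estimator can be written in a few lines:

```python
import numpy as np

def autocorr_f0(frame, sr, fmin=80.0, fmax=800.0):
    """Estimate f0 of a quasi-periodic frame via autocorrelation."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()
    # Autocorrelation r(tau) for non-negative lags tau = 0 .. N-1.
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Search only lags corresponding to the plausible pitch range.
    lag_min, lag_max = int(sr / fmax), int(sr / fmin)
    best_lag = lag_min + np.argmax(r[lag_min:lag_max])
    return sr / best_lag
```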


6.2.6 Contour Analysis Methods

Although melodic contour has been established as an interesting and important feature of musical processing, only a few quantitative models or formal descriptions of contour have been proposed. I will outline some of them here:

1. Parson’s Code for Contour Description Parson’s directory formelodic indexing is based on the idea that every subsequent note in a melody canbe represented using just three letters to denote direction: up (U), down (D), andrepeat (R), to generate a unique contour description for each melody (Parsons,1975). The book, which represents several thousand melodies in this way, claimsto help people easily identify a piece of music from its melodic identity. Forexample, Parson’s code for the theme of Beethoven’s Fifth Symphony would be:

*RRD URRD

This seemingly parsimonious model is surprisingly good at identifying a large number of melodies, given all the possible permutations and combinations of a long enough phrase. This model is also used in Musipedia, an online repository with many built-in melodic search algorithms (Irwin, 2008).
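The mapping from notes to code is simple enough to sketch directly; the following minimal Python function (an illustration, not Musipedia's implementation) reproduces the Beethoven example from MIDI note numbers:

```python
def parsons_code(pitches):
    """Parson's code (Parsons, 1975): '*' for the first note,
    then U/D/R for up, down, or repeat."""
    code = "*"
    for prev, cur in zip(pitches, pitches[1:]):
        code += "U" if cur > prev else "D" if cur < prev else "R"
    return code

# Opening of Beethoven's Fifth (G G G E-flat, F F F D) as MIDI numbers:
print(parsons_code([67, 67, 67, 63, 65, 65, 65, 62]))  # -> *RRDURRD
```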

2. Adam’s Contour Typology Adam’s contour typology, proposed in 1976,provides a way to think about melodic contours using the following features: therelative positions of the initial (I), final (F), highest (H), and lowest (L) pitchesin a melodic line. These are described as the minimal boundaries of a melodicsegment. Using these minimal boundaries, it is possible to identify a typology of12 different ways in which these four features relate to each other, where threekinds of relationships are possible between each pair: <, >, or =. Using these 12relationships, he defines a set of primary features of melodies: slope, deviation,and reciprocal. Here, slope is the relationship between the initial and final notesof a melody; deviation is a change of direction in the slope of the contour; andreciprocal is defined in terms of the first and only deviation. This gives us a totalof 15 different primary features for melodic shape. Secondary melodic featuresor contour shapes contain finer details about a melodic fragment; the authorexplains this as being the difference between describing any rectangle, versuslisting the properties of a particular rectangle. Repetition and the descriptions ofminimal boundaries form secondary melodic features. The author also proposesthe use of graphs to further elaborate melodic contours.

3. Morris’ Model of Contour Relations This model presents analgorithm for reducing a complex melody to its salient contour, similar tothe Urline of Schenkerian analysis (Katz, 1935), but without tonality as thecentral defining concept (Morris, 1993). This model presents 25 typologies offrequently occurring melodic contour patterns and another combination model.

4. Quinn’s Combinatorial Model for Pitch Contour Quinn’scombinatorial model 1999 creates a matrix representation of all note transitionsto one another. These matrices can then be compared with each other usingmatrix similarity related methods, giving us a model of melodic similarity thatdoes not rely upon accurate pitches being described.


5. Friedmann’s Contour Adjacence Series (CAS) and ContourClass model (CC) Both these models rely on creating a matrix for all theconstituent notes in a melodic phrase, looking at their transitional probabilitiesfrom one to another.

6. Time Pitch Beat Model: In this model, a time–pitch–beat (TPB) number triplet is coded for each melodic fragment. Subsequently, a resolution vector, Q, records the intervallic changes between each consecutive note; Q is therefore an n-dimensional vector for n+1 notes in the melody (Kim et al., 2000). The TPB vector is then used to find melodic similarities.

7. Fourier Analysis Techniques: Schmuckler (1999) proposes conducting a Fourier analysis of the melodic contour as a method to compare the contours of different melodies. This shifts contour description from the time domain to the frequency domain, and it has worked well in experiments. In Paper I, I analyze the melodies in the stimulus set using some of these models, to test whether they work simultaneously for tracings and melodies.
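A minimal sketch of the Fourier idea, with the resampling and normalization steps being my own assumptions rather than Schmuckler's exact procedure, could be:

```python
import numpy as np

def fourier_descriptor(contour, n_points=64, n_coeffs=6):
    """Low-order Fourier magnitudes of a pitch contour.

    Contours with similar rises and falls share energy in the same
    low frequencies, regardless of absolute pitch level.
    """
    c = np.asarray(contour, dtype=float)
    # Resample to a common length so descriptors are comparable.
    c = np.interp(np.linspace(0, 1, n_points),
                  np.linspace(0, 1, len(c)), c)
    c = (c - c.mean()) / (c.std() + 1e-9)  # remove key/level offsets
    spec = np.abs(np.fft.rfft(c))
    return spec[1:n_coeffs + 1]            # skip DC, keep low terms

def contour_similarity(a, b):
    """Correlation between two contours' Fourier descriptors."""
    return np.corrcoef(fourier_descriptor(a), fourier_descriptor(b))[0, 1]
```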

6.3 Motion Analysis

Motion capture, often shortened to MoCap, is a general term referring to a group of methods used to measure motion. There are a number of techniques that can be thought of as motion capture, and a recent overview can be found in Jensenius (2018). Verbal descriptions and annotations of movement are the simplest ways to capture motion; codes for describing gestures are often used in annotating the gestures accompanying speech (Poggi, 2002). Photography, video analysis, and light-sensing have also been used to capture motion (Jensenius, 2007). The following methods for motion tracking are the most important, and have been detailed by Nymoen (2013):

1. Acoustical tracking: The use of phase differences to measure reflections of sound, and hence the changing distance of a moving object (Bishop et al., 2001).

2. Mechanical tracking: Tracking or measuring the angles between two mechanical parts, where movement changes the length of a resistor, to infer the distance or angle of the tracked object, as used, for example, by De Laubier (1998). A joystick is a good example of this.

3. Magnetic tracking: Tracking the magnetic field of a moving electromagnet, measuring its distance from a stationary electromagnet (Bishop et al., 2001).

4. Inertial tracking: Using accelerometers and gyroscopes to track the position of an object based on its interaction with gravity (Bishop et al., 2001). Inertial Measurement Units (IMUs) are an example of this type of tracking (Tanenhaus and Lipeles, 2009), as are, for example, Xsens suits.


Figure 6.2: Pictures from the fourMs MoCap Lab where the experiments were conducted. On the left is the lab with the cameras and speakers mounted as shown; on the right is a participant wearing reflective markers.

5. Optical tracking: Using infrared (IR), regular video, stereo video, thermal, or depth cameras to estimate an object's position in space. Optical tracking requires post-processing using computer vision algorithms.

The main methods for motion capture in this thesis revolve around the infrared capture of motion data. Optical tracking systems with infrared cameras transmit and receive infrared light, measuring the location coordinates of light-reflective markers. Motion capture systems work by calibrating a known configuration of distances between markers and the shape of the layout to lay down a grid. Thereafter, new markers can be introduced and computed by calculating distances as determined by each of the different cameras. This produces a time-sensitive trace of the infrared markers as seen by all of the cameras, measuring movement accurately in space and time. Motion capture is used extensively to create geometric models of three-dimensional beings for animation and computer-generated imagery (CGI). Owing to its precision and sensitivity, it has been a great tool for analyzing performing bodies.

6.3.1 Tracking Data through Infrared Motion Capture Systems

In infrared motion capture systems, tracking data are usually recorded as three-dimensional (3D) coordinates for every point that is tracked by a reflective marker. The coordinate axes are situated in space using calibration kits, which are pre-programmed so that the system understands their placement and can calculate the tracking of all other points based on this calibration. This initial placement of the calibration kit results in the creation of a local three-dimensional coordinate system. Once this framework is in place, the subsequent tracking of markers in the motion capture system is recorded along this three-dimensional local coordinate system (LCS), sometimes called the laboratory coordinate system.

On entering the lab, a participant is required to wear a dark MoCap suit; reflective markers, representing the points that need to be tracked, are placed on the suit.

6.3.2 Details for the Experiments in the Thesis

The experiments conducted in this thesis were carried out in the fourMs motion capture lab at the University of Oslo using a Qualisys infrared motion capture system. The system consists of eight Oqus 300 cameras surrounding the space, and one regular video camera (Canon XF 105). In Qualisys, the LCS is usually a Cartesian coordinate system: the XY plane is usually the floor, and the Z axis represents vertical motion.

For the experiments conducted in this thesis, each participant wore a MoCap suit with 21 reflective markers on the joints (Figure 6.3). The labeling scheme is explained in detail in Appendix A. Most of the studies in this thesis involve analyzing hand-marker positions; however, the full body was tracked in order to retain additional information for possible further analyses, and for qualitative analysis and visualizations of the entire body.

The system captures data at up to 500 Hz. For these experiments, we used a capture rate of 200 Hz, which is sufficient to record movement in this context. We also made a video recording of all the participants, so that we could hear the sounds played in the lab during post-processing.

6.3.3 Post-Processing

Once the data were recorded, they were post-processed and cleaned before analysis. The post-processing phase consisted of marker labeling, removal of ghost markers, gap-filling, and smoothing. Thereafter, the data were exported to a file format that I discuss below.

Marker Labeling: Marker labeling is essential to understanding which joint or part of the body each marker represents. This was done using a list of codes, a shorthand system for easily understanding which body part each marker refers to, and on which side of the body, as shown in Figure 6.3. The complete list of labels is explained in Appendix B.

Ghost Markers: A MoCap system might often track ghost markers, mistaking small glitches or reflections for markers; these must be removed manually. Many ghost markers are present for very short amounts of time, making it easy to remove them.


Figure 6.3: An example of a post-processed motion capture stick figure. A detailed list of marker labels can be found in Appendix B.

Gap Filling: Lastly, MoCap recordings may contain missing frames because of tracking errors, drops in the data packets sent over a network, occlusions, or a moving body part overlapping a marker at a certain time point. Usually, these gaps are only a few frames long, and can be filled using interpolation and smoothing.

Common techniques for gap-filling include nearest-neighbor interpolation, linear interpolation, and polynomial interpolation. Nearest-neighbor interpolation looks for the nearest points around the gap and does a step interpolation between them. Linear interpolation aims to connect the available points before and after a gap using an interpolated line that passes through the points on either side of the gap.

In the post-processing of this data set, polynomial interpolation was used for gap-filling to ensure smooth trajectories.
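As a small sketch of the gap-filling step (linear interpolation shown for brevity; the data set itself used polynomial interpolation, for which spline routines could be substituted):

```python
import numpy as np

def fill_gaps(trajectory):
    """Fill NaN gaps in a 1-D marker trajectory by linear interpolation."""
    t = np.arange(len(trajectory))
    out = np.asarray(trajectory, dtype=float).copy()
    missing = np.isnan(out)
    out[missing] = np.interp(t[missing], t[~missing], out[~missing])
    return out
```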

6.3.4 File Formats for Motion Capture Files

Once the data are labeled, cleaned, and gap-filled, they can be exported into one of many formats for analysis. The Qualisys Track Manager (QTM) system is a tool that offers visualizations of the data for inspection, but detailed analysis requires exporting the files into a scripting or programming environment.


Figure 6.4: An illustration of the flow of mocap data: motion capture produces QTM files, which are post-processed in QTM and exported as TSV files; these are imported into Python/Jupyter notebooks for normalization and feature extraction, and the normalized TSV files are re-exported as the data set.

Below, I list some common file formats used for handling MoCap data.

1. TSV (tab-separated values): This file format begins with a list of capture attributes, such as capture rate, number of markers, length of capture, and marker list. Below these details, the format outputs one column per axis per marker of data; each row represents a frame of capture.

2. C3D (coordinate 3D): C3D is commonly used as a standard file format in biomechanical research. The C3D specification expects physical measurements to be one of two types—positional information (3D coordinates) or numeric data (serial information). Each 3D coordinate is stored as raw (X, Y, Z) data samples with information about the sample: accuracy (the average error or residual) and camera contribution (which specific cameras were used to produce the data).

3. MAT (MATLAB files): The .MAT file format can be used to export data so that it is easily readable in MATLAB. The data are output as a struct, with matrices for the 3D coordinates of each point.

4. AVI (audio video interleave): Video files can be output from QTM into AVI. These are useful in video analyses.

5. BVH (BioVision Hierarchy): Although QTM does not currently output to it, BVH is a common file format for 3D animation. BVH is compatible with a variety of software that uses motion capture for artistic or animation purposes.

For our analysis, we chose to export data as TSV files. This made it easy to import them into Python or MATLAB for analysis. Since the released data set contains hand movement data, TSV also makes for a compact format. Figure 6.4 illustrates the flow of mocap data through this whole process.
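Reading such an export into Python might look like the following sketch; the file name, the number of metadata rows, and the column layout are assumptions that depend on the QTM version and export settings:

```python
import pandas as pd

# Skip the metadata header before the data table; inspect the file
# first, as the number of header rows varies between exports.
N_HEADER_ROWS = 11
frames = pd.read_csv("tracing_01.tsv", sep="\t",
                     skiprows=N_HEADER_ROWS, header=None)

# One column per axis per marker, one row per captured frame; here
# we assume the first three columns hold one marker's X, Y, Z.
marker_xyz = frames.iloc[:, 0:3].to_numpy()
```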


6.3.5 Feature Extraction

The motion capture data were imported into Python (v2.7.12) and MATLAB (R2013b, MathWorks, Natick, MA, USA). All of the 10-minute recordings were segmented using automatic windowing, and each of the segments was manually annotated for further analysis.

It is useful to understand the features derived from motion capture recordings in order to grasp how motion data can be processed and analyzed. In my own work, I like to differentiate between physical and perceptual features of motion, analogous to the distinction between the physical and psychoacoustic properties of sound. While physical attributes refer to mathematical formulae directly applied to motion capture data, perceptual features of motion identify something about the quality with which a movement is performed; for example, the perceived smoothness of a movement. For this analysis, we focus on physical features and not perceptual ones.

There may also be features that relate particularly to the task at hand. For example, in the experiments conducted for this thesis, most people used the hands as the 'primary effectors' for movement representations of melodies. This meant that the distances between the hands, the symmetry between the hands, and so on were important features that needed to be measured. For example, we compute the following features (a small code sketch follows the list):

1. Velocity: The first derivative of position data describes the velocity of a marker. This can be described along any of the three dimensions.

2. Acceleration: The second derivative of position data describes the acceleration of a marker.

3. Jerk: The third derivative of position data describes the 'jerk'. Jerk data typically represent sudden changes in acceleration.

4. Jounce: The fourth derivative of position data describes the jounce, representing sudden changes in jerk.

5. Quantity of motion (QoM): This describes the level of movement. It is calculated as the average of the vector magnitude for each sample.

6. Range: The range of a marker describes the minimum and maximum values of the marker for the duration of calculation across each axis.

7. Cumulative distance: The cumulative distance traveled is calculated as the summation of the Euclidean distances that each marker travels.
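The kinematic features in this list reduce to repeated numerical differentiation of the position data. A minimal sketch (assuming the 200 Hz capture rate used here and one marker's position array; not the exact pipeline code) is:

```python
import numpy as np

def kinematics(pos, sr=200):
    """Velocity, acceleration, and jerk of one marker.

    `pos` is an (n_frames, 3) array of x, y, z positions captured
    at `sr` Hz (200 Hz in these experiments).
    """
    vel = np.gradient(pos, 1.0 / sr, axis=0)    # first derivative
    acc = np.gradient(vel, 1.0 / sr, axis=0)    # second derivative
    jerk = np.gradient(acc, 1.0 / sr, axis=0)   # third derivative
    return vel, acc, jerk

def quantity_of_motion(pos, sr=200):
    """Average magnitude of the velocity vector across samples."""
    vel, _, _ = kinematics(pos, sr)
    return np.linalg.norm(vel, axis=1).mean()

def cumulative_distance(pos):
    """Summed Euclidean distance traveled by the marker."""
    return np.linalg.norm(np.diff(pos, axis=0), axis=1).sum()
```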

Features for Hand Markers

Table 6.1 includes a description of features specifically developed to analyze the motion data of the hand markers. The features in Table 6.2 are constructed specifically to match the movement strategies observed during data exploration, as explained in Paper II.


1. VerticalMotion: z-axis coordinates at each instant of each hand
2. Range: (min, max) tuple for each hand
3. Hand Distance: the Euclidean distance between the 2D coordinates of each hand
4. Quantity of Motion: the sum of the absolute velocities of all the markers
5. Distance Traveled: cumulative Euclidean distance traveled by each hand per sample
6. Absolute Velocity: uniform linear velocity of all dimensions
7. Absolute Acceleration: the derivative of the absolute velocity
8. Smoothness: the number of knots of a quadratic spline interpolation fitted to each tracing
9. VerticalVelocity: the first derivative of the z-axis in each participant's tracing
10. CubicSpline10Knots: 10 knots fitted to a quadratic spline for each tracing

Table 6.1: The features extracted from the motion capture data to describe hand movements.

6.3.6 Toolboxes for MoCap Data

The Biomechanical ToolKit (BTK) has a stand-alone application—Mokka, or Motion Kinematic and Kinetic Analyzer (Barre, 2013)—which facilitates easy analysis of MoCap data through a graphical user interface. This application has tools for 2D and 3D visualizations. Mokka is built with Java, and provides a timeline view. C3D files can be imported, and electromyography (EMG) analysis can be integrated into Mokka.

MoCap Toolbox for MATLAB

The MoCap Toolbox, developed at the University of Jyväskylä, includes a variety of algorithms for feature extraction from MoCap data. This toolbox was developed especially for working with music- or dance-related motion, and can be used to calculate several motion features.

Python Toolboxes Overview

Several toolboxes in Python have been written specifically to analyze MoCap data. A small overview is listed below:

1. BTK (Biomechanical ToolKit): Developed as a Python wrapper along with the stand-alone application Mokka.


1. Dominant hand as needle: right-hand QoM much greater or much less than left-hand QoM; QoM(LH_Y) ≫ or ≪ QoM(RH_Y)
2. Changing inter-palm distance: root mean squared difference of the left and right hands in x; RMS(LH_X) − RMS(RH_X)
3. Lateral symmetry between hands: nearly constant difference between the left and right hands; RH_X − LH_X = C
4. Manipulating a small object: right and left hands follow similar trajectories in x; RH(x, y, z) = LH(x, y, z) + C
5. Drawing arcs along circles: fit of (x, y, z) for the left and right hands to a sphere; x² + y² + z²
6. Percussive asymmetry: dynamic time warp of the (x, y, z) of the left and right hands; dtw(RH(x, y, z), LH(x, y, z))

Table 6.2: Quantitative motion capture features that match the qualitatively observed strategies. QoM refers to quantity of motion.

2. PyMo: Created by Omid Alemi, this is a Python library for machine learning research on motion capture data. The analysis tools are written to process Blender BVH files.

3. C3D wrapper: This toolbox reads and writes C3D data into Python.

Additional Contributions

As a part of this thesis, I built a motion capture toolbox in Python. It is not fully developed yet, but I have included its data structures and functions in Appendix B.

6.4 Motion-Sound Analysis

Up until this point, I have described the analysis methods for (1) motion capture data and (2) audio files and music-related data. In the next section, I will describe the methods used to interpret the data as melody–motion pairs, as suggested in Godøy and Jensenius (2009).


6.4.1 Visual Inspection and Visualization

Visualization and visual inspection are powerful tools for understanding multimodal data, especially while analyzing them together. While developing techniques to analyze new data, visualization is often the first method used to consider the possible patterns in the data. Good visualization techniques also help communicate quantitative findings more comprehensively, both to other researchers and to a general interested audience. Figure 6.5 demonstrates the visualization of quantities of motion for different melodies by the same participant.

In sound-tracing and the other data handled in this thesis, the development of new models to analyze and explain the data depends upon how we can first explore and understand what lies in the data. In order to understand the critical elements present in the data, visualization is an important first step. Motion data visualization can be tricky but important. The first reason for this is that the data are inherently multidimensional, and interpreting them requires us to resist seeing the motion tracings as human movements alone, and to also look at the nuances of the tracings. The second reason is that we often consider combining meta-level analytical features from these data to understand what is going on. For example, a simple 3D visualization may not bring out what an acceleration plot can.

Toiviainen and Eerola (2006) present an overview of visualization methods for various music-information-related tasks. Visualization techniques can also include methods such as principal component analysis (PCA), which offers a better interpretation of trends in the data. The authors also use self-organizing maps (SOMs), an unsupervised learning method, to organize melodies in a hyperspace. An SOM calculates and arranges data points with respect to each other in the learning space.

6.4.2 Statistical Testing

Statistical hypothesis testing is a method of statistical inference that draws on a series of mathematical measures to accept or reject a hypothesis. There are many tests, and the appropriate test should be chosen according to the type and distribution of the data. Since I have used t-testing in evaluating a portion of the data in Paper II, I describe this method briefly below.

t-testing

A t-test is one of the simplest tools for calculating whether the difference between random samples from two populations is statistically significant. The process involves defining a null hypothesis stating that the means of the populations are equal; the null hypothesis is then supported or rejected by the t-test. The power of a t-test lies in its ability to test competing hypotheses on a smaller sample set by approximating the normal distribution function of the data set. A t-test that shows statistical significance establishes a difference between two sample sets.


Figure 6.5: Visualizations of quantities of motion for visual inspection.

The t-statistic is calculated using the following formula:

$$t = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$

Here, $t$ is the test statistic, $\mu$ is the population mean, $\sigma$ the standard deviation, and $\bar{X}$ the sample mean of the $n$ samples $X_1, X_2, \ldots, X_n$ in the data. In statistical significance testing, the p-value is used to determine statistical significance by comparing a known probability distribution to the statistics obtained from the data, at a significance level that is most meaningful for the analysis. Typical levels for statistical significance lie between 0.001 and 0.05, which gives us a confidence of 95 to 99.9 percent that the null hypothesis can be rejected.
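For illustration, a two-sample t-test of a motion feature between two groups could be run as follows; the data here are random placeholders, not values from the experiments:

```python
import numpy as np
from scipy import stats

# Hypothetical groups, e.g. quantity of motion for two participant
# groups; replace the placeholders with real feature values.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=1.2, scale=0.3, size=20)
group_b = rng.normal(loc=1.0, scale=0.3, size=20)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```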

6.4.3 Machine Learning and Artificial Intelligence

Broadly, machine learning describes a set of mathematical paradigms for performing discrimination and pattern recognition tasks. In the simplest of these tasks, a machine can tell apart two different categories of data, as could a basic perceptron machine, invented in the late 1950s. The evolution of algorithms and techniques within machine learning, exponential growth in computational power, and an ever-improving range of data sets have enabled this field to grow rapidly.

A widely quoted definition of machine learning is by Samuel (1959): [Machine learning is a] “field of study that gives computers the ability to learn without being explicitly programmed”. The notion of explicit programming here is that of writing conditionals into the code, such that the computer can execute it without a model of the categories it is able to discriminate among.

Machine learning problems deal with data sets, which are representations of real-world phenomena in terms of data. These data can come in many possible formats, and a learning task may belong to one of several different classes of problems. The objective of learning a particular quality of the data might also vary—we might need to predict one variable in comparison to another, detect the presence of a certain characteristic, generate a string of words based on a prompt, and so on. The only reason all of these various classes of problems are included under the umbrella of learning is that, instead of following explicit instructions, the computer approximates variables, features, weights, or other parameters. Feature sets are thus commonly encountered in this paradigm, referring to the properties of each data element. Machine learning pipelines often include 'training' and 'testing' phases: training is carried out on one portion of the data set, and the performance of the algorithm is tested on the rest of it.

Broadly speaking, here are some categories of machine learning methods (Bishop, 2006, p. 3):

1. Supervised learning: These methods include pipelines where the training set includes category labels. This means that the learning happens through examples.

2. Unsupervised learning: When category labels are absent in the training, learning methods are described as unsupervised.

3. Reinforcement learning: This set of methods is based on the idea that an agent in the world learns from positive and negative feedback on its decisions.

The following can be described as typical learning problems:

1. Classification: A classification task distinguishes between different categories, given some features. The simplest case is binary classification; for example, differentiating between two instruments based on their sound streams. A classification task may also involve multiple classes, for example classifying handwritten digits.

2. Regression: A regression task predicts the change of one variable with respect to another; for example, analyzing the correlation between melodic data and motion trajectories to measure their covariance.

3. Clustering: A clustering task is an unsupervised method for representing the data in a hyperspace such that it separates into k clusters. An ideal clustering task would involve maximizing the distances between clusters and minimizing the distances of points from their respective centroids.


4. Retrieval: An example of this type of task is to find a melody, given a contour profile; for example, as seen on Musipedia, where a user can input a contour profile based on Parson's code of contour directions (Parsons, 1975) to retrieve melodies that fit the profile (Irwin, 2008).

Neural Networks and Deep Neural Networks: A neural network is an architecture of artificial 'neurons' that simulates the interconnectedness of neurons in the brain. Neural network architectures are a particular class of learning algorithms. Neural networks have also been applied to generating new material; for example, interpolated faces, or melodic material.

A neural network must be trained for a classification or generation task using a sufficient number of data examples. Usually, architectures differ in how the error is propagated through the different neuronal 'layers' (the error can be passed forwards, backwards, in both directions, and so on), and other sub-networks can be added to specific layers. A deep neural network adds many more layers to the network architecture; the presence of several additional layers means that the network is better equipped to deal with more non-linearity.

Multimodal Retrieval: For the data sets and experiments described in the previous chapter, the analysis compares two types of data. In computation, this is referred to as multimodal analysis; for example, the joint analysis of image and text pairs. In the multimodal retrieval paradigm, different types of data are handled together. The objective is to learn a set of mapping functions that project the different modalities onto a common metric space, so as to be able to retrieve relevant information in one modality through a query in another. This paradigm is often used in the retrieval of images from text and of text from images. The multimodal retrieval method most relevant to this thesis is canonical correlation analysis (CCA), explained in detail in Section 6.4.5.

6.4.4 Time Series Distance Metrics

A simple way to analyze sound-tracing movement data is to compare different time series. Several tasks in the appended papers involve measuring the distance between two time series. In essence, time series are matrices. Several methods have been proposed for time series distance matching, which can be imagined as measuring the extent of the variance between two different time series. Broadly, the methods can be classified as follows:

1. Reduced dimensionality measures: These are useful when time series are large or tightly sampled, and time series distance therefore cannot be measured by sample-to-sample distances. Reduced dimensionality measures can include, for example, a discrete Fourier transform followed by a comparison between the transforms. The discrete cosine transform (DCT) and Chebyshev polynomials are other ways to measure time series distance using reduced dimensionality.


2. Distance measures: The easiest way to think about distance measures is to calculate the Euclidean or edit distance, which would be the per-sample distance between the two series. Edit distance measures can also be improved by introducing edit distance penalties, or by using a weighted alignment before the distance is calculated.

3. Longest common subsequence (LCS): LCS is a set of problems for detecting the longest common subsequences in any two sequences of data. This problem can be adapted to string commonalities and continuous time series.

4. Data adaptive measures: Data adaptive measures first fit a function to the series data before computing distance; for example, polynomial interpolation or regression using piecewise polynomials or splines. In these methods, the approximated functions are then compared to find distances. For symbolic data, such methods can include substrings, case modification, or other kinds of approximation.

5. Non-data-adaptive measures: Wavelet transforms, spectral measures such as the discrete Fourier transform (DFT) and DCT, and piecewise aggregation approximation methods can be used.

Time Series Similarity Measures

We can also take another approach: instead of trying to find the distance between two time series in a hyperspace, we can measure their similarities. Broad classifications of similarity measures are as follows (a minimal sketch of dynamic time warping follows the list):

1. Lock-step measures: Distance measures such as the L1 and L2 norms.

2. Elastic measures: DTW, LCS, edit distance, and sequence-weighted alignment can be thought of as elastic distance measures because they aim to match subsequences or subsections of the data.

3. Threshold-based measures: TQuEST, or the threshold query execution method.

4. Pattern-based measures: Spatial assembling distance, also called SpADe.
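To make the elastic idea concrete, here is a minimal dynamic time warping distance for two 1-D series (an unoptimized textbook sketch, not the implementation used in the papers):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D series.

    Samples may be matched across temporal misalignments, unlike
    lock-step measures such as the Euclidean distance.
    """
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed alignments.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```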

6.4.5 Canonical Correlation Analysis (CCA)

Canonical correlation analysis (CCA) is a common tool for investigating linear relationships between two sets of variables. In a review paper, Wang et al. (2016) analyze several models for cross-modal retrieval. CCA allows us to approach the data through real-value-based methods, and also within a pairwise retrieval paradigm. Like unsupervised methods, CCA uses similar pairs to learn meaningful distance metrics between different modalities.


Figure 6.6: A representation of the CCA algorithm used; fa and fb represent the two different data sets—melodic and motion features, respectively.

CCA has also been used previously to show inter-relationships between music and brain imaging (Berger and Dmochowski, 2017).

A previous study analyzing tracings of pitched and non-pitched sounds also used CCA to understand sound–motion relationships; the authors describe the inherent non-linearity in the mappings, despite finding intrinsic sound–action relationships (Nymoen et al., 2011). This work was extended in a later paper, where CCA is used to interpret how different features correlate (Nymoen et al., 2013). Pitch and vertical motion have linear relationships in this analysis, although it is important to note that the sound samples used were short and synthetic.

CCA has also been used in research related to the analysis and retrieval of music–motion data (Nymoen et al., 2011, 2013; Caramiaux et al., 2009; Caramiaux and Tanaka, 2013). The biggest reservations about analyzing music–motion data using CCA are that non-linearity cannot be represented, and that the method is highly dependent on time synchronization, since it assumes that the relation between motion and sound remains linear as they evolve over time (Caramiaux and Tanaka, 2013). To get around this, kernel-based methods can be used to introduce non-linearity. Ohkushi et al. (2011) present a study that used kernel-based CCA methods to analyze motion and music features together using video sequences from classical ballet and optical-flow-based clustering. Bozkurt et al. (2016) present a CCA-based system to analyze and generate speech and arm motion for a prosody-driven synthesis of the 'beat gesture', which is used to emphasize prosodically salient points in speech. CCA is used to explore the data sets in this thesis due to the previous successes associated with this family of methods. The same data are also analyzed using deep CCA, a neural network approximation of CCA, to better understand the non-linear mappings.

CCA is a statistical method used to find linear combinations of two sets of variables, $X = (x_1, x_2, \ldots, x_n)$ and $Y = (y_1, y_2, \ldots, y_m)$, with $n$ and $m$ independent variables respectively, using weight vectors $a$ and $b$, such that the correlation of the transformed variables, $\rho = \mathrm{corr}(aX, bY)$, is maximized. Furthermore, additional vectors, $a'$ and $b'$,


can be found—they maximize the correlation and are not correlated with theprevious transformed variables. This process can be repeated till d = min(m,n)dimensions.
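As an illustration of the procedure, the following sketch computes two canonical pairs with scikit-learn. The feature matrices are random placeholders standing in for the melodic and motion feature sets (fa and fb in Figure 6.6); with real data, the rows would be paired observations.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
melodic = rng.normal(size=(200, 6))  # placeholder melodic features (f_a)
motion = rng.normal(size=(200, 8))   # placeholder motion features (f_b)

cca = CCA(n_components=2)
Xc, Yc = cca.fit_transform(melodic, motion)

# Correlation of each canonical pair; successive pairs are uncorrelated
# with the previous ones, as described above.
for k in range(2):
    r = np.corrcoef(Xc[:, k], Yc[:, k])[0, 1]
    print(f"canonical correlation {k + 1}: {r:.2f}")
```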

6.4.6 Deep CCA

Deep CCA is a neural network based approximation of CCA. By introducing non-linear network layers, we are able to approximate non-linear relationships in the problem: CCA is extended by using neural networks to transform the X and Y variables (Andrew et al., 2013). Given the network parameters, θ1 and θ2, the objective is to maximize the correlation corr(f(X; θ1), f(Y; θ2)). The network is trained by following the gradient of the correlation objective as estimated from the training data.
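The following is a deliberately simplified sketch of this idea, assuming PyTorch: two small networks transform the two variable sets, and gradient ascent maximizes the correlation of a single projected pair. The full deep CCA objective of Andrew et al. (2013) instead maximizes the sum of all canonical correlations through a whitening-based loss; the dimensions and data here are placeholders.

```python
import torch
import torch.nn as nn

def neg_corr(u, v, eps=1e-8):
    # Negative Pearson correlation between two 1-D projections.
    u = u - u.mean()
    v = v - v.mean()
    return -(u * v).sum() / (u.norm() * v.norm() + eps)

f = nn.Sequential(nn.Linear(6, 32), nn.Tanh(), nn.Linear(32, 1))  # transforms X
g = nn.Sequential(nn.Linear(8, 32), nn.Tanh(), nn.Linear(32, 1))  # transforms Y
opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-3)

X = torch.randn(200, 6)  # placeholder melodic features
Y = torch.randn(200, 8)  # placeholder motion features

for step in range(500):
    opt.zero_grad()
    loss = neg_corr(f(X).squeeze(1), g(Y).squeeze(1))
    loss.backward()
    opt.step()
```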

6.4.7 Template Matching

Template matching represents a series of methods from image processing that are used for tasks where a stimulus is chosen as a template to match to other query images. This ensures that the algorithm retains some robustness as a result of searching only for relevant template-related features; it helps to avoid such problems as noisy backgrounds. In Müller's book on information retrieval for motion data, a method is described for using motion data to create motion templates, which act as compact and explicit matrix representations of motion data for a search and retrieval algorithm (Müller, 2007).
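The following one-dimensional sketch illustrates the underlying idea, not Müller's matrix-based motion templates themselves: a template is slid across a longer series, and normalized cross-correlation scores each position. All names and the toy data are mine.

```python
import numpy as np

def match_template(signal, template):
    """Return the best-matching offset of a template in a 1-D series,
    scored by normalized cross-correlation (1.0 = perfect match)."""
    m = len(template)
    t = (template - template.mean()) / (template.std() + 1e-8)
    scores = []
    for i in range(len(signal) - m + 1):
        w = signal[i:i + m]
        w = (w - w.mean()) / (w.std() + 1e-8)
        scores.append(float(np.dot(w, t)) / m)
    return int(np.argmax(scores)), max(scores)

rng = np.random.default_rng(1)
series = rng.normal(size=300)
template = series[120:150].copy()   # plant the template inside the series
offset, score = match_template(series, template)
print(offset, round(score, 2))      # -> 120 1.0
```

Normalizing each window makes the match robust to offset and scale, which parallels the robustness to noisy backgrounds mentioned above.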

6.5 Summary

In this chapter, I present a summary of the methods used first in sound analysis, then in motion analysis, and finally in a joint analysis of sound and motion. Sound analysis deals with several levels, starting with storing the data as sound files, signal processing algorithms for pitch detection, and methods for contour representation analysis. I summarize motion capture technologies, focusing on infrared motion capture, and how the data are obtained and analyzed. To cross-correlate the data, I use a range of methods from statistical hypothesis testing to canonical correlation analysis. Other methods for comparing multimodal time-series data have also been explained for context. I have presented an overview of the computational tools and possibilities, and their contributions to this thesis. Using these technologies, the data sets are analyzed; a summary of the papers and a discussion of the results are laid out in the next chapter.


Chapter 7

Conclusions

Embodied cognition is all fine, but tell me whose body are we talking about?

- A well-meaning friend

7.1 Research Summary

Using the methods described above, four papers form the core of this thesis and are appended towards the end. Each of the four papers explores a dimension of melodic contour: verticality, motion metaphors, body use, and multi-feature correlational analysis. Some emergent findings from the data are also discussed in detail, relating to verticality, imagery, voice, body use, and cultural considerations. I hope the analysis can inform research in melodic perception, as well as the building of systems for indexing and generating music.

7.1.1 Paper I

Reference: Kelkar, T., & Jensenius, A. R. (2017). Exploring melody and motion features in “sound-tracings”. In Proceedings of the 14th Sound and Music Computing Conference (pp. 98-103). Aalto University.

Abstract

Pitch and spatial height are often associated when describing music. In this paper, we present results from a sound-tracing study in which we investigated such sound–motion relationships. Subjects were asked to move as if they were creating the melodies they heard, and their motion was captured with an infrared, marker-based camera system. The analysis focused on calculating feature vectors typically used in melodic contour analyses. We used these features to compare melodic contour typologies with motion contour typologies. This is based on using proposed feature sets that were made for melodic contour similarity measurements. We applied these features to the melodies and motion contours to establish whether there is a correspondence between the two, and to find the features that match the most. We found a relationship between vertical motion and pitch contour when evaluating according to features, rather than simply comparing contours.


Discussion

The first question we explored was whether models of melodic contour from previous research (Schmuckler, 1999, 2010) can be applied to movement data. The reasoning behind this is to test the assumption of pitch verticality in a full-body, free-hand sound-tracing study. We test three contour features that underlie many contour models: 1. signed interval distances, describing the directions of subsequent notes; 2. a vector containing the initial, final, highest, and lowest notes of the melody; and 3. signed relative distances, including the intervals in semitones and their directions. We perform a simple retrieval experiment for the third feature with k-nearest neighbors, and find that the recognition accuracy is not significant for any contour profile. Although these features appear in many contour models, vertical motion alone does not sufficiently explain the tracings drawn by the participants. In qualitative observations, we notice that people use several types of metaphorical representations instead. This is investigated in Paper II.
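As an illustration, the three features described above can be computed from a symbolic note sequence as follows; the function and the toy melody are mine, not the thesis code, and the exact feature definitions in Paper I may differ in detail.

```python
import numpy as np

def contour_features(pitches):
    """Three simple contour descriptors for a note sequence in semitones."""
    p = np.asarray(pitches, dtype=float)
    steps = np.diff(p)
    return {
        "signed_directions": np.sign(steps).astype(int).tolist(),  # feature 1
        "init_final_high_low": [p[0], p[-1], p.max(), p.min()],    # feature 2
        "signed_intervals": steps.tolist(),                        # feature 3
    }

print(contour_features([60, 64, 62, 67, 65]))
```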

7.1.2 Paper II

Reference: Kelkar, T., & Jensenius, A. R. (2017, June). Representation strategies in two-handed melodic sound-tracing. In Proceedings of the 4th International Conference on Movement Computing (p. 11). ACM.

Abstract

This paper describes an experiment in which subjects participated in a sound-tracing task to vocal melodies. They could move their hands freely in the air; their motion was captured using an infrared, marker-based system. We present a typology of the distinct strategies that the participants used to represent their perception of the melodies. These strategies appear to be ways to represent time and space through the finite motion possibilities of two hands moving freely in space. We observed these strategies and present their typologies through qualitative analysis. Then, we numerically verified the consistency of these strategies by conducting tests of significance between labeled and random samples.

Discussion

In this paper, we describe the main strategies used by most participants to express melodic contour in motion metaphors. As discussed in the previous paper, vertical motion is related to, but not sufficient to describe, participants' sound-tracings. This is because people use various metaphors to represent the changes they hear in melodic motion. By metaphors, I am referring to the use of bimanual movement as if the two hands were carrying or manipulating objects having different properties. We isolate six metaphorical strategies and demonstrate how these can be quantitatively differentiated from one another. We perform automated windowing of tracings in the motion data of each participant and, using statistical hypothesis testing, calculate how different metaphors can be identified.

7.1.3 Paper III

Reference: Kelkar, T., & Jensenius, A. R. (2018). Analyzing free-hand sound-tracings of melodic phrases. Applied Sciences, 8(1), 135.

Abstract

In this paper, we report on a free-hand motion capture study in which 32 participants ‘traced’ 16 melodic vocal phrases in the air with their hands, in two experimental conditions. Melodic contours are often thought of as correlating with vertical movement (up and down) in time—this was our initial expectation. We found an arch shape in most of the tracings, although this did not directly correspond to the melodic contour. Furthermore, the representation of pitch in the vertical dimension was but one of a diverse range of movement strategies used to trace melodies. Six different mapping strategies were observed; these strategies were quantified and statistically tested. The conclusion is that metaphorical representation is much more common than a ‘graph-like’ rendering in such a melodic sound-tracing task. Other findings include a clear gender difference in some of the tracing strategies, and the unexpected representation of melodies in terms of a small object for some of the north Indian music examples. The data also show a tendency of participants to move within a shared ‘social box’.

Discussion

In this article, we analyze Dataset 1 further in order to understand the distribution of the strategies of motion metaphors identified in the previous article. We find a clear arch shape when looking at the averages of the motion capture data, regardless of the general shape of the melody itself. This may support the motor constraint hypothesis that has been used to explain the similar arch-like shape of sung melodies; I explain this in more detail in Section 7.3. We find a gender difference for some of the strategies. This was most evident for representing small objects using both hands, which women performed more than men. The use of this strategy was also found to be more common for melodies from north Indian music, even among participants who had little or no exposure to this musical genre. This type of metaphor has previously been reported among practitioners of this genre (Clayton, 2001). The data show a tendency of moving within a shared ‘social box’. This may be thought of as an invisible space that people constrain their movements to, even without any exposure to the other participants’ tracings. We find an inverse relationship between the participants’ heights and the use of the extreme periphery of the body to represent the highest notes of the melodic stimuli.


7.1.4 Paper IV

Reference: Kelkar, T., Roy, U., & Jensenius, A. R. (2018). Evaluating a collection of sound-tracing data of melodic phrases. In Proceedings of the 19th International Society for Music Information Retrieval Conference, Paris, France (pp. 74-81).

Abstract

Melodic contour—the ‘shape’ of a melody—is a common way to visualize and remember a musical piece. Studies on sound-tracings investigate people’s ability to draw such melodic contours on paper or in the air. Understanding such relationships between music and motion is interesting from a cognitive perspective, and it can also be useful in the development of retrieval systems. The purpose of this paper was to explore the building blocks of a future ‘gesture-based’ melody retrieval system. We present a data set containing 16 melodic phrases from four musical styles, with a large range of contour variability. This is accompanied by full-body motion capture data of 26 participants performing sound-tracing to the melodies, which have multiple label sets. The data set was examined using canonical correlation analysis (CCA), and its neural network variant (deep CCA), to understand how melodic contours and sound-tracings correlate. The analyses reveal non-linear relationships between sound and motion. The link between pitch and verticality does not appear strong enough for complex melodies. We also found that descending melodic contours have the lowest correlations with tracings.

Discussion

In this paper, we analyze Dataset 1 from a multimodal retrieval approach. This method has been used in previous studies to analyze sound-tracings (Nymoen et al., 2013). We introduce a new feature that calculates the longest run lengths of individual tracings to improve the correlations of the system. We test different category labels for melodies; genre labels show the least agreement and the least improvement. With all melody and all motion features, we find an overall correlation of 0.44 with deep CCA, for both the longest-ascend and longest-descend features. This supports the view that non-linearity is inherent to tracings, which will be explained further in Section 7.3.
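A minimal sketch of what such a run-length feature could look like, assuming the run is counted in samples of the vertical position; the exact definition used in Paper IV may differ.

```python
import numpy as np

def longest_runs(series):
    """Lengths of the longest strictly ascending and descending runs."""
    steps = np.sign(np.diff(np.asarray(series, dtype=float)))
    best = {1: 0, -1: 0}
    for direction in (1, -1):
        run = 0
        for s in steps:
            run = run + 1 if s == direction else 0
            best[direction] = max(best[direction], run)
    return {"longest_ascend": best[1], "longest_descend": best[-1]}

print(longest_runs([0.0, 0.2, 0.5, 0.4, 0.3, 0.1, 0.2]))
# -> {'longest_ascend': 2, 'longest_descend': 3}
```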

7.2 Discussion

In Chapter 1, I established that the key research question of the thesis was to understand the role of embodiment in melodic contour perception, using sound-tracing as the experimental methodology. I elaborate on the findings below.


RQ1. How do listeners represent melodic motion through body movement, and how can we analyze motion representations of melodic contours?

From the experiments conducted, we find that both expert musicians and non-experts are able to trace melodies that they hear. The relative ease of the task points to the fact that representing melodic stimuli through visual and gestural imagery does not require specialized training or access to musical knowledge. Most participants reported that the task became rather intuitive as the experiment progressed, even if they had anticipated it being difficult. This confirms the findings of previous studies, as well as the hypothesis that the connection between visual or motor imagery and melodic stimuli is quite direct. All participants were able to perform every melody; outliers in the data sets are purely the results of marker dropouts or other technical difficulties.

In Paper III, while doing qualitative analysis of the data, we found differences in body use that had little to do with the experiments themselves. That body height influences how participants trace melodies is a new finding: height correlates negatively with the perceived pitch-height of the highest musical note. Previous sound-tracing studies have often looked at tracings on a tablet, thus missing the context associated with representing melodic movement using the whole body.

RQ2. What are the characteristics and applications of motion representations of melodic contour?

An important characteristic of sound-tracings in this context is the representation of melodies as metaphors, where the movement resembles an interaction with imaginary objects that act as stand-ins for the perceived movement in the music.

In Papers I and II, I analyze and quantify metaphors found in melodic sound-tracings. Specifically, in Paper I, I compare symbolic contour typologies and motion features in sound-tracings to see if vertical motion justifiably explains the shape of contour tracings. Evidently, when the paradigm of melodic representation moves from vertical placement to motion representation, contour representations change.

Although much of the literature points to correlations between melodic pitch and vertical movement, our findings reveal a much more complex picture. For example, relative pitch height appears to be more important than absolute pitch height. People seem to think about vocal melodies as actions, rather than interpreting pitch purely in one (vertical) dimension over time. When melodies are traced through an allocentric representation by the listening body, pitch-height representation matters much less than the literature on contour features would suggest. Therefore, contour features cannot be extracted merely from cross-modal comparisons of two data sets. We propose that other strategies can be used for the analysis of contour representations, but this will have to be developed in future research.


One interesting finding is that there are gender-based differences in participants’ sound-tracing strategies. Women seem to show a greater diversity of strategies in general, and they use object-like representations more often than do men.

RQ3. How can we test if motion related to melodic contours is consistent: a) within participants, and b) across participants?

In visual inspection of the data set, we find that body use differs significantly within participants, depending upon a variety of factors. To study this, we perform a correlational analysis of many features with each other: melody labels, contour profile, genre, presentation (recorded vs. resynthesized melodies), and the presence of motivic repetition and vibrato.

In Paper IV, we study the correlations among all participants for the above-mentioned categories of stimuli. Through these patterns, we find that the highest agreement was for the representation of vibrato, motivic repetition, and melody labels. This means that, within participants, these are the features that are most similarly traced.

In Paper III, we also present a frequency diagram of the representation strategies used by participants for tracing melodies, and find some general trends, with certain melodies represented more often by certain strategies.

The control of various melodic features seems to follow a hierarchy in most participants. Vowel changes are most often represented by palm shapes, whereas melodic leaps are more often represented at the shoulder joint level. The scale at which melodies are represented relative to the body is also consistent across participants.

In Appendix B, I explain the Python functions created to analyze the features of the melody–motion pairs.

RQ4. Can we build a system to retrieve specific melodies based on sound-tracings?

In Paper IV, I present the beginnings of a CCA-based system to retrieve melodies from sound-tracings. I have tried to integrate the findings from the empirical experiments with an experiment on melodic indexing and contour retrieval through body movement. Two versions of canonical correlation analysis are used to demonstrate the highly correlating factors between melodies and their respective motions.

Previous studies in this area used shorter sound stimuli or synthetically generated isochronous music samples. The strength of this particular study is in its use of sung melodies, and in that the performed tracings are not iconic or symbolic, but spontaneous. As such, the data set might bring us closer to understanding contour perception in melodies. Hopefully the data set will prove useful for pattern mining—it presents the community with novel multimodal possibilities, and could be employed in user-centric retrieval interfaces.


7.3 General Discussion

This thesis has tied experimental methods in embodied music cognition to computational analysis of melody. Within computational musicology, methods that use sheet music or symbolic data exclusively limit us to certain musical styles and to kinds of analysis pertaining to inferences drawn from tonality, and so on. There is also a significant bias toward working with music data sets in generative and interactive music, as these data sets are more available. Despite the fact that the most extensive data sets are available in MIDI and symbolic notation, I have specifically addressed the limitations of studying contour perception using symbolic methods in Paper I of this thesis.

I will now discuss four main aspects of the findings in detail: 1. verticality, 2. imagery, 3. voice, and 4. the body.

7.3.1 Verticality

Arch-Shape In Paper III, we plot the average vertical trajectory of each melody, and discover that the average vertical trajectories all resemble an arch shape. It has been speculated (Savage et al., 2017) that an arch shape supports the motor constraint hypothesis: the effort required to produce songs follows an arch-shaped trajectory. It is interesting to note that this shape is found even in purely ascending or purely descending melodies.

Having said this, there are differences between the contour and acceleration profiles of what is drawn during the ascent and the descent. A significant thing to note here is that this arch is observed only for the vertical profile of contour tracing. Although the vertical profile is where we would expect the most motion corresponding to contour, this is not what we actually see in the data.

Extreme Periphery How close to the body do traced melodies lie? We find that the use of the space around the body is closely related to pitch height. However, we find an inverse relationship between the participants’ heights and the use of the extreme periphery of the body to represent the highest notes of the melodic stimuli. It is interesting to note that this region around our bodies is reserved for the very highest notes, and some participants also reach beyond this region, for example by rising onto their toes. Similarly, melodies seldom extend below the center of mass; there are very few examples in the data of participants stooping down to represent contour.

Relative vs Absolute Representation For sound-tracings, relative pitch height appears to be more important than absolute pitch height. This means that the same height in space might be used to represent two very different pitches, depending upon the ambit and the context of the melodic fragment being traced. This result highlights our ability to zoom in on context. It could also tie back into the arch-shaped average tracings because, in essence, every sung phrase then occupies the same procession of events.


Research points towards gesture and body movement encoding an ‘effort’ dimension. As we ascend in our own pitch range, we need more effort to reach higher, and this effort can be perceived both in the sound and in the movement. Representing relative as opposed to absolute pitch might be related to the perception of effort in a single phrase.

Non-Linearity In Paper IV, we present a CCA of various features of the melodies. Nymoen et al. (2013) previously highlighted non-linearity in sound-tracing features. This points towards two things: 1. features take precedence over one another, given that sound-tracing is already an act of dimensionality reduction, and 2. context-sensitive features mean that absolute correlational analysis does not yield good results.

When trying to model tracings, a non-linear network gave better results than a vanilla version, as described in Paper IV.

Contour Properties The above results indicate that verticality is not the most important aspect of contour tracings, at least in sound-tracings. Previous studies have shown that verticality is just one dimension of pitch-mapping for us, and that variations across languages and cultures exist (Eitan and Timmers, 2010). In observations of tracings, we find that the general arch shape also functions as phrase-termination annotation—the hands are placed at a location that represents the beginning of the phrase, and brought down once the phrase is completed. Contour features are thus represented between a sharp initial ascent and a sharp final descent in the movement. In order to focus on contour features, acceleration and jerk plots are more useful to analyze.

7.3.2 Imagery

Experience with dance, sign language, or musical traditions with simultaneous musical gestures (such as conducting) all contributes to the interpretation of music as motion. Had we chosen participants from any one pool of music–motion experience, it may have been easier to model sound-tracings; but despite the diversity of participants’ experiences, there is a considerable amount of consistency between subjects. Research has also shown that the representation of time itself depends upon linguistic factors—speakers of different languages conceive the physical representation of a ‘timeline’ very differently from each other (Boroditsky, 2000; Fuhrman et al., 2011). The direction in which time seems to travel around the speaker could influence how melodies traveling through time are represented and shaped.

Canvas A ‘timeline’ relative to the body is implicitly imagined while performing this task. Some might adopt the strategy of a timeline ahead of them, much like a sketchbook or a graph, while others might imagine a cylindrical canvas around their bodies. This allocentric placement of melody on a canvas in front of the body is consistent within each participant’s tracing data, but varies across participants.

Metaphor The idea of movement as metaphor, in the context of this discussion, borrows a lot from conceptual metaphors in speech gestures. The idea is that we refer to conceptual objects as if they had spatial properties and locations. We might say ‘this idea’, pointing to something as if it were present in our hands. In the qualitative observations of the study, it became clear that melodic movement could easily be represented in this form.

In Papers I and II, I have delved deeper into the gestural metaphors used to represent the sense of movement in melody. The use of metaphorical gestural representations, as shown, is not unusual in the sound-producing movements of some vocal styles (Paschalidou et al., 2016; Pearson, 2016).

According to the theory of gestural affordances of musical sound (Godøy, 2010), several gestural representations can exist for the same sound, but there is a limit to how much they can differ. In other words, gestural representations might vary a lot, but that does not mean that any gestural response can perfectly fit a single stimulus. Our data support this idea of a number of possible and overlapping action strategies. Several spatial and visual metaphors are used by a wide range of people, such as the use of objects with various properties—rigidity, elasticity, and so on.

Looking at the features of Melody 4, the intervals steadily descend, then ascend, and finally come down again in the same step lengths. This arguably resembles an object slipping smoothly along a slope, and could be a probable reason for the overwhelming representation of this particular melody as an object. In future studies, it would be interesting to see whether we can recreate this mapping in other melodies, or understand how perceived melodic motion can resemble the sounds generated by the physical motion of objects around us.

7.3.3 Voice

We expected musical genre to have some impact on the results. For example, given that the western vocalises are sung in a higher pitch range than the rest of the genres in this data set, people, as expected, on average represent western vocalise tracings spatially higher than tracings in the other genres. In the acceleration plots for each melody, this difference is stark.

Motif As explained in the appendix, some melodies had a repeating motif. In Paper IV, the CCA analysis reveals that one of the best performing indices of the network is the distinction between phrases containing motifs and those not containing motifs. Motivic repetition reflected in tracings is an indication of consistency in the imagery of contour.

Vowels Participants were tasked with imagining their movements as sound-producing, which can have some connotations of ‘control’ through movement embedded in it. We find that, typically, palm shapes are used to identify changes in the vowels of the melody, whereas pitch leaps are represented in joints closer to the body, such as the shoulders. When vowels were absent, as in the case of the resynthesized melodies, the formant shape is interpreted as being small and narrow.

Vibrato We found that melodies with the maximum amount of vibrato (Melodies 14 and 16 in Figure 5.2) are represented with the largest changes in acceleration in the motion capture data. This implies that although the pitch deviation in this case is not so large, the perception of a moving melody is a much stronger influence on contour shape compared to melodies that have larger pitch changes. It could be argued that both Melodies 4 and 16 contain motivic repetitions that cause this pattern. However, repeating motifs are as much a part of Melodies 6 and 8 (joik). The effect of the vowels used in these melodies can thus be negated, due to the similarities between the tracings for original and resynthesized stimuli. There are some melodies that stand out for the use of certain hand strategies. Melody 4 (north Indian) is, curiously, primarily represented as a small object.

Phrasing What constitutes a melodic phrase, and how to define the boundary conditions of melodic phrases, is a problem with many possible solutions. If we were to use sound-tracing as an annotation system for not only non-western, but also non-standard or non-notatable musical pieces, we would be able to create a model of melodic phrasing that is more related to how we listen than to rules about tonality and pitch relationships.

7.3.4 Body

It is worth noting several limitations of the current experimental methodology and analysis. Any laboratory study of this kind presents subjects with an unfamiliar and non-ecological environment. The results are presumably also, to a large extent, influenced by the habitus of body use in general—the direction of written scripts in the native languages of the participants, their familiarity with scientific graphs, and so on.

The extreme periphery of the body, reached in this case by extending from the shoulders, seems to be used less frequently by taller participants. In Paper III, I call this effect a ‘social box’, wherein people try to occupy the same space regardless of the differences in their own body dimensions.

Gender We found some differences between the tracings of men and women. Women had a greater diversity of hand strategies in general, as demonstrated in Paper III. Moreover, some strategies are used more frequently by women than by men, most interestingly that of representing the movement in melodies as an imaginary small object. It is unclear why this might be, and the question is beyond the scope of this project.


7.3.5 Cultural Considerations

Despite using melodic stimuli from different musical cultures, the objective was not to find cultural differences or musical universals. The research presents embodied cognition as an approach that can help situate melodic perception in the body. I have tried to address the why of studying contour typologies earlier, by examining what makes them interesting for study.

I would like to borrow the why of studying contour typologies to further discuss the intentions of data collection practices in computational musicology, which might have underpinnings that we forget to consider while analyzing the data. The fields of computer science and systematic musicology are currently diversifying, and there is a growing interest in understanding ‘other’ cultures. In an ever-growing, more frequently migrating world, our need to apply theory to every kind of practice is urgent. This means that the links between methods and models, and their origins and history, weaken over time as technology advances. The original motivations for building systems for musical indexing and melodic modelling are very different from the contexts in which the data are currently used—primarily to find cultural differences or musical universals.

In his 1983 book, The Study of Ethnomusicology, Nettl reminds us that the study of musical universals was popular among nineteenth-century musicologists, such as Wilhelm Wundt; this line of inquiry made a comeback around the 1970s–80s. The idea of looking for universals persists strongly in computational musicology, which offers several tools to understand the ‘feature dispersion’ of the musics of the world.

The methodical study of melodic contour extends further back in history than modern computational methods, which model pitch contours in different ways and apply the results to generalize and generate music. Early researchers noted that melodic contour is the most stable element in differentiating melodic identities, styles, and canons (Dowling, 1972). Melodic contour typologies have also been proposed in several of these early works, along with generalized descriptions of contour behaviors. Much of the early literature on melodic contour analyzed it using cultural tropes such as ‘white’ or ‘primitive’ music, as is seen in many works in ethnomusicology (Nettl, 1956; Sachs, 2012; Roberts, 1922). Curt Sachs’s ideas about melodic contours and melodic cascades in The Wellsprings of Music (Sachs, 2012) have since fallen out of fashion; the book did not age well because of the many misguided cultural meta-theories in the writing. Susanne Youngerman (1974) critiqued his work in a detailed article that I paraphrase in this section. Youngerman’s article still serves as a concise warning against using culture as a broad term to mean different methods of aesthetic appreciation. A longer description of various challenges can be found in Appendix A.

In studies of cultural differences in early ethnomusicology, most analyses were conducted with the help of notated music, whether or not notated music applied to the musical form or musical culture in question. Contour typologies of all kinds are based on fixed note positions, even if the instrument—for example, the voice—does not produce notes in that manner. This aspect of the study of melodic contours stays with us even today. Through the methods in this thesis, I have demonstrated that looking at melodic contours from the perspectives of sound signal and body motion can provide insight into various perceptual patterns associated with melodic listening.

While trying to understand the history of contour classification, we cannot adopt a perspective that elides what contour typologies have often been developed for—to essentialize cultural-melodic phenotypes through qualitative methods. I would like to state that developing contour typologies is not a goal of the research conducted for this thesis; the aim is instead to see how contours are understood and represented as shapes.

7.4 Impact and Future Work

The studies conducted for this thesis could inform several areas: 1. research in melody and prosody perception, 2. search and retrieval systems for music, and 3. interactive music generation.

7.4.1 Research in Melody and Prosody Perception

Treating sound-tracings as annotation material provides a new way of thinking about music analysis and studies on segmentation and phrasing. Through the experiments conducted, I have shown that people demonstrate a remarkable amount of consistency in their spatial and motor representations of music and sonic objects.

Experiments akin to the ones conducted for this thesis could inform our understanding of melodic phrasing, and of the differences in people’s phrasing boundaries in various musical contexts. This could help us build a more sophisticated understanding of pitch perception in musical or melodic contexts.

7.4.2 Search and Retrieval Systems

Several projects on melodic content retrieval using intuitive and multimodal representations of musical data have been developed for use on different data sets. The oldest example is the 1975 ‘Directory of Tunes and Musical Themes’, in which the author used a simplified contour notation method involving letters to denote contour directions, creating a dictionary of musical themes in which one may look up a tune they remember (Parsons, 1975). The description of the book suggested that people would always be able to look up a tune stuck in their head, so long as they could describe the contour directions of the melody.
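Parsons' letter notation is simple enough to sketch in a few lines; this toy function is mine, but the coding scheme (u = up, d = down, r = repeat, '*' for the first note) is Parsons' own.

```python
def parsons_code(pitches):
    """Parsons (1975) contour code for a sequence of pitches."""
    code = "*"
    for prev, curr in zip(pitches, pitches[1:]):
        code += "u" if curr > prev else "d" if curr < prev else "r"
    return code

# Opening of 'Twinkle, Twinkle, Little Star': C C G G A A G
print(parsons_code([60, 60, 67, 67, 69, 69, 67]))  # -> *rururd
```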

Parsons’ model has been adopted for melodic contour retrieval by Musipedia (Irwin, 2008). Another system has been proposed in the recent project SoundTracer, in which the motion of a user’s mobile phone is used to retrieve tunes from a music archive (Lartillot, 2018). A critical difference between these approaches is how they handle contour and musical mappings, especially variations in time scales and time representations. Most of these methods do not have ground-truth models of contours, and instead use one of several kinds of mappings, each with its own assumptions.

7.4.3 Interactive Music Generation

Interactive systems to generate music—filling in the gaps, as it were, from skeletal notions of melodic contour—are popular. Typically, the interface involves a way for a user to generate a contour, onto which generative models are mapped to give a possible musical solution. Several studies on embodied music cognition investigate the idea of drawing or tracing as a method of understanding the mappings of shapes and lines to musical material. Common modes of interaction for explicitly melodic interfaces include tracing, drawing, sketching, and hand-waving for contour depiction. Mappings from these contour depictions take various forms.

Generated melodies are usually harder to evaluate for goodness of fit in music generation than, for example, harmonic adherence to a particular style. Depending on the task at hand, melodic generation may or may not require continuity, development, and some idea of self-similarity—a musical piece generally has a sense of identity maintained through variations that stay within a thematic range. Although melodic generation spans a large range of mathematical, parametric, and learning methods, it is generally constrained to the symbolic domain. The majority of generation systems are thus symbolic, and inapplicable to musical cultures and styles where either data sets for symbolic music do not exist, or symbolic generation is insufficient to capture the musical style.

Typically, gesture–melody interfaces start with some assumption of melodic grammars, goals, strategies, and styles, while gestural interfaces start with some baseline measurements of user actions. There is a mid-level representation of the input and output, at which point mapping might be carried out. The goals of some of these instruments might be the real-time generation of melodies, or of complete songs, based on one or a few factors mapped from the gesture. There are also differences in generation based on whether MIDI output is preferred to continuous-pitch output. Continuous and particularly on-the-fly interfaces often do not even use a melodic model, preferring instead to map a range of inputs to pitch and timbre characteristics, which places the interface control fully in the hands of the user. There is, therefore, a spectrum from fully generative systems with little mapping control to fully mapped systems with no generation model, between which gesture–melody interfaces lie.
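A minimal sketch of the fully mapped end of this spectrum, with no generation model at all: a drawn contour, given as normalized vertical positions, is quantized onto a scale to produce MIDI pitches. The scale, range, and names are arbitrary choices of mine.

```python
import numpy as np

C_MAJOR = [0, 2, 4, 5, 7, 9, 11]  # scale degrees in semitones

def contour_to_midi(y, low=60, octaves=2):
    """Map normalized contour heights onto scale tones above MIDI note `low`."""
    degrees = [low + 12 * o + d for o in range(octaves) for d in C_MAJOR]
    y = np.asarray(y, dtype=float)
    y = (y - y.min()) / (y.max() - y.min() + 1e-9)  # rescale to [0, 1]
    idx = np.round(y * (len(degrees) - 1)).astype(int)
    return [degrees[i] for i in idx]

print(contour_to_midi([0.0, 0.4, 0.9, 0.6, 0.1]))
```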

The arrival of Magenta’s pre-trained melodic generation models has made the mid-level of melodic generation more accessible to several post-Magenta papers. A proof of concept for exploring this was recently published as part of Google’s Magenta project offshoots (Donahue et al., 2018); contour enumeration happens through a series of press buttons, reducing the 88 piano keys to eight buttons, and mapping improvisation using generation on top of these contours. A system for sketching melodies on paper is described in Kitahara et al. (2017), where people use a tablet to draw melodic contours on a grid that resembles the piano-roll format. TrAP uses a tablet-based interface to map four types of contour profiles by segmentation (Roy et al., 2014). The MicroJam application by Martin et al. includes a sketch-based interface to input melodies and rhythm; the application can also be used for collaborative performances and modeling (Martin and Tørresen, 2017).

Hé is an interface for generating melodic material based on brush calligraphy (Kang and Chien, 2010). The system uses computer vision techniques to analyze brush-stroke features and ink thickness, and maps them onto a pentatonic scale. The Melodic Brush interface by Huang et al. is another example of the use of calligraphy inputs, with the specific intention of generating Chinese music, using a Kinect depth camera (Huang et al., 2012).

Interactive interfaces for generating music very often use multimodal and movement-based methods. Hopefully the material in this thesis provides new ideas in this area. Extending some of the work in this thesis is also my personal goal for future work.

7.4.4 Future Work

The analyses of Dataset 2 and Dataset 3 have not been published within the time available; in the future, I plan to analyze these data and publish the results. I also plan to release the results of the full-body data collected. The tools and notebooks released as part of the thesis will also be improved to accommodate various use cases of motion capture data for musical motion. Similarly, I am in the process of exploring the movement of body parts other than the hands, and what we can infer from, for example, the mirroring of hand movements with head movements, and so on.

Although canonical correlation is a way to model and understand correspondences between melody and movement, there is no gold standard, or right answer, for what movement should be performed and how well it fits. When we see a cartoon film in which music accompanies motion, we only know that some features in the sound and movement become naturally corresponded for us. In order to bring this out in the sound-tracing paradigm, I wish to create a system to synthesize melody–motion pairs by training a network on this data set, and to conduct a study in which users evaluate system-generated music–motion pairs in a forced-choice paradigm.

Speech prosody and its relationship with musical melody is also an area that I have touched upon several times, although not in the experiments. In the future, I would like to incorporate some of the findings related to metaphorical movement back into the analysis of speech gesture.


Bibliography

C. R. Adams. Melodic contour typology. Ethnomusicology, pages 179–215, 1976.

G. Andrew, R. Arora, J. Bilmes, and K. Livescu. Deep canonical correlation analysis. In International Conference on Machine Learning, pages 1247–1255, 2013.

M. W. Andrews, W. J. Dowling, J. C. Bartlett, and A. R. Halpern. Identification of speeded and slowed familiar melodies by younger, middle-aged, and older musicians and nonmusicians. Psychology and Aging, 13(3):462, 1998.

B. Arons. A review of the cocktail party effect. Journal of the American Voice I/O Society, 12(7):35–50, 1992.

J.-J. Aucouturier and E. Bigand. Mel Cepstrum & Ann Ova: The difficult dialog between MIR and music cognition. In Proceedings of the 13th Conference of the International Society for Music Information Retrieval, pages 397–402. Citeseer, 2012.

A. Baddeley. The episodic buffer: a new component of working memory? Trends in Cognitive Sciences, 4(11):417–423, 2000.

A. D. Baddeley and G. Hitch. Working memory. In Psychology of Learning and Motivation, volume 8, pages 47–89. Elsevier, 1974.

P. C. Bagshaw, S. Hiller, and M. A. Jack. Enhanced pitch tracking and the processing of f0 contours for computer aided intonation teaching. In Third European Conference on Speech Communication and Technology, 1993.

M. Baroni and C. Jacoboni. Proposal for a grammar of melody: The Bach chorales. 1978.

A. Barre. Mokka: Motion kinetic and kinematic analyzer, 2013. URL http://biomechanical-toolkit.github.io/mokka/.

W. L. Bell, D. L. Davis, A. Morgan-Fisher, and E. D. Ross. Acquired aprosodia in children. Journal of Child Neurology, 5(1):19–26, 1990.

N. Gang, B. Kaneshiro, J. Berger, and J. P. Dmochowski. Decoding neurally relevant musical features using canonical correlation analysis. In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017.

I. Biederman. Recognition-by-components: a theory of human image understanding. Psychological Review, 94(2):115, 1987.


I. Biederman. Psychophysical and neural correlates of the phenomenology of shape. Handbook of Experimental Phenomenology: Visual Perception of Shape, Space and Appearance, pages 415–436, 2013.

C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg, 2006. ISBN 0387310738.

G. Bishop, G. Welch, and B. D. Allen. Tracking: Beyond 15 minutes of thought. SIGGRAPH Course Pack, 11, 2001.

R. M. Bittner, J. Salamon, J. J. Bosch, and J. P. Bello. Pitch contours as a mid-level representation for music informatics. In Audio Engineering Society Conference: 2017 AES International Conference on Semantic Audio, Jun 2017. URL http://www.aes.org/e-lib/browse.cfm?elib=18756.

R. Bod. Memory-based models of melodic analysis: Challenging the Gestalt principles. Journal of New Music Research, 31(1):27–36, 2002.

D. Boer and R. Fischer. Towards a holistic model of functions of music listening across cultures: A culturally decentred qualitative approach. Psychology of Music, page 0305735610381885, 2011.

P. Boersma. Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. In Proceedings of the Institute of Phonetic Sciences, volume 17, pages 97–110. Amsterdam, 1993.

C. Bohak and M. Marolt. Calculating similarity of folk song variants with melody-based features. In Proceedings of the 10th Conference of the International Society for Music Information Retrieval, pages 597–602, 2009.

D. Bolinger and D. L. M. Bolinger. Intonation and its Parts: Melody in Spoken English. Stanford University Press, 1986.

L. Boroditsky. Metaphoric structuring: Understanding time through spatial metaphors. Cognition, 75(1):1–28, 2000.

P. Bourgine and A. Lesne. Morphogenesis: Origins of Patterns and Shapes. Springer Science & Business Media, 2010.

E. Bozkurt, Y. Yemez, and E. Erzin. Multimodal analysis of speech and arm motion for prosody-driven synthesis of beat gestures. Speech Communication, 85:29–42, 2016.

A. S. Bregman. Auditory Scene Analysis: The Perceptual Organization of Sound. MIT Press, 1994.

M. R. Bregman, A. D. Patel, and T. Q. Gentner. Songbirds use spectral shape, not pitch, for sound pattern recognition. Proceedings of the National Academy of Sciences, 113(6):1666–1671, 2016.


B. H. Bronson. Melodic stability in oral transmission. Journal of the International Folk Music Council, 3:50–55, 1951.

C. Buteau and G. Mazzola. From contour similarity to motivic topologies. Musicae Scientiae, 4(2):125–149, 2000.

E. Cambouropoulos. Melodic cue abstraction, similarity, and category formation: A formal model. Music Perception: An Interdisciplinary Journal, 18(3):347–370, 2001. ISSN 0730-7829. DOI: 10.1525/mp.2001.18.3.347. URL http://mp.ucpress.edu/content/18/3/347.

E. Cambouropoulos. Musical parallelism and melodic segmentation. Music Perception: An Interdisciplinary Journal, 23(3):249–268, 2006.

B. Caramiaux and A. Tanaka. Machine learning of musical gestures. In Proceedings of the 13th International Conference on New Interfaces for Musical Expression, pages 513–518, 2013.

B. Caramiaux, F. Bevilacqua, and N. Schnell. Towards a gesture-sound cross-modal analysis. In International Gesture Workshop, pages 158–170. Springer, 2009.

G. Châtelet. Les enjeux du mobile: mathématique, physique, philosophie. 1993.

T. Cheng, S. Fukayama, and M. Goto. Comparing RNN parameters for melodic similarity. Proceedings of the 19th International Society for Music Information Retrieval Conference, pages 1–8, 2018. URL http://staff.aist.go.jp/m.goto/.

C. Cheung, L. S. Hamilton, K. Johnson, and E. F. Chang. The auditory representation of speech sounds in human motor cortex. eLife, 5:e12577, 2016.

A. Clark. An embodied cognitive science? Trends in Cognitive Sciences, 3(9):345–351, 1999.

E. Clarke. Meaning and the specification of motion in music. Musicae Scientiae, 5(2):213–234, 2001.

M. Clayton. Time in Indian Music: Rhythm, Metre, and Form in North Indian Rag Performance. Oxford University Press, 2001.

M. Clayton and L. Leante. Embodiment in music performance. 2013.

M. Clayton, R. Sager, and U. Will. In time with the music: the concept of entrainment and its significance for ethnomusicology. In European Meetings in Ethnomusicology, volume 11, pages 1–82. Romanian Society for Ethnomusicology, 2005.

M. Clynes. Sentics: The Touch of Emotions. Anchor Press, 1977.


The Schøyen Collection. Tibetan yang-yig notation: Yang chants with Tibetan graphic music notation, 2019. URL https://www.schoyencollection.com/music-notation/graphic-notation/tibetan-yang-yig-ms-5280-1.

J. R. Cowdery. The Melodic Tradition of Ireland. Kent State University Press, 1990.

L. L. Cuddy. On hearing pattern in melody. Psychology of Music, 10(1):3–10, 1982.

M. E. Curtis and J. J. Bharucha. Memory and musical expectation for tones in cultural context. Music Perception: An Interdisciplinary Journal, 26(4):365–375, 2009.

S. Dalla Bella, I. Peretz, and N. Aronoff. Time course of melody recognition: A gating paradigm study. Perception & Psychophysics, 65(7):1019–1028, 2003.

A. de Cheveigné. Pitch perception models. In Pitch, pages 169–233. Springer, 2005.

A. de Cheveigné and H. Kawahara. YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4):1917–1930, 2002.

E. De Freitas and N. Sinclair. Diagram, gesture, agency: Theorizing embodiment in the mathematics classroom. Educational Studies in Mathematics, 80(1-2):133–152, 2012.

S. De Laubier. The meta-instrument. Computer Music Journal, 22(1):25–29, 1998.

D. Deutsch, R. Lapidis, and T. Henthorn. The speech-to-song illusion. The Journal of the Acoustical Society of America, 124(4):2471–2471, October 2008. ISSN 0001-4966.

D. Deutsch, T. Henthorn, and R. Lapidis. Illusory transformation from speech to song. The Journal of the Acoustical Society of America, 129(4):2245–2252, 2011.

C. Donahue, I. Simon, and S. Dieleman. Piano Genie. arXiv preprint arXiv:1810.05246, 2018.

W. J. Dowling. Recognition of melodic transformations: Inversion, retrograde, and retrograde inversion. Perception & Psychophysics, 12(5):417–421, 1972.

W. J. Dowling. Scale and contour: Two components of a theory of memory for melodies. Psychological Review, 85(4):341, 1978.

J. S. Downie. Music information retrieval. Annual Review of Information Science and Technology, 37(1):295–340, 2003.


T. Eerola and M. Bregman. Melodic and contextual similarity of folk song phrases. Musicae Scientiae, 11(1 suppl):211–233, 2007.

T. Eerola, P. Toiviainen, and C. Krumhansl. Real-time prediction of melodies: continuous predictability judgements and dynamic models. In Proceedings of the 7th International Conference on Music Perception and Cognition, pages 473–476. Causal Productions, Adelaide, SA, 2002.

T. Eerola, T. Himberg, P. Toiviainen, and J. Louhivuori. Perceived complexity of Western and African folk melodies by Western and African listeners. Psychology of Music, 34(3):337–371, 2006.

Z. Eitan and R. Timmers. Beethoven's last piano sonata and those who follow crocodiles: Cross-domain mappings of auditory pitch in a musical context. Cognition, 114(3):405–422, 2010.

I. Fernandez-Prieto, C. Spence, F. Pons, and J. Navarra. Does language influence the vertical representation of auditory pitch and loudness? i-Perception, 8(3):2041669517716183, 2017.

M. Ferrand, P. Nelson, and G. Wiggins. Memory and melodic density: a model for melody segmentation. In Proceedings of the XIV Colloquium on Musical Informatics (XIV CIM 2003), pages 95–98, 2003.

B. W. Frankland, S. McAdams, and A. J. Cohen. Parsing of melody: Quantification and testing of the local grouping rules of Lerdahl and Jackendoff's A Generative Theory of Tonal Music. Music Perception: An Interdisciplinary Journal, 21(4):499–543, 2004.

P. Fuchs. Oxford Handbook of Auditory Science: The Ear. OUP Oxford, 2010.

O. Fuhrman, K. McCormick, E. Chen, H. Jiang, D. Shu, S. Mao, and L. Boroditsky. How linguistic and cultural forces shape conceptions of time: English and Mandarin time in 3D. Cognitive Science, 35(7):1305–1328, 2011.

J. J. Gibson. The concept of the stimulus in psychology. American Psychologist, 15(11):694, 1960.

K. H. Glette, A. R. Jensenius, and R. I. Godøy. Extracting action-sound features from a sound-tracing study. 2010.

R. I. Godøy. Imagined action, excitation, and resonance. Musical Imagery, 239:52, 2001.

R. I. Godøy. Motor-mimetic music cognition. Leonardo, 36(4):317–319, 2003.

R. I. Godøy. Gestural affordances of musical sound. Musical Gestures: Sound, Movement, and Meaning, pages 103–125, 2010.

R. I. Godøy. Sonic object cognition. In Springer Handbook of Systematic Musicology, pages 761–777. Springer, 2018.


R. I. Godøy and A. R. Jensenius. Body movement in music information retrieval. In 10th International Society for Music Information Retrieval Conference, 2009.

R. I. Godøy and M. Leman. Musical Gestures: Sound, Movement, and Meaning. Routledge, 2010.

R. I. Godøy, E. Haga, and A. R. Jensenius. Playing “air instruments”: mimicry of sound-producing gestures by novices and experts. In International Gesture Workshop, pages 256–267. Springer, 2005.

R. I. Godøy, E. Haga, and A. R. Jensenius. Exploring music-related gestures by sound-tracing: A preliminary study. 2006.

R. I. Godøy. Knowledge in music theory by shapes of musical objects and sound-producing actions. In Music, Gestalt, and Computing, pages 89–102. Springer, 1997.

R. I. Godøy. Cross-modality and conceptual shapes and spaces in music theory. Music and Signs, pages 85–98, 1999.

N. Goodman. Languages of Art: An Approach to a Theory of Symbols. Hackett Publishing, 1968.

A. Gritten and E. King. Music and Gesture. Ashgate Publishing, Ltd., 2006.

A. Gritten and E. King. New Perspectives on Music and Gesture. Ashgate Publishing, Ltd., 2011.

H. I. Hair. Discrimination of tonal direction on verbal and nonverbal tasks by first grade children. Journal of Research in Music Education, 25(3):197–210, 1977.

A. R. Halpern and R. J. Zatorre. When that tune runs through your head: a PET investigation of auditory imagery for familiar melodies. Cerebral Cortex, 9(7):697–704, 1999.

P. Hansen. Vocal learning: its role in adapting sound structures to long-distance propagation, and a hypothesis on its evolution. Animal Behaviour, 1979.

O. Herbort and M. V. Butz. Too good to be true? Ideomotor theory from a computational perspective. Frontiers in Psychology, 3:494, 2012.

D. J. Hermes. Measuring the perceptual similarity of pitch contours. Journal of Speech, Language, and Hearing Research, 41(1):73–82, 1998.

M. Hood. The Ethnomusicologist, volume 140. Kent State Univ Pr, 1982.

M. X. Huang, W. W. W. Tang, K. W. K. Lo, C. K. Lau, G. Ngai, and S. Chan. MelodicBrush. In Proceedings of the Designing Interactive Systems Conference (DIS '12), page 418, July 2012. DOI: 10.1145/2317956.2318018. URL http://dl.acm.org/citation.cfm?doid=2317956.2318018.


J. E. Hummel and I. Biederman. Dynamic binding in a neural network for shape recognition. Psychological Review, 99(3):480, 1992.

D. Huron and D. Shanahan. Eyebrow movements and vocal pitch height: evidence consistent with an ethological signal. The Journal of the Acoustical Society of America, 133(5):2947–2952, 2013.

S. Hutchins, C. Roquet, and I. Peretz. The vocal generosity effect: How bad can your singing be? Music Perception: An Interdisciplinary Journal, 30(2):147–159, 2012.

K. Irwin. Musipedia: The open music encyclopedia. Reference Reviews, 22(4):45–46, 2008.

C. Jacquemot and S. K. Scott. What is the relationship between phonological short-term memory and speech processing? Trends in Cognitive Sciences, 10(11):480–486, 2006.

P. Janata, J. L. Birk, J. D. Van Horn, M. Leman, B. Tillmann, and J. J. Bharucha. The cortical topography of tonal structures underlying Western music. Science, 298(5601):2167–2170, 2002.

M. Jeannerod. Mental imagery in the motor context. Neuropsychologia, 33(11):1419–1432, 1995.

A. R. Jensenius. Action-sound: Developing methods and tools to study music-related body movement. 2007.

A. R. Jensenius. Methods for Studying Music-Related Body Motion, pages 805–818. Springer Berlin Heidelberg, Berlin, Heidelberg, 2018. DOI: 10.1007/978-3-662-55004-5_38.

A. R. Jensenius, T. Kvifte, and R. I. Godøy. Towards a gesture description interchange format. In Proceedings of the 2006 Conference on New Interfaces for Musical Expression, pages 176–179. IRCAM—Centre Pompidou, 2006.

C. Jorgensen and K. Binsted. Web browser control using EMG based sub vocal speech recognition. In Proceedings of the 38th Annual Hawaii International Conference on System Sciences, pages 294c–294c. IEEE, 2005.

P. N. Juslin, L. Harmat, and T. Eerola. What makes music emotionally significant? Exploring the underlying mechanisms. Psychology of Music, 42(4):599–623, 2014.

L. Kang and H.-Y. Chien. Hé: Calligraphy as a musical interface. In NIME, pages 352–355, 2010.

A. T. Katz. Heinrich Schenker's method of analysis. The Musical Quarterly, 21(3):311–329, 1935.

109

Page 126: Computational Analysis of Melodic Contour and Body Movement

Bibliography

T. Kelkar. Applications of Gesture and Spatial Cognition in Hindustani VocalMusic. PhD thesis, International Institute of Information Technology, 2015.

T. Kelkar and A. R. Jensenius. Analyzing free-hand sound-tracings of melodicphrases. Applied Sciences, 8(1):135, 2018.

A. Kendon. Gesture: Visible action as utterance. Cambridge University Press,2004.

M. Kennedy, T. Rutherford-Johnson, and J. Kennedy. The Oxford dictionary ofmusic. Oxford University Press, 2013.

R. Kent. Coarticulation in recent speech production. Jornal of Phonetics, 5(1):15–133, 1977.

Y. E. Kim, W. Chai, R. Garcia, and B. Vercoe. Analysis of a contour-basedrepresentation for melody. In Proceedings of the 2nd Conference of theInternational Society for Music Information Retrieval, 2000.

M. Kimmel. Properties of cultural embodiment: Lessons from the anthropologyof the body. Body, language and mind, 2:77–108, 2008.

T. Kitahara, S. I. Giraldo, and R. Ramírez. Jamsketch: a drawing-based real-time evolutionary improvisation support system. In Proceedings of the 17thInternational Conference on New Interfaces for Musical Expression, pages505–506, 2017.

D. Knox, S. Beveridge, L. A. Mitchell, and R. A. MacDonald. Acoustic analysisand mood classification of pain-relieving music. The Journal of the AcousticalSociety of America, 130(3):1673–1682, 2011.

E. Kobatake and K. Tanaka. Neuronal selectivities to complex object featuresin the ventral visual pathway of the macaque cerebral cortex. Journal ofneurophysiology, 71(3):856–867, 1994.

E. Kohler, C. Keysers, M. A. Umilta, L. Fogassi, V. Gallese, and G. Rizzolatti.Hearing sounds, understanding actions: action representation in mirrorneurons. Science, 297(5582):846–848, 2002.

S. M. Kosslyn, W. L. Thompson, and G. Ganis. The case for mental imagery.Oxford University Press, 2006.

C. L. Krumhansl, P. Toivanen, T. Eerola, P. Toiviainen, T. Järvinen, andJ. Louhivuori. Cross-cultural music cognition: Cognitive methodology appliedto north sami yoiks. Cognition, 76(1):13–58, 2000.

K. Kuiper and D. Haggo. Livestock auctions, oral poetry, and ordinary language.Language in society, 13(2):205–234, 1984.

M. Kussner. Shape, drawing and gesture: Cross-modal mappings of sound andmusic. PhD thesis, King’s College London, 2014.

110

Page 127: Computational Analysis of Melodic Contour and Body Movement

Bibliography

M. B. Küssner. Music and shape. Literary and Linguistic Computing, 28(3):472–479, 2013.

W. Laade and D. Christensen. Lappish joik songs from northern norway, 1956.

D. R. Ladd. Intonational phonology. Cambridge University Press, 2008.

G. Lakoff and M. Johnson. Conceptual metaphor in everyday language. Thejournal of Philosophy, 77(8):453–486, 1980.

O. Lartillot. Soundtracer, 2018. URL https://itunes.apple.com/us/app/soundtracer/id1320398109?mt=8.

O. Lartillot and P. Toiviainen. A matlab toolbox for musical feature extractionfrom audio. In International conference on digital audio effects, pages 237–244.Bordeaux, 2007.

M. Leman. Embodied music cognition and mediation technology. Mit Press, 2008.

M. Leman and A. Schneider. Origin and nature of cognitive and systematicmusicology: An introduction. In Music, Gestalt, and Computing, pages 11–29.Springer, 1997.

M. Leman, P.-J. Maes, L. Nijs, and E. Van Dyck. What Is Embodied MusicCognition?, pages 747–760. Springer Berlin Heidelberg, Berlin, Heidelberg,2018. ISBN 978-3-662-55004-5. DOI: 10.1007/978-3-662-55004-5_34.

F. Lerdahl and R. Jackendoff. A generative theory of tonal music. 1987.

M. Lesaffre, P.-J. Maes, and M. Leman. The Routledge companion to embodiedmusic interaction. Taylor & Francis, 2017.

D. Levertov. Poems of Denise Levertov, 1960-1967. New Directions Publishing,1983.

E. Levine. A theory of everything, February 2002. URL https://www.ted.com/talks/emily_levine_s_theory_of_everything.

A. M. Liberman and I. G. Mattingly. The motor theory of speech perceptionrevised. Cognition, 21(1):1–36, 1985.

C. P. Long. Aristotle’s phenomenology of form: The shape of beings that become.Epoché: A Journal for the History of Philosophy, 11(2):435–448, 2007.

N. Malandrakis, A. Potamianos, G. Evangelopoulos, and A. Zlatintsi. Asupervised approach to movie emotion tracking. In Acoustics, Speech andSignal Processing (ICASSP), 2011 IEEE International Conference on, pages2376–2379. IEEE, 2011.

S. C. Mangaoang, Aine, J. Brackett, and K. M. Smith. Here lies love andthe politics of disco-opera. In The Routledge Companion to Popular MusicAnalysis, pages 365–381. Routledge, 2018.

111

Page 128: Computational Analysis of Melodic Contour and Body Movement

Bibliography

P. R. Marler and H. Slabbekoorn. Nature’s music: the science of birdsong.Elsevier, 2004.

C. P. Martin and J. Tørresen. Microjam: An app for sharing tiny touch-screenperformances. In Proceedings of the International Conference on New Interfacesfor Musical Expression, pages 495–496. Aalborg University Copenhagen, 2017.

G. Mazzola and M. Andreatta. Diagrams, gestures and formulae in music.Journal of Mathematics and Music, 1(1):23–46, 2007.

G. Mazzola and O. Zahorka. The rubato workstation for musical analysis andperformance. In Proc. of the 3rd International Conference on Music Perceptionand Cognition (ICMPC). Liege, 1994.

D. McNeill. Hand and mind: What gestures reveal about thought. University ofChicago press, 1992.

J. Mehler and E. Dupoux. Nacer sabiendo: Introducción al desarrollo cognitivodel hombre, volume 5. Anaya-Spain, 1992.

B. Mintzer. Contemporary jazz etudes. Miami, FL: Warner Bros, 2004.

I. Molnar-Szakacs and K. Overy. Music and mirror neurons: from motionto’e’motion. Social cognitive and affective neuroscience, 1(3):235–241, 2006.

G. E. Moorhead and D. Pond. Music of young children. Pillsbury Foundationfor Advancement of Music Education, 1941.

C. F. Mora. Foreign language acquisition and melody singing. ELT journal, 54(2):146–152, 2000.

R. D. Morris. New directions in the theory and analysis of musical contour.Music Theory Spectrum, 15(2):205–228, 1993.

D. Müllensiefen, K. Frieler, et al. Cognitive adequacy in the measurementof melodic similarity: Algorithmic vs. human judgments. Computing inMusicology, 13(2003):147–176, 2004.

M. Müller. Information retrieval for music and motion, volume 2. Springer,2007.

T. Murphey. The song stuck in my head phenomenon: a melodic din in the lad?System, 18(1):53–64, 1990.

E. Narmour. The analysis and cognition of melodic complexity: The implication-realization model. University of Chicago Press, 1992.

B. Nettl. Music in primitive culture. Harvard University Press Cambridge, MA,1956.

S. Nooteboom. The prosody of speech: melody and rhythm. The handbook ofphonetic sciences, 5:640–673, 1997.

112

Page 129: Computational Analysis of Melodic Contour and Body Movement

Bibliography

K. Nymoen. Methods and Technologies for Analysing Links Between MusicalSound and Body Motion. PhD thesis, University of Oslo, 2013.

K. Nymoen, B. Caramiaux, M. Kozak, and J. Torresen. Analyzing sound tracings:A multimodal approach to music information retrieval. In Proceedings of the1st International Association for Computing Machinery Workshop on MusicInformation Retrieval with User-centered and Multimodal Strategies, MIRUM’11, pages 39–44, New York, NY, USA, 2011. Association for ComputingMachinery. ISBN 978-1-4503-0986-8. DOI: 10.1145/2072529.2072541. URLhttp://doi.acm.org/10.1145/2072529.2072541.

K. Nymoen, J. Torresen, R. Godøy, and A. Jensenius. A statistical approachto analyzing sound tracings. Speech, sound and music processing: Embracingresearch in India, pages 120–145, 2012.

K. Nymoen, R. I. Godøy, A. R. Jensenius, and J. Torresen. Analyzingcorrespondence between sound objects and body motion. Association forComputing Machinery Trans. Appl. Percept., 10(2):9:1–9:22, June 2013. ISSN1544-3558. DOI: 10.1145/2465780.2465783. URL http://doi.acm.org/10.1145/2465780.2465783.

U. of Washington Ethnomusicology Archives. Ragamala lecture-demonstrationsrecordings: Pandit k.g. ginde, 1991. URL http://archiveswest.orbiscascade.org/ark:/80444/xv80252.

H. Ohkushi, T. Ogawa, and M. Haseyama. Music recommendation according tohuman motion based on kernel cca-based relationship. EURASIP Journal onAdvances in Signal Processing, 2011(1):121, 2011.

J. E. Ollen. A criterion-related validity test of selected indicators of musicalsophistication using expert ratings. PhD thesis, The Ohio State University,2006.

A. Pannekamp, U. Toepel, K. Alter, A. Hahne, and A. D. Friederici. Prosody-driven sentence processing: An event-related brain potential study. Journalof cognitive neuroscience, 17(3):407–421, 2005.

D. Parsons. The directory of tunes and musical themes. Cambridge, Eng.: S.Brown, 1975.

S. Paschalidou, T. Eerola, and M. Clayton. Voice and movement as predictors ofgesture types and physical effort in virtual object interactions of classical indiansinging. In Proceedings of the 3rd International Symposium on Movement andComputing, page 45. Association for Computing Machinery, 2016.

A. D. Patel. Music, language, and the brain. Oxford university press, 2010.

M. Pearce, D. Müllensiefen, and G. A. Wiggins. A comparison of statisticaland rule-based models of melodic segmentation. In Proceedings of the 9thConference of the International Society for Music Information Retrieval, pages89–94, 2008.

113

Page 130: Computational Analysis of Melodic Contour and Body Movement

Bibliography

M. T. Pearce and G. A. Wiggins. Expectation in melody: The influence ofcontext and learning. Music Perception: An Interdisciplinary Journal, 23(5):377–405, 2006.

M. T. Pearce, D. Müllensiefen, and G. A. Wiggins. Melodic grouping in musicinformation retrieval: New methods and applications. In Advances in musicinformation retrieval, pages 364–388. Springer, 2010.

L. Pearson. Gesture in Karnatak Music: Pedagogy and Musical Structure inSouth India. PhD thesis, Durham University, 2016.

G. Perle. The Operas of Alban Berg, Volume II: Lulu, volume 2. Univ ofCalifornia Press, 1989.

I. Poggi. Towards the alphabet and the lexicon of gesture, gaze and touch. InVirtual Symposium on Multimodality of Human Communication. http://www.semioticon. com/virtuals/index. html, 2002.

F. Pulvermüller and L. Fadiga. Active perception: sensorimotor circuits as acortical basis for language. Nature reviews neuroscience, 11(5):351, 2010.

W. V. Quine. Identity, ostension, and hypostasis. The Journal of Philosophy, 47(22):621–633, 1950.

I. Quinn. The combinatorial model of pitch contour. Music Perception: AnInterdisciplinary Journal, 16(4):439–456, 1999.

M. Rahaim. Musicking Bodies: Gesture and Voice in Hindustani Music. WesleyanUniversity Press, 2012.

D. Reisberg. Auditory imagery. Psychology Press, 2014.

G. Rizzolatti and L. Craighero. The mirror-neuron system. Annu. Rev. Neurosci.,27:169–192, 2004.

H. H. Roberts. New phases in the study of primitive music. AmericanAnthropologist, 24(2):144–160, 1922.

D. A. Rosenbaum. Knowing hands: The cognitive psychology of manual control.Cambridge University Press, 2017.

L. D. Rosenblum, M. A. Schmuckler, and J. A. Johnson. The mcgurk effect ininfants. Perception & Psychophysics, 59(3):347–357, 1997.

E. D. Ross. Nonverbal aspects of language. Neurologic Clinics, 11(1):9–23, 1993.

U. Roy, T. Kelkar, and B. Indurkhya. Trap: An interactive system to generatevalid raga phrases from sound-tracings. In Proceedings of the 14th InternationalConference on New Interfaces of Musical Expression Conference, pages 243–246, 2014.

114

Page 131: Computational Analysis of Melodic Contour and Body Movement

Bibliography

E. Rusconi, B. Kwan, B. L. Giordano, C. Umilta, and B. Butterworth. Spatialrepresentation of pitch height: the smarc effect. Cognition, 99(2):113–129,2006.

C. Sachs. The wellsprings of music. Springer Science & Business Media, 2012.

J. Salamon, G. Peeters, and A. Röbel. Statistical characterisation of melodicpitch contours and its application for melody extraction. In ISMIR, pages187–192, 2012.

A. L. Samuel. Some studies in machine learning using the game of checkers. IBM J.Res. Dev., 3(3):210–229, July 1959. ISSN 0018-8646. DOI: 10.1147/rd.33.0210.URL http://dx.doi.org/10.1147/rd.33.0210.

M. Śarma. Tradition of Hindustani Music. APH Publishing, 2006.

P. E. Savage and Q. D. Atkinson. Automatic tune family identification by musicalsequence alignment. In Proceedings of the 16th Conference of the InternationalSociety for Music Information Retrieval, volume 163, pages 162–168, 2015.

P. E. Savage, A. T. Tierney, and A. D. Patel. Global music recordings supportthe motor constraint hypothesis for human and avian song contour. MusicPerception: An Interdisciplinary Journal, 34(3):327–334, 2017.

P. Schaeffer, F. B. Mâche, M. Philippot, F. Bayle, L. Ferrari, I. Malec, andB. Parmegiani. La musique concrète. Presses universitaires de France, 1967a.

P. Schaeffer, G. Reibel, B. Ferreyra, H. Chiarucci, F. Bayle, A. Tanguy, J.-L. Ducarme, J.-F. Pontefract, and J. Schwarz. Solfège de l’objet sonore.INA/GRM, 1967b.

E. G. Schellenberg, A. M. Krysciak, and R. J. Campbell. Perceiving emotionin melody: Interactive effects of pitch and rhythm. Music Perception: AnInterdisciplinary Journal, 18(2):155–171, 2000.

M. A. Schmuckler. Testing models of melodic contour similarity. MusicPerception: An Interdisciplinary Journal, 16(3):295–326, 1999.

M. A. Schmuckler. Pitch and pitch structures. Ecological psychoacoustics, pages271–315, 2004.

M. A. Schmuckler. Melodic contour similarity using folk melodies. MusicPerception: An Interdisciplinary Journal, 28(2):169–194, 2010.

K. Schulze, S. Koelsch, and V. Williamson. Auditory Working Memory, pages461–472. Springer Berlin Heidelberg, Berlin, Heidelberg, 2018. ISBN 978-3-662-55004-5. DOI: 10.1007/978-3-662-55004-5_24. URL https://doi.org/10.1007/978-3-662-55004-5_24.

R. Scruton. Understanding music: Philosophy and interpretation. BloomsburyPublishing, 2016.

115

Page 132: Computational Analysis of Melodic Contour and Body Movement

Bibliography

C. Seeger. On the moods of a music-logic. Journal of the American MusicologicalSociety, 13(1/3):224–261, 1960.

R. V. Shannon. Is birdsong more like speech or music? Trends in cognitivesciences, 20(4):245–247, 2016.

S. Shayan, O. Ozturk, and M. A. Sicoli. The thickness of pitch: Crossmodalmetaphors in farsi, turkish, and zapotec. The Senses and Society, 6(1):96–105,2011.

Y. K. Shin, R. W. Proctor, and E. J. Capaldi. A review of contemporaryideomotor theory. Psychological bulletin, 136(6):943, 2010.

P. Shove and B. H. Repp. Musical motion and performance: Theoretical andempirical perspectives. The practice of performance: Studies in musicalinterpretation, pages 55–83, 1995.

B. Sievers, L. Polansky, M. Casey, and T. Wheatley. Music and movementshare a dynamic structure that supports universal expressions of emotion.Proceedings of the National Academy of Sciences, 110(1):70–75, 2013.

S. M. Stalinski, E. G. Schellenberg, and S. E. Trehub. Developmental changes inthe perception of pitch contour: Distinguishing up from down. J. Acoust. Soc.Am, 124(3):1759–1763, 2008.

M. E. Tanenhaus and J. L. Lipeles. Miniaturized inertial measurement unit andassociated methods, Apr. 28 2009. US Patent 7,526,402.

D. Temperley. What’s key for key? the krumhansl-schmuckler key-findingalgorithm reconsidered. Music Perception: An Interdisciplinary Journal, 17(1):65–100, 1999.

D. Temperley. Composition, perception, and schenkerian theory. Music TheorySpectrum, 33(2):146–168, 2011.

J. Tenney and L. Polansky. Temporal gestalt perception in music. Journal ofMusic Theory, 24(2):205–241, 1980.

R. Thom. Paraboles et catastrophes. Flammarion, 1983.

A. T. Tierney, F. A. Russo, and A. D. Patel. The motor origins of human andavian song structure. Proceedings of the National Academy of Sciences, 108(37):15510–15515, 2011.

B. Tillmann and E. Bigand. Musical structure processing after repeated listening:schematic expectations resist veridical expectations. Musicae Scientiae, 14(2suppl):33–47, 2010.

B. Tillmann, J. J. Bharucha, and E. Bigand. Implicit learning of tonality: aself-organizing approach. Psychological review, 107(4):885, 2000.

116

Page 133: Computational Analysis of Melodic Contour and Body Movement

Bibliography

P. Toiviainen and T. Eerola. Visualization in comparative music research. InCOMPSTAT, pages 209–221. Springer, 2006.

E. Tolbert. Women cry with words: Symbolization of affect in the karelianlament. Yearbook for Traditional Music, 22:80–105, 1990.

L. J. Trainor, C. M. Austin, and R. N. Desjardins. Is infant-directed speechprosody a result of the vocal expression of emotion? Psychological science, 11(3):188–195, 2000.

S. E. Trehub, D. Bull, and L. A. Thorpe. Infants’ perception of melodies: Therole of melodic contour. Child development, pages 821–830, 1984.

A. Truslit. Creation and movement in music: A tale of musical performanceand its moving figures and music [shape and movement in music]. Berlin-Lichtenfelde: Vieweg, 1938.

J. Urbano, J. Lloréns, J. Morato, and S. Sánchez-Cuadrado. Mirex 2012 symbolicmelodic similarity: Hybrid sequence alignment with geometric representations.Music Information Retrieval Evaluation eXchange, pages 3–6, 2012.

P. Van Kranenburg, J. Garbers, A. Volk, F. Wiering, L. Grijp, and R. Veltkamp.Towards integration of music information retrieval and folk song research.In Proceedings of the 8th International Conference on Music InformationRetrieval, pages 505–8, 2007.

V. Vasant. Sangeet visharad, 1998.

A. Volk, J. Garbers, P. Van Kranenburg, F. Wiering, R. C. Veltkamp, and L. P.Grijp. Applying rhythmic similarity based on inner metric analysis to folksongresearch. In Proceedings of the 8th Conference of the International Society forMusic Information Retrieval, pages 293–296, 2007.

C. Von Ehrenfels. On “gestalt qualities.”. B. Smith (Ed. & Trans.), Foundationsof Gestalt theory, pages 82–117, 1988.

D. T. Vuvan and M. A. Schmuckler. Tonal hierarchy representations in auditoryimagery. Memory & cognition, 39(3):477–490, 2011.

P. Walker, J. G. Bremner, U. Mason, J. Spring, K. Mattock, A. Slater, andS. P. Johnson. Preverbal infants are sensitive to cross-sensory correspondencesmuch ado about the null results of lewkowicz and minar (2014). Psychologicalscience, 25(3):835–836, 2014.

H. Wallach. Über visuell wahrgenommene bewegungsrichtung. PsychologischeForschung, 20(1):325–380, 1935.

K. Wang, Q. Yin, W. Wang, S. Wu, and L. Wang. A comprehensive survey oncross-modal retrieval. arXiv preprint arXiv:1607.06215, 2016.

117

Page 134: Computational Analysis of Melodic Contour and Body Movement

Bibliography

M. W. Weiss, S. E. Trehub, and E. G. Schellenberg. Something in the wayshe sings: Enhanced memory for vocal melodies. Psychological Science,23(10):1074–1078, 2012. DOI: 10.1177/0956797612442552. URL https://doi.org/10.1177/0956797612442552. PMID: 22894936.

M. W. Weiss, P. Vanzella, E. G. Schellenberg, and S. E. Trehub. Pianists exhibitenhanced memory for vocal melodies but not piano melodies. The QuarterlyJournal of Experimental Psychology, 68(5):866–877, 2015.

M. W. Weiss, S. E. Trehub, E. G. Schellenberg, and P. Habashi. Pupils dilatefor vocal or familiar music. Journal of Experimental Psychology: HumanPerception and Performance, 42(8):1061, 2016.

R. J. West and R. Fryer. Ratings of suitability of probe tones as tonics afterrandom orderings of notes of the diatonic scale. Music Perception: AnInterdisciplinary Journal, 7(3):253–258, 1990.

B. W. White. Recognition of distorted melodies. The American journal ofpsychology, 73(1):100–107, 1960.

G. Wiggins, E. Miranda, A. Smaill, and M. Harris. A framework for the evaluationof music representation systems. Computer Music Journal, 17(3):31–42, 1993.

H. Wittmann. Intonation in glottogenesis. The melody of language: FestschriftDwight L. Bolinger, pages 315–29, 1980.

D. H. Wolpert, W. G. Macready, et al. No free lunch theorems for optimization.IEEE transactions on evolutionary computation, 1(1):67–82, 1997.

S. Wuerger, R. Shapley, and N. Rubin. “on the visually perceived direction ofmotion” by hans wallach: 60 years later. Perception, 25(11):1317–1367, 1996.

S. Youngerman. Curt sachs and his heritage: A critical review of world historyof the dance with a survey of recent studies that perpetuate his ideas. DanceResearch Journal, 6(2):6–19, 1974.

B. Yung. Cantonese opera: performance as creative process. CambridgeUniversity Press, 1989.

B. Yung. The relationship of text and tune in chinese opera. In Music, language,speech and brain, pages 408–418. Springer, 1991.

R. J. Zatorre and S. R. Baum. Musical melody and speech intonation: Singinga different tune. PLoS biology, 10(7):e1001372, 2012.

R. J. Zatorre and A. R. Halpern. Mental concerts: Musical imagery andauditory cortex. Neuron, 47(1):9 – 12, 2005. ISSN 0896-6273. DOI:https://doi.org/10.1016/j.neuron.2005.06.013. URL http://www.sciencedirect.com/science/article/pii/S0896627305005180.

118

Page 135: Computational Analysis of Melodic Contour and Body Movement

Papers


Paper I

Exploring melody and motion features in “sound-tracings”

Tejaswinee Kelkar, Alexander Refsum Jensenius
Published in Proceedings of the 14th Sound and Music Computing Conference, July 2017, pp. 98–103.

I


EXPLORING MELODY AND MOTION FEATURES IN “SOUND-TRACINGS”

Tejaswinee Kelkar
University of Oslo, Department of Musicology
[email protected]

Alexander Refsum Jensenius
University of Oslo, Department of Musicology
[email protected]

ABSTRACT

Pitch and spatial height are often associated when describing music. In this paper we present results from a sound-tracing study in which we investigate such sound–motion relationships. The subjects were asked to move as if they were creating the melodies they heard, and their motion was captured with an infra-red, marker-based camera system. The analysis is focused on calculating feature vectors typically used for melodic contour analysis. We use these features to compare melodic contour typologies with motion contour typologies. This is based on using proposed feature sets that were made for melodic contour similarity measurement. We apply these features to both the melodies and the motion contours to establish whether there is a correspondence between the two, and find the features that match the most. We find a relationship between vertical motion and pitch contour when evaluated through features rather than simply comparing contours.

1. INTRODUCTION

How can we characterize melodic contours? This question has been addressed through parametric, mathematical, grammatical, and symbolic methods. Characterizing melodic contour has applications in finding similarity between melodic fragments, indexing musical pieces and, more recently, finding motifs in large corpora of music. In this paper, we compare pitch contours with motion contours derived from people’s expressions of melodic pitch as movement. We conduct an experiment using motion capture to measure body movements through infra-red cameras, and analyse the vertical motion to compare it with pitch contours.

1.1 Melodic Similarity

Marsden disentangles some of our simplified concepts of melodic contour similarity, explaining that the conception of similarity itself means different things at different times with regard to melodies [1]. Not only are these differences culturally contingent, but they are also dependent upon the way in which music is represented as data. Our conception of melodic similarity can be compared to the distances of melodic objects in a hyperspace of all possible melodies. Computational analyses of melodic similarity have also been essential for dealing with issues regarding copyright infringement [2], “query by humming” systems used for music retrieval [3, 4], and for use in psychological prediction [5].

1.2 Melodic Contour Typologies

Melodic contours serve as one of the features that can describe melodic similarity. Contour typologies, and the building of feature sets for melodic contour, have been experimented with in many ways. Two important variations stand out: the way in which melodies are represented and features are extracted, and the way in which typologies are derived from this set of features, using mathematical methods to establish similarity. Historically, melodic contour has been analysed in two principal ways, using (a) symbolic notation, or (b) recorded audio. These two methods differ vastly in their interpretation of contour and features.

1.3 Extraction of melodic features

The extraction of melodic contours from symbolic features has been used to create indexes and dictionaries of melodic material [6]. This method simply uses signs such as +/-/= to indicate the relative movement of each note. Adams proposes a method through which the key points of a melodic contour (the high, low, initial, and final points of a melody) are used to create a feature vector that he then uses to create typologies of melody [7]. It is impossible to know with how much success we can constrain melodic contours to finite typologies, although this has been attempted through these methods and others. Other methods, such as that of Morris, constrain themselves to tonal melodies [8], and yet others, such as Friedmann’s, rely on relative pitch intervals [9]. Aloupis et al. use geometrical representations for melodic similarity search. Although many of these methods have found robust applications, melodic contour analysis from notation is harder to apply to diverse musical systems. This is particularly so for musics that are not based on Western music notation. Ornaments, for example, are easier to represent as sound signals than as symbolic notation.
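
As an illustration of this kind of symbolic encoding, here is a minimal sketch in Python of a Parsons-style +/-/= contour code (the function name and example pitches are our own, for illustration only):

    def parsons_code(pitches):
        """Encode each note-to-note move as u (up), d (down) or r (repeat),
        in the spirit of the +/-/= signs described above."""
        symbols = {1: 'u', -1: 'd', 0: 'r'}
        code = '*'  # conventional placeholder for the first note
        for prev, curr in zip(pitches, pitches[1:]):
            code += symbols[(curr > prev) - (curr < prev)]  # sign of interval
        return code

    print(parsons_code([60, 62, 62, 59, 64]))  # MIDI pitches -> '*urdu'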

Extraction of contour profiles from audio-based pitch extraction algorithms has been demonstrated in several recent studies [10, 11], including studies of specific genres such as flamenco voice [12, 13]. While such audio-based contour extraction may give us a lot of insight about the musical data at hand,


Figure 1. Examples of pitch features of the four selected melodies (pitch in Hz plotted against time in centiseconds), extracted through autocorrelation.

the generalisability of such a method is harder to evaluate than that of the symbolic methods.

1.4 Method for similarity finding

Some of these methods use matrix similarity computation [14], while others use edit distance-based metrics [15] or string matching methods [16]. Reduction of sound signals to symbolic data that can then be processed in any of these ways is yet another method to analyse melodic contour. This paper focuses on evaluating melodic contour features through comparison with motion contours, as opposed to comparison with other melodic phrases. This would shed light on whether the perception of contour as a feature is even consistent and measurable, or whether we need other types of features to capture contour perception.
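
To make the edit-distance family of measures concrete, the following sketch applies a standard Levenshtein distance to two symbolic contour strings. This is a generic illustration, not the specific metric used in [15]:

    def edit_distance(a, b):
        """Levenshtein distance between two contour strings."""
        row = list(range(len(b) + 1))      # distances for the empty prefix of a
        for i, ca in enumerate(a, 1):
            prev, row[0] = row[0], i       # 'prev' tracks the diagonal value
            for j, cb in enumerate(b, 1):
                prev, row[j] = row[j], min(row[j] + 1,         # deletion
                                           row[j - 1] + 1,     # insertion
                                           prev + (ca != cb))  # substitution
        return row[-1]

    print(edit_distance('*udud', '*uudd'))  # -> 2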

Yet another question is how to evaluate contours and their behaviours when dealing with data such as motion responses to musical material. Motion data could be transposed to fit the parameters required for score-based analysis, which could possibly yield interesting results. Contour extraction from melody, motion, and their derivatives could also demonstrate interesting similarities between musical motion and melodic motion. This is what this paper tries to address: looking at the benefits and disadvantages of using feature vectors to describe melodic features in a multimodal context. The following research questions were the most important for the scope of this paper:

1. Are the melodic contours described in previous studies relevant for our purpose?

2. Which features of melodic contours correspond to features extracted from vertical motion in melodic tracings?

In this paper we compare melodic movement, in terms of pitch, with vertical contours derived from motion capture recordings. The focus is on three features of melodic contour, using a small dataset containing motion responses of 3 people to 4 different melodies. This dataset is from a larger experiment containing 32 participants and 16 melodies.

2. BACKGROUND

2.1 Pitch Height and Melodic Contour

This paper is concerned with melody, that is, sequences of pitches, and how people trace melodies with their hands. Pitch appears to be a musical feature that people easily relate to when tracing sounds, even when the timbre of the sound changes independently of the pitch [17–19]. Melodic contour has been studied in terms of symbolic pitch [20, 21]. Eitan explores the multimodal associations with pitch height and verticality in his papers [22, 23]. Our subjective experience of melodic contours in cross-cultural contexts is also investigated in Eerola’s paper [24].

The ups and downs in melody have often been compared to other multimodal features that also seem to have up-down contours, such as words that signify verticality. This mapping of pitch to verticality has also been used as a feature in many visualization algorithms. In this paper, we focus particularly on the vertical movement in the tracings of participants, to investigate whether there is, indeed, a relationship with the vertical contours of the melodies. We also want to see if this relationship can be extracted through features that have been proposed to represent melodic contour. If the features proposed for melodic contours are not enough, we wish to investigate other methods that can be used to represent a common feature vector between melody and motion in the vertical axis. All 4 melodies in the small dataset that we created for the purposes of this experiment are represented as pitch in Figure 1.

Figure 2. Example plots of some sound-tracing responses to Melody 1 (left- and right-hand plots for Participants 1–3). Time (in frames) runs along the x-axes, while the y-axes represent the vertical position extracted from the motion capture recordings (in millimetres). LH = left hand, RH = right hand.


Figure 3. A symbolic transcription of Melody 1, a sustained vibrato of a high soprano. The notated version differs significantly from the pitch profile as seen in Figure 2. The trill and vibrato are dimensions that people respond to in their motion tracings, but they do not clearly appear in the notated version.

          Feature 1                   Feature 3
Melody 1  [+, -, +, -, +, -]         [0, 4, -4, 2, -2, 4, 0, -9]
Melody 2  [+, -, -]                  [0, 2, -2, -2, 0, 0]
Melody 3  [+, -, -, -, -, -, -]      [0, -2, -4, -1, -1, -1, -4, -2, -3, 0, 0, 0]
Melody 4  [+, -, +, -, -, +, -, -]   [0, -2, 2, -4, 2, -2, 4, -2, -2]

Table 1. Examples of Features 1 and 3 for all four melodies, computed from the score.

2.2 Categories of contour feature descriptors

In the following paragraphs, we describe how the feature sets selected for comparison in this study are computed. The feature sets that come from symbolic notation analysis are revised to compute the same features from the pitch-extracted profiles of the melodic contours.

2.2.1 Feature 1: Sets of signed pitch movement direction

These features are described in [6], and involve a description of the points in the melody where the pitch ascends or descends. This method is applied by calculating the first derivatives of the pitch contours, and assigning a change of sign whenever the velocity spike exceeds the standard deviation of the velocity in either direction. This helps us find the transitions that are more important to the melody, as opposed to movement that stems from vibratos, for example.
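
A minimal sketch of this computation, assuming the contour arrives as a one-dimensional array of pitch (or position) samples; this is our reading of the description above, not the authors’ exact code:

    import numpy as np

    def feature1_signs(contour):
        """Signs of pitch movement at salient transitions only: keep velocity
        spikes larger than one standard deviation, filtering out vibrato-scale
        wiggles."""
        velocity = np.diff(np.asarray(contour, dtype=float))  # first derivative
        salient = velocity[np.abs(velocity) > velocity.std()]
        return np.sign(salient).astype(int)  # e.g. array([ 1, -1,  1, -1])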

2.2.2 Feature 2: Initial, Final, High, Low features

Adams and Morris [7, 8] propose melodic contour typologies and melodic contour description models that rely on encoding melodic features using these descriptors, creating a feature vector of those descriptors. For this study, we use the feature set containing the initial, final, high and low points of the melodic and motion contours, computed directly from normalized contours.
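
A corresponding sketch for this feature vector; the z-score normalization is an assumption on our part, so that melody contours (in Hz) and motion contours (in mm) become comparable:

    import numpy as np

    def feature2_ifhl(contour):
        """[initial, final, highest, lowest] descriptors of a normalized
        contour, after the Adams/Morris-style typology features."""
        c = np.asarray(contour, dtype=float)
        c = (c - c.mean()) / c.std()  # assumed normalization step
        return np.array([c[0], c[-1], c.max(), c.min()])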

2.2.3 Feature 3: Relative interval encoding

In these sets of features, for example as proposed by Friedmann, Quinn, and Parsons [6, 9, 14], the relative pitch distances are encoded either as a series of ups and downs, combined with features such as operators (<, =, >), or as distances of relative pitches in terms of numbers. Each of these methods employs a different strategy to label the high and low

Figure 4. Lab set-up for the experiment, with 21 markers positioned on the body. Eight motion capture cameras are mounted on the walls.

points of melodies. Some rely on tonal pitch class distribution, such as Morris’s method, which is also analogous to Schenkerian analysis in terms of ornament reduction, while others, such as Friedmann’s, only encode changes that are relative to the ambit of the current melodic line. For the purposes of this study, we pick the latter method, given that the melodies in this context are not tonal in the way that would be relevant to Morris.
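
Read from a score, this encoding reduces to signed semitone distances between successive notes, with a leading 0, matching the Feature 3 vectors in Table 1. A sketch, with a hypothetical note sequence chosen to reproduce Melody 2’s vector:

    def feature3_intervals(pitches):
        """Relative interval encoding: the signed distance of each note from
        the previous one, with a leading 0 for the first note."""
        return [0] + [b - a for a, b in zip(pitches, pitches[1:])]

    print(feature3_intervals([62, 64, 62, 60, 60, 60]))  # [0, 2, -2, -2, 0, 0]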

3. EXPERIMENT DESCRIPTION

The experiment was designed so that subjects were instructed to perform hand movements as if they were creating the melodic fragments that they heard. The idea was that they would ‘shape’ the sound with their hands in physical space. As such, this type of free-hand sound-tracing task is quite different from some sound-tracing experiments using pen on paper or on a digital tablet. Participants in a free-hand tracing situation would be less fixated upon the precise locations of all of their previous movements, thus giving us insight into the perceptually salient properties of the melodies that they choose to represent.

3.1 Stimuli

We selected 16 melodic fragments from four genres of music that use vocalisations without words:

1. Scat singing
2. Western classical vocalise
3. Sami joik
4. North Indian music

The melodic fragments were taken from real recordings, containing complete phrases. This retained the melodies in the form in which they were sung and heard, thus preserving their ecological quality. The choice of vocal melodies was intended both to eliminate the effect of words on the perception of music and to eliminate the possibility of imitating the sound-producing actions on instruments (‘air-instrument’ performance) [25].

There was a pause before and after each phrase. The phrases were an average of 4.5 seconds in duration (s.d. 1.5 s). These samples were presented in two conditions: (1) the real recording, and (2) a re-synthesis through a sawtooth wave from an autocorrelation analysis of the pitch profile. There was thus a total of 32 stimuli per participant.


The sounds were played at a comfortable listening level through a Genelec 8020 speaker, placed 3 metres ahead of the participants at a height of 1 metre.

3.2 Participants

A total of 32 participants (17 female, 15 male) were recruited to move to the melodic stimuli in our motion capture lab. The mean age of the participants was 31 years (SD = 9). The participants were recruited from the University of Oslo and included students and employees, who were not necessarily from a musical background.

The study was reported to, and obtained ethical approval from, the Norwegian Centre for Research Data. The participants signed consent forms and were free to withdraw during the experiment if they wished.

3.3 Lab set-up

The experiment was run in the fourMs motion capture lab, using a Qualisys motion capture system with eight wall-mounted Oqus 300 cameras (Figure 4), capturing at 200 Hz. The experiment was conducted in dim light, with no observers, to make sure that participants felt free to move as they liked. A total of 21 markers were placed on the body of each participant: on the head, shoulders, elbows, wrists, knees, ankles, the torso, and the back. The recordings were post-processed in Qualisys Track Manager (QTM) and analysed further in Matlab.

3.4 Procedure

The participants were asked to trace all 32 melody phrases (in random order) as if their hand motion was ‘producing’ the melody. The experiment lasted for a total duration of 10 minutes. After post-processing the data from this experiment, we get a dataset of the motion of 21 markers while the participants performed sound-tracing. We take a subset of this data for further analysis of contour features. In this step, we extract the motion data for the left and right hands from a small subset of 4 melodies performed by 3 participants. We focus on the vertical movement of both hands, given that this analysis pertains to the verticality of pitch movement. We process these motion contours, along with the pitch contours for the 4 selected melodies, through the 3 melodic features described in Section 2.2.
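
As a sketch of this extraction step (the marker indices and array layout here are assumptions; the actual labels come from the QTM export):

    import numpy as np

    LEFT_HAND, RIGHT_HAND = 5, 6  # hypothetical marker indices

    def vertical_hand_motion(mocap):
        """Return the vertical (z) components, in mm, of the two hand markers
        from a motion-capture array shaped (frames, markers, 3)."""
        return mocap[:, LEFT_HAND, 2], mocap[:, RIGHT_HAND, 2]

    # e.g. 10 s at 200 Hz = 2000 frames of 21 markers:
    lh_z, rh_z = vertical_hand_motion(np.zeros((2000, 21, 3)))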

4. MELODIC CONTOUR FEATURES

For the analysis, we record the following feature vectors through some of the methods mentioned in Section 1.2. The feature vectors are calculated as follows:

Feature 1 Signed interval distances: The obtained motion and pitch contours are binned iteratively to calculate average values in each section. Mean vertical motion for all participants is calculated. This mean motion is then binned in the way that the melodic contours are binned. The difference between the values of successive bins is calculated. The signs of these differences are concatenated to form a feature vector composed of signed distances.

Figure 5. Example of a post-processed motion capture recording. The markers are labelled, and their relative positions in the coordinate system are measured.

Feature 2 Initial, Final, Highest, Lowest vector: These features were obtained by calculating the four features mentioned above as indicators of the melodic contour. This method has been used to form a typology of melodic contours.

Feature 3 Signed relative distances: The obtained signs from Feature 1 are combined with the relative distances of each successive bin from the next. The signs and the values are combined to give a more complete picture. Here we considered the pitch values at the bins. These did not represent pitch class sets, and therefore made the computation ‘genre-agnostic’.

The signed relative distances of the melodies are then compared to the signed relative distances of the average vertical motion to obtain a feature vector, as sketched in the example below.
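
Putting these steps together, the following sketch bins two contours and compares their signed relative distances. The stand-in contours are synthetic, for illustration only:

    import numpy as np

    def signed_relative_distances(signal, n_bins):
        """Bin a contour, take bin means, and return the signed differences
        between successive bins (cf. Features 1 and 3 above)."""
        parts = np.array_split(np.asarray(signal, dtype=float), n_bins)
        return np.diff([p.mean() for p in parts])

    t = np.linspace(0, 3, 700)
    pitch = 300 + 100 * np.sin(t)          # stand-in pitch contour, in Hz
    motion = 1000 + 400 * np.sin(t - 0.2)  # stand-in mean vertical motion, in mm

    f_pitch = signed_relative_distances(pitch, 8)
    f_motion = signed_relative_distances(motion, 8)
    print(np.corrcoef(f_pitch, f_motion)[0, 1])  # correlation of the two vectors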

5. RESULTS

5.1 Correlation between pitch and vertical motion

Feature 3, which considered an analysis of signed relative distances, had a correlation coefficient of 0.292 for all 4 melodies, with a p value of 0.836, which does not show a confident trend. Feature 2, containing a feature vector for melodic contour typology, performs with a correlation coefficient of 0.346, indicating a weak positive relationship, with a p value of 0.07, which suggests a marginally significant positive correlation. This feature performs well, but is not robust in terms of its representation of the contour itself, and fails when individual tracings are compared to melodies, yielding an overall coefficient of 0.293.


Figure 6. Plots of the representation of Features 1 and 3: (a) motion responses (mean vertical motion of the right hand, in mm, for Melodies 1–4) and (b) melodic contour bins (mean segmentation bins of pitches, in Hz, for Melodies 1–4), each plotted against bins per note. These features are compared to analyse the similarity of the contours.

5.2 Confusion between tracing and target melody

As seen in the confusion matrix in Figure 7, the tracings are not clearly classified as target melodies by direct comparison of the contour values themselves. This indicates that although the feature vectors might show a strong trend of vertical motion mapping to pitch contours, this is not enough for significant classification. This demonstrates the need for feature vectors that adequately describe what is going on in both music and motion.

6. DISCUSSION

A significant problem when analysing melodies through symbolic data is that much of the representation of texture, as explained regarding Melody 2, gets lost. Vibratos, ornaments, and other elements that might be significant for the perception of musical motion cannot be captured efficiently through these methods. However, these ornaments certainly seem salient for people’s bodily responses. Further work needs to be carried out to explain the relationship between ornaments and motion, and this relationship might have little or nothing to do with vertical motion.

We also found that the performance of a tracing is fairly intuitive to the eye. The decisions participants make when choosing particular methods of expressing the music through motion do not appear odd when seen from a human perspective, and yet characterizing the significant features for this cross-modal comparison is a much harder question.

Our results show that vertical motion seems to correlate with pitch contours in a variety of ways, but most significantly when calculated in terms of signed relative values. Signed relative values, as in Feature 3, also maintain the context of the melodic phrase itself, and this is seen to be significant for sound-tracings. Interval distances matter less than the current ambit of the melody that is being traced.

Other contours apart from pitch and melody are also significant for this discussion, especially timbral and dynamic changes. However, the relationships between those and motion were beyond the scope of this paper. The interpretation of motion other than vertical motion is also not handled within this paper.

The features that were shown to be significant can be applied to the whole dataset to see relationships between vertical motion and melody. Contours of dynamic and timbral change could also be compared against melodic tracings with the same methods.

7. REFERENCES

[1] A. Marsden, “Interrogating melodic similarity: a definitive phenomenon or the product of interpretation?” Journal of New Music Research, vol. 41, no. 4, pp. 323–335, 2012.

[2] C. Cronin, “Concepts of melodic similarity in music copyright infringement suits,” Computing in Musicology: A Directory of Research, no. 11, pp. 187–209, 1998.

[3] A. Ghias, J. Logan, D. Chamberlin, and B. C. Smith, “Query by humming: musical information retrieval in an audio database,” in Proceedings of the Third ACM International Conference on Multimedia. ACM, 1995, pp. 231–236.

[4] L. Lu, H. You, H. Zhang et al., “A new approach to query by humming in music retrieval,” in ICME, 2001, pp. 22–25.


Figure 7. Confusion matrix for Feature 3 (motion traces plotted against target melodies, including the synthesized variants), used to analyse the classification of raw motion contours against pitch contours for the 4 melodies.

[5] N. N. Vempala and F. A. Russo, “Predicting emotion from music audio features using neural networks,” in Proceedings of the 9th International Symposium on Computer Music Modeling and Retrieval (CMMR). Lecture Notes in Computer Science, London, UK, 2012, pp. 336–343.

[6] D. Parsons, The Directory of Tunes and Musical Themes. Cambridge, Eng.: S. Brown, 1975.

[7] C. R. Adams, “Melodic contour typology,” Ethnomusicology, pp. 179–215, 1976.

[8] R. D. Morris, “New directions in the theory and analysis of musical contour,” Music Theory Spectrum, vol. 15, no. 2, pp. 205–228, 1993.

[9] M. L. Friedmann, “A methodology for the discussion of contour: Its application to Schoenberg’s music,” Journal of Music Theory, vol. 29, no. 2, pp. 223–248, 1985.

[10] J. Salamon and E. Gomez, “Melody extraction from polyphonic music signals using pitch contour characteristics,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 6, pp. 1759–1770, 2012.

[11] R. M. Bittner, J. Salamon, S. Essid, and J. P. Bello, “Melody extraction by contour classification,” in Proc. ISMIR, pp. 500–506.

[12] E. Gomez and J. Bonada, “Towards computer-assisted flamenco transcription: An experimental comparison of automatic transcription algorithms as applied to a cappella singing,” Computer Music Journal, vol. 37, no. 2, pp. 73–90, 2013.

[13] J. C. Ross, T. Vinutha, and P. Rao, “Detecting melodic motifs from audio for Hindustani classical music,” in ISMIR, 2012, pp. 193–198.

[14] I. Quinn, “The combinatorial model of pitch contour,” Music Perception: An Interdisciplinary Journal, vol. 16, no. 4, pp. 439–456, 1999.

[15] G. T. Toussaint, “A comparison of rhythmic similarity measures,” in ISMIR, 2004.

[16] D. Bainbridge, C. G. Nevill-Manning, I. H. Witten, L. A. Smith, and R. J. McNab, “Towards a digital library of popular music,” in Proceedings of the Fourth ACM Conference on Digital Libraries. ACM, 1999, pp. 161–169.

[17] K. Nymoen, “Analyzing sound tracings: a multimodal approach to music information retrieval,” in Proceedings of the 1st International ACM Workshop on Music Information Retrieval with User-Centered and Multimodal Strategies, 2011.

[18] M. B. Kussner and D. Leech-Wilkinson, “Investigating the influence of musical training on cross-modal correspondences and sensorimotor skills in a real-time drawing paradigm,” Psychology of Music, vol. 42, no. 3, pp. 448–469, 2014.

[19] G. Athanasopoulos and N. Moran, “Cross-cultural representations of musical shape,” Empirical Musicology Review, vol. 8, no. 3-4, pp. 185–199, 2013.

[20] M. A. Schmuckler, “Testing models of melodic contour similarity,” Music Perception: An Interdisciplinary Journal, vol. 16, no. 3, pp. 295–326, 1999.

[21] J. B. Prince, M. A. Schmuckler, and W. F. Thompson, “Cross-modal melodic contour similarity,” Canadian Acoustics, vol. 37, no. 1, pp. 35–49, 2009.

[22] Z. Eitan and R. Timmers, “Beethoven’s last piano sonata and those who follow crocodiles: Cross-domain mappings of auditory pitch in a musical context,” Cognition, vol. 114, no. 3, pp. 405–422, 2010.

[23] Z. Eitan and R. Y. Granot, “How music moves,” Music Perception: An Interdisciplinary Journal, vol. 23, no. 3, pp. 221–248, 2006.

[24] T. Eerola and M. Bregman, “Melodic and contextual similarity of folk song phrases,” Musicae Scientiae, vol. 11, no. 1 suppl, pp. 211–233, 2007.

[25] R. I. Godøy, E. Haga, and A. R. Jensenius, “Playing air instruments: mimicry of sound-producing gestures by novices and experts,” in International Gesture Workshop. Springer, 2005, pp. 256–267.


Paper III

Analyzing free-hand sound-tracings of melodic phrases

Tejaswinee Kelkar, Alexander Refsum Jensenius
Published in Applied Sciences, January 2018, pp. 137–157.

III


Analyzing Free-Hand Sound-Tracings of Melodic Phrases

Tejaswinee Kelkar * and Alexander Refsum Jensenius

University of Oslo, Department of Musicology, RITMO Centre for Interdisciplinary Studies in Rhythm, Time and Motion, 0371 Oslo, Norway; [email protected]
* Correspondence: [email protected]; Tel.: +47-4544-8254

Academic Editor: Vesa Valimaki
Received: 31 October 2017; Accepted: 15 January 2018; Published: 18 January 2018

Abstract: In this paper, we report on a free-hand motion capture study in which 32 participants ‘traced’ 16 melodic vocal phrases with their hands in the air in two experimental conditions. Melodic contours are often thought of as correlated with vertical movement (up and down) in time, and this was also our initial expectation. We did find an arch shape for most of the tracings, although this did not correspond directly to the melodic contours. Furthermore, representation of pitch in the vertical dimension was but one of a diverse range of movement strategies used to trace the melodies. Six different mapping strategies were observed, and these strategies have been quantified and statistically tested. The conclusion is that metaphorical representation is much more common than a ‘graph-like’ rendering for such a melodic sound-tracing task. Other findings include a clear gender difference for some of the tracing strategies and an unexpected representation of melodies in terms of a small object for some of the Hindustani music examples. The data also show a tendency of participants moving within a shared ‘social box’.

Keywords: motion; melody; shape; sound-tracing; multi-modality

1. Introduction

How do people think about melodies as shapes? This question comes out of the authors’ general interest in understanding more about how spatiotemporal elements influence the cognition of music. When it comes to the topic of melody and shape, these terms often seem to be interwoven. In fact, the Concise Oxford Dictionary of Music defines melody as: “A succession of notes, varying in pitch, which has an organized and recognizable shape.” [1]. Here, shape is embedded as a component in the very definition of melody. However, what is meant by the term ‘melodic shape’, and how can we study such melodic shapes and their typologies?

Some researchers have argued for thinking of free-hand movements to music (or ‘air instrument performance’ [2]) as visible utterances similar to co-speech gestures [3–5]. From the first author’s experience as an improvisational singer, a critical part of learning a new singing culture was the physical representation of melodic content. This physical representation includes bodily posture, gestural vocabulary and the use of the body to communicate sung phrases. In improvised music, this also includes the way in which one uses the hands to guide the music and the expectation of a familiar audience from the performing body. These body movements may refer to spatiotemporal metaphors, quite like the ones used in co-speech gestures.

In their theory of cognitive metaphors, Lakoff and Johnson point out how the metaphors in everyday life represent the structure through which we conceptualize one domain with the representation of another [6]. Zbikowski uses this theory to elaborate how words used to describe pitches in different languages are mapped onto the metaphorical space of the ‘vertical dimension’ [7]. Descriptions of melodies often use words related to height, for example: a ‘high’- or ‘low’-pitched voice, melodies going ‘up’ and ‘down’. Shayan et al. suggest that this mapping might be more strongly present in Western cultures, while the use of other metaphors in other languages, such as thick and thin pitch, might explain pitch using other non-vertical one-dimensional mapping schemata [8]. The vertical metaphor, when tested with longer melodic lines, shows that we respond non-linearly to the vertical metaphors of static and dynamic pitch stimuli [9]. Research in music psychology has investigated both the richness of this vocabulary and its perceptual and metaphorical allusions [10]. However, the idea that the vertical dimension is the most important schema of melodic motion is very persistent [11]. Experimentally, pitch-height correspondences are often elicited by comparing two or three notes at a time. However, when stimuli become more complex, resembling real melodies, the persistence of pitch verticality is less clear. For ‘real’ melodies, shape descriptions are often used, such as arches, curves and slides [12].

In this paper, we investigate shape descriptions through a sound-tracing approach. This was done by asking people to listen to melodic excerpts and then move their body as if their movement was creating the sound. The aim is to answer the following research questions:

1. How do people present melodic contour as shape?
2. How important is vertical movement in the representation of melodic contour in sound-tracing?
3. Are there any differences in sound-tracings between genders, age groups and levels of musical experience?
4. How can we understand the metaphors in sound-tracings and quantify them from the data obtained?

The paper starts with an overview of related research, before the experiment and various analyses are presented and discussed.

2. Background

Drawing melodies as lines has seemed intuitive across different geographies and time periods, from the Medieval neumes to contemporary graphical scores. Even the description of melodies as lines enumerates some of their key properties: continuity, connectedness and appearance as a shape. Most musical cultures in the world are predominantly melodic, which means that the central melodic line is important for memorability. Melodies display several integral patterns of organization and time-scales, including melodic ornaments, motifs, repeating patterns, themes and variations. These are all devices for organizing melodic patterns and can be found in most musical cultures.

2.1. Melody, Prosody and Memory

Musical melodies may be thought of as closely related to language. For example, prosody, which can also be described as ‘speech melody’, is essential for understanding affect in spoken language. Musical and linguistic experiences with melody can often influence one another [13]. Speech melodies and musical melodies are differentiated on the basis of variance of intervals, delineation and discrete pitches as scales [14]. While speech melodies show more diversity in intonation, there is less diversity in prosodic contours internally within a language. Analysis of these contours is used computationally for the recognition of languages, speakers and dialects [15].

It has been argued that tonality makes musical melodies more memorable than speech melodies [16], while contour makes them more recognizable, especially in unfamiliar musical styles [17]. Dowling et al. suggest that adults use contour to recognize unfamiliar melodies, even when they have been transposed or when intervals are changed [16]. There is also neurological evidence supporting the idea that contour memory is independent of absolute pitch location [18]. Early research in contour memory and recognition demonstrated that the acquisition of memory for melodic contour in infants and children precedes memory for intonation [17,19–22]. Melodic contour is also described as a ‘coarse-grained’ schema that lacks the detail of musical intervals [14].

2.2. Melodic Contour

Contour is often used to refer to sequences of up-down movement in melodies, but several other terms touch upon the same idea in different ways. Shape, for example, is more generally used to refer to overall melodic phrases. Adams uses the terms contour and shape interchangeably [23] and also adds melodic configuration and outline to the mix of descriptors. Tierney et al. discuss the biological origins and commonalities of melodic shape [24]. They also note the predominance of arch-shaped melodies in music, the long duration of the final notes of phrases, and the biases towards smooth pitch contours. The idea of shapes has also been used to analyze the melodies of infant cries [25]. Motif, on the other hand, often refers to a unit of melody that can be operated upon to create melodic variation, such as transposition, inversion, etc. Yet another term is melodic chunk, which is sometimes used to refer to the mnemonic properties of melodic units, while museme is used to indicate instantaneous perception. Of all these terms, we will use contour for the remainder of this paper.

2.3. Analyzing Melodic Contour

There are numerous analysis methods that can be used to study melodic contour, and they may be briefly divided into two main categories: signal-based or symbolic representations. When the contour analysis uses a signal-based representation, a recording of the audio is analyzed with computational methods to extract the melodic line, as well as other temporal and spectral features of the sound. The symbolic representations may start from notated or transcribed music scores and use the symbolic note representations for analysis. Similar to how we might whistle a short melodic excerpt to refer to a larger musical piece, melodic dictionaries have been created to index musical pieces [26]. Such indexes merit a thorough analysis of contour typologies, and several contour typologies were created to this end [23,27,28]. Contour typology methods are often developed from symbolic representations and notated as discrete pitch items. Adams's method for contour typology analysis, for example, codes the initial and final pitches of the melodies as numbers [23]. Parsons's typology, on the other hand, uses note directions and their intervals as the units of analysis [26]. There are also examples of matrix comparison methods that code pitch patterns [27]. A comparison of these methods to perception and memory is carried out in [29,30], suggesting that the information-rich models do better than more simplistic ones. Perceptual responses to melodic contour changes have also been studied systematically [30–32], revealing differences between typologies and which ones come closest to resembling models of human contour perception.
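As an illustration of such direction-based symbolic coding, a minimal Python sketch of Parsons's typology could look as follows; the function and example are illustrative, not taken from [26].

def parsons_code(midi_notes):
    # '*' marks the first note; each interval is coded as
    # U (up), D (down) or R (repeat)
    code = ['*']
    for prev, cur in zip(midi_notes, midi_notes[1:]):
        code.append('U' if cur > prev else 'D' if cur < prev else 'R')
    return ''.join(code)

# The opening of 'Twinkle, Twinkle, Little Star' yields *RURURD:
print(parsons_code([60, 60, 67, 67, 69, 69, 67]))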

The use of symbolic representations makes it easier to perform systematic analysis and modification of melodic music. While such systematic analysis works well for some types of pre-notated music, it is more challenging for non-notated or non-Western music. For such non-notated musics, the signal-based representations may be a better solution, particularly when it comes to providing representations that more accurately describe continuous contour features. Such continuous representations (as opposed to more discrete, note-based representations) allow the extraction of information from the actual sound signal, giving a generally richer dataset from which to work. The downside, of course, is that signal-based representations tend to be much more complex, and hence more difficult to generalize from.

In the field of music perception and cognition, the use of symbolic music representations and computer-synthesized melodic stimuli has been most common. This is the case even though the ecological validity of such stimuli may be questioned. Much of the previous research into the perception of melodic contour also suffers from a lack of representation of non-Western and non-classical musical styles, with some notable exceptions such as [33–35].

Much of the recent research into melodic representations is found within the music information retrieval community. Here, the extraction of melodic contour and contour patterns directly from the signal is an active research topic, and efficient algorithms for extraction of the primary melody have been tested and compared in the MIREX (Music Information Retrieval Evaluation Exchange) competitions for several years.

Melodic contour is also used to describe the instrumentation of music from audio signals, for example in [36,37]. It is also interesting to note that melodic contour is used as the first step to identify musical structure in styles such as Indian classical music [38] and flamenco [39].

2.4. Pitch and the Vertical Dimension

As described in Section 1, the vertical dimension (up-down) is a common way to describe pitch contours. This cross-modal correspondence has been demonstrated in infants [40], who show preferences for the concurrence of ‘rising’ auditory pitch, visuospatial height, and visual ‘sharpness’. The association with visuospatial height is elaborated further with the SMARC (Spatial-Musical Association of Response Codes) effect [11]: participants show shorter response times when lower pitches co-occur with left or bottom response codes, and when higher pitches co-occur with right or top response codes. A large body of work tries to tease apart the nuances of the suggested effect. Suggested contributing factors include the physical layout of instruments and the left-to-right reading and writing bias of most participants [41].

The concepts of contour rely upon pitch height being a key feature of our melodic multimodal experience. Even the rendering of pitch in graphical formats plays on this persistent metaphor. Eitan brings out the variety of metaphors for pitch quality descriptions, suggesting that up and down might only be one of the ways in which cross-modal correspondence with pitch appears in different languages [9,10]. Many of the tendencies suggested by the SMARC effect are less pronounced when more, and more complex, stimuli appear. These have been tested in memory or matching tasks, rather than by asking people to elicit the perceived contours. The SMARC effect may here be seen in combination with the STEARC (Spatial-Temporal Association of Response Codes) effect, which states that timelines are more often perceived as horizontally-moving objects. In combination, these two effects may explain the overwhelming representation of vertical pitch on timelines. The end result is that we now tend to think of melodic representation mostly in line-graph-based terms, along the lines of what Adams ([23], p. 183) suggested:

There is a problem of the musical referents of the terms. As metaphoric depictions, most of these terms are more closely related to the visual and graphic representations of music than to its acoustical and auditory characteristics. Indeed, word-list typologies of melodic contour are frequently accompanied by ‘explanatory’ graphics.

This ‘problem’ of visual metaphors, however, may actually be seen as an opportunity to explore aspects of multimodal perception that could not be understood at the time.

2.5. Embodiment and Music

The accompaniment of movement to music is now understood as an important phenomenon in music perception and cognition [42]. Research studying the close relationship between sound and movement has shed light on the mechanisms of understanding action as sound [43] and sound as action [44,45]. Cross-modal correspondence involves a tight interactive loop in which the body mediates both perceptual and performative roles [46,47]. Some of these interactions show themselves in the form of motor cortex activation when merely listening to music [48]. This has further led to empirical studies of how music and body movement share a common structure that affords equivalent and universal emotional expression [49]. Mazzola et al. have also worked on a topological understanding of musical space and the topological dynamics of musical gesture [50].

Studies of Hindustani music show that singers use a wide variety of movements and gestures that accompany spontaneous improvisation [4,51,52]. These movements are culturally codified; they appear in the performance space to aid improvisation and musical thought, and they also convey this information to the listener. The performers also demonstrate a variety of imaginary ‘objects’ with various physical properties to illustrate their musical thought.

Other examples of research on body movement and melody include Huron's studies of how eyebrow height accompanies singing as a cue to melodic height [53], and studies suggesting that arch-shaped melodies in particular have common biological origins related to motor constraints [24,54].

2.6. Sound-Tracings

Sound-tracing studies aim at analyzing spontaneous renderings of melodies as movement, capturing the instantaneous multimodal associations of the participants. Typically, subjects are asked to draw (or trace) a sound example or short musical excerpt as they are listening. Several of these studies have been carried out with digital tablets as the transducer or the medium [2,44,55–59]. One limitation of tablets is that the canvas of the rendering space is small. Furthermore, the dimensionality does not evolve over time and represents a narrow bandwidth of possible movements.

An alternative to tablet-based sound-tracing setups is full-body motion capture. This may be seen as a variation of ‘air performance’ studies, in which participants try to imitate the sound-producing actions of the music to which they listen [2]. Nymoen et al. carried out a series of sound-tracing studies focusing on movements of the hands [60,61], elaborating several feature extraction methods for sound-tracing as a methodology. The current paper is inspired by these studies, but extends the methodology to full-body motion capture.

3. Sound-Tracing of Melodic Phrases

3.1. Stimuli

Based on the above considerations and motivations, we designed a sound-tracing study of melodic phrases. We decided to use melodic phrases from vocal genres that have a tradition of singing without words. Vocal phrases without words were chosen so as to not introduce lexical meaning as a confounding variable. Leaving out instruments also avoids the problem of subjects having to choose between different musical layers in their sound-tracing.

The final stimulus set consists of four different musical genres and four stimuli for each genre. The musical genres selected are: (1) Hindustani (North Indian) music, (2) Sami joik, (3) scat singing, (4) Western classical vocalize. The melodic fragments are phrases taken from real recordings, to retain melodies within their original musical context. As can be seen in the pitch plots in Figure 1, the melodies are of varying durations, with an average of 4.5 s (SD = 1.5 s). The Hindustani and joik phrases are sung by male vocalists, whereas the scat and vocalize phrases are sung by female vocalists. This is reflected in the pitch range of each phrase, as seen in Figure 1. The Hindustani and joik melodies in this stimulus set are mainly sung in a strong chest voice. The scat vocals are sung with a transition voice from chest to head. The vocalizes in this set are sung by a soprano, predominantly in the head register. The Hindustani and vocalize samples have one dominant vowel that is used throughout the phrase. The scat singing examples use traditional ‘shoobi-doo-wop’ syllables, and the joik examples in this set predominantly contain syllables such as ‘la-la-lo’.

To investigate the effects of timbre, we decided to create a ‘clean’ melody representation of each fragment. This was done by running the sound files through an autocorrelation algorithm to create phrases that accurately resemble the pitch content, but without the vocal, timbral and vowel content of the melodic stimulus. These 16 re-synthesized sounds were added to the stimulus set, thus obtaining a total of 32 sound stimuli (summarized in Table 1).
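The re-synthesis procedure is not detailed here beyond the autocorrelation-based pitch extraction; the sketch below shows one plausible approach, rendering an extracted pitch contour as a pure sine tone. It assumes a fully voiced contour sampled at a fixed frame rate, and all names are illustrative.

import numpy as np

def resynthesize(f0, frame_rate=100, sr=44100):
    # f0: pitch contour in Hz, one value per analysis frame
    t_frames = np.arange(len(f0)) / frame_rate
    t_audio = np.arange(0, t_frames[-1], 1 / sr)
    f0_audio = np.interp(t_audio, t_frames, f0)   # upsample to audio rate
    phase = 2 * np.pi * np.cumsum(f0_audio) / sr  # integrate frequency
    return 0.5 * np.sin(phase)                    # melody as a clean tone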

Figure 1. Pitch plots of all the 16 melodic phrases used as experiment stimuli, from each genre. The x-axis represents time in seconds, and the y-axis represents notes. The extracted pitches were re-synthesized to create a total of 32 melodic phrases used in the experiment.

Table 1. An overview of the 32 different stimuli: four phrases from each musical genre, all of which were presented in both normal and re-synthesized versions.

Type Hindustani Joik Scat Vocalize

Normal          4  4  4  4
Re-synthesized  4  4  4  4

3.2. Subjects

A total of 32 subjects (17 female, 15 male) were recruited, with a mean age of 31 years (SD = 9 years). The participants were mainly university students and employees, both with and without musical training. Their musical experience was quantified using the OMSI (Ollen Musical Sophistication Index) questionnaire [62], and they were also asked about their familiarity with the musical genres and their experience with dancing. The mean OMSI score was 694 (SD = 292), indicating that the general musical proficiency in this dataset was on the higher side. The average familiarity with Western classical music was 4.03 out of a possible 5 points, 3.25 for jazz music, 1.87 for joik, and 1.71 for Indian classical music. Thus, two genres (vocalize and scat) were more familiar than the two others (Hindustani and joik). All participants provided their written consent for inclusion before they participated in the study, and they were free to withdraw during the experiment. The study obtained ethical approval from the Norwegian Centre for Research Data (NSD), with the project code 49258 (approved on 22 August 2016).

3.3. Procedure

Each subject performed the experiment alone, and the total duration was around 10 min. They were instructed to move their hands as if their movement was creating the melody. The use of the term creating, instead of representing, is purposeful, as in earlier studies [60,63], to avoid the act of playing or singing.

The subjects could stand freely anywhere in the room and face whichever direction they liked, although nearly all of them faced the speaker and chose to stand in the center of the lab. The room lighting was dimmed to help the subjects feel more comfortable to move as they pleased.

The sounds were played at a comfortable listening level through a Genelec 8020 speaker, placed 3 m in front of the subjects. Each session consisted of an introduction, two example sequences, 32 trials and a conclusion, as sketched in Figure 2. Each melody was played twice with a 2-s pause in between. During the first presentation, the participants were asked to listen to the stimulus, while during the second presentation, they were asked to trace the melody. A long beep indicated the first presentation of a stimulus, while a short beep indicated its repetition. All the instructions and required guidelines were recorded and played back through the speaker so as to not interrupt the flow of the experiment.

Figure 2. The experiment flow, with an approximate total duration of 10 min.

The subjects' motion was recorded using an infrared marker-based motion capture system from Qualisys AB (Gothenburg, Sweden), with 8 Oqus 300 cameras surrounding the space (Figure 3a) and one regular video camera (Canon XF105, Tokyo, Japan) for reference. Each subject wore a motion capture suit with 21 reflective markers attached to the joints (Figure 3b). The system captured at a rate of 200 Hz. The data were post-processed in the Qualisys Track Manager software (QTM, v2.16, Qualisys AB, Gothenburg, Sweden), which included labeling of markers and removal of ghost-markers (Figure 3c). We used polynomial interpolation to gap-fill the marker data where needed. The post-processed data were exported to Python (v2.7.12) and MATLAB (R2013b, MathWorks, Natick, MA, USA) for further analysis. Here, all of the 10-min recordings were segmented using automatic windowing, and each of the segments was manually annotated for the analysis presented in Section 6.
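As a minimal stand-in for the polynomial gap-filling performed in QTM, the sketch below fills occluded frames of a single marker coordinate by linear interpolation; the function name and data layout are our own.

import numpy as np

def gap_fill(x):
    # x: one coordinate of a marker over time, with NaN for occluded frames
    t = np.arange(len(x))
    ok = ~np.isnan(x)
    return np.interp(t, t[ok], x[ok])  # linear fill across each gap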

Figure 3. (a) The motion capture lab used for the experiments. (b) The subjects wore a motion capture suit with 21 reflective markers. (c) Screenshot of a stick-figure after post-processing in the Qualisys Track Manager software (QTM).

4. Analysis

Even though full-body motion capture was performed, the following analysis only considers data from the right and left hand markers. Marker occlusions from six of the subjects were difficult to account for in the manual annotation process, so only data from 26 participants were used in the analysis presented in Section 5. This analysis is done using comparisons of means and distribution patterns. The occlusion problems were easier to tackle in the automatic analysis, so the analysis presented in Section 6 was performed on data from all 32 participants.

4.1. Feature Selection from Motion Capture Data

Table 2 shows a summary of the features extracted from the motion capture data (a code sketch of selected features follows the table). Vertical velocity is calculated as the first derivative of the z-axis (vertical motion) of each tracing over time. ‘Quantity of motion’ is a dimensionless quantity representing the overall amount of motion in any direction from frame to frame. Hand distance is calculated as the Euclidean distance between the (x,y,z) coordinates of the two hand markers. We also calculate the sample-wise distance traveled by each hand marker.

Table 2. The features extracted from the motion capture data.

Motion Features Description

1  VerticalMotion        z-axis coordinates at each instant of each hand
2  Range                 (Min, Max) tuple for each hand
3  HandDistance          The Euclidean distance between the 2D coordinates of each hand
4  QuantityOfMotion      The sum of absolute velocities of all the markers
5  DistanceTraveled      Cumulative Euclidean distance traveled by each hand per sample
6  AbsoluteVelocity      Uniform linear velocity of all dimensions
7  AbsoluteAcceleration  The derivative of the absolute velocity
8  Smoothness            The number of knots of a quadratic spline interpolation fitted to each tracing
9  VerticalVelocity      The first derivative of the z-axis in each participant's tracing
10 CubicSpline10Knots    10 knots fitted to a quadratic spline for each tracing
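As an illustration, the sketch below computes three of the features in Table 2 from hand-marker arrays. The array shapes and the 200 Hz sampling rate follow the recording setup, while the function and variable names are our own.

import numpy as np

def vertical_velocity(z, sr=200):
    # first derivative of the vertical (z) coordinate of one hand
    return np.gradient(z) * sr

def quantity_of_motion(markers, sr=200):
    # markers: (frames, n_markers, 3) positions; QoM sums the
    # frame-to-frame speed of all markers, regardless of direction
    speeds = np.linalg.norm(np.diff(markers, axis=0), axis=2) * sr
    return speeds.sum(axis=1)

def hand_distance(lh, rh):
    # lh, rh: (frames, 3) palm positions; Euclidean distance per frame
    return np.linalg.norm(rh - lh, axis=1)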

4.2. Feature Selection from Melodic Phrases

Pitch curves for the melodic phrases were calculated using the autocorrelation algorithm in Praat (v6.0.30, University of Amsterdam, The Netherlands), eliminating octave jumps. These pitch curves were then exported for further analysis together with the motion capture features in Python. Based on contour analysis from the literature [23,30,64], we extracted three different melodic features, as summarized in Table 3 (a code sketch of the pitch extraction and these features follows the table).

Table 3. The features extracted from the melodic phrases.

Melodic Features Description

1  SignedIntervalDirection    Interval directions (up/down) calculated for each note
2  InitialFinalHighestLowest  Four essential notes of a melody: initial, final, highest, lowest
3  SignedRelativeDistances    Feature 1 combined with the relative distance of each successive note from the next, considering only the number of semitones for each successive change
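The sketch below illustrates both steps: pitch extraction with Praat's autocorrelation method through the parselmouth Python wrapper (an assumed dependency, with illustrative parameter values rather than the study's settings), and the three contour features of Table 3 computed from a list of MIDI note numbers.

import numpy as np
import parselmouth  # Python interface to Praat (assumed available)

def extract_pitch_curve(wav_path):
    # Praat autocorrelation pitch tracking; unvoiced frames are dropped
    snd = parselmouth.Sound(wav_path)
    pitch = snd.to_pitch_ac(time_step=0.01, pitch_floor=75, pitch_ceiling=1000)
    f0 = pitch.selected_array['frequency']
    return f0[f0 > 0]

def melodic_features(notes):
    # notes: MIDI note numbers of the transcribed phrase
    directions = np.sign(np.diff(notes)).astype(int)           # feature 1
    essential = (notes[0], notes[-1], max(notes), min(notes))  # feature 2
    rel_distances = np.diff(notes).astype(int)                 # feature 3
    return directions, essential, rel_distances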

5. Analysis of Overall Trends

5.1. General Motion Contours

One global finding from the sound-tracing data is a clear arch shape in the vertical motion component over time. Figure 4 shows the average contours calculated from the z-values of the motion capture data of all subjects for each of the melodic phrases. It is interesting to note a clear arch-like shape in all of the graphs. This fits with suggestions of a motor constraint hypothesis, which holds that melodic phrases in general have arch-like shapes [24,54].

In our study, however, the melodies have several different types of shapes (Figure 1), so this may not be the best explanation for the arch-like motion shapes. A better explanation may be that the subjects would start and end their tracing from a resting position in which the hands are lower than during the tracing, thus leading to the arch shapes seen in Figure 4.

Figure 4. Average contours plotted from the vertical motion capture data in mm (z-axis) for each of the melodies (red for the original and blue for the re-synthesized versions of the melodies). The x-axis represents normalized time, and the y-axis represents aggregated tracing height for all participants.

5.2. Relationship between Vertical Movement and Melodic Pitch Distribution

To investigate more closely the relationship between vertical movement and the distribution of the pitches in the melodic fragments, we may start by considering the genre differences. Figure 5 presents the distribution of pitches for each genre in the stimulus set. These are plotted on a logarithmic frequency scale to represent the perceptual relationships between them. In the plot, each of the four genres is represented by its individual distribution. The color distinction is based on whether the melodic phrase has one direction or many: phrases close to being ascending, descending, or stationary are coded as not changing direction. We see that in all of these conditions, the vocalize phrases in the dataset have the highest pitch profiles and the Hindustani phrases have the lowest.

If we then turn to the vertical dimension of the tracing data, we would expect to see a similar distribution between genres as that for the pitch heights of the melodies. Figure 6 shows the distribution of motion capture data for all tracings, sorted into the four genres. Here, the distribution of maximum z-values of the motion capture data is quite even across genres.

Figure 5. Pitch distribution for each genre based on mean pitches in each phrase. If movement tracings were an accurate representation of absolute pitch height, movement plots should resemble this plot.

Figure 6. Plots of the maximum height of either hand for each genre. The left/pink distributions are from female subjects, while the right/blue distributions are from male subjects. Each half of each section of the violin plot represents the probability distribution of the samples. The black lines represent each individual data point.

5.3. Direction Differences

The direction differences in the tracings can be studied by calculating the coefficients of variation of movement along all three axes for both the left (LH) and right (RH) hands.

These coefficients are found to be LHvar(x,y,z) = (63.7, 45.7, 26.6) and RHvar(x,y,z) = (56.6, 87.8, 23.1), suggesting that the amount of dispersion on the z-axis (the vertical) is the most consistent, while a wide array of representations is used along the x- and y-axes.

The average standard deviations for the different dimensions were found to be LHstd = (99 mm, 89 mm, 185 mm) and RHstd = (110 mm, 102 mm, 257 mm). This means that most variation is found in the vertical movement of the right hand, indicating an effect of right-handedness among the subjects.
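Such per-axis dispersion measures can be computed as in this minimal sketch; the names are illustrative.

import numpy as np

def axis_dispersion(hand):
    # hand: (frames, 3) positions of one hand marker
    std = hand.std(axis=0)                      # per-axis SD in mm
    cv = 100 * std / np.abs(hand.mean(axis=0))  # coefficient of variation, %
    return std, cv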

5.4. Individual Subject Differences

Plots of the distributions of the quantity of motion (QoM) for each subject across all stimuli show a large degree of variation (Figure 7). Subjects 4 and 12, for example, show very little diversity in the average QoM of their tracings, whereas Subjects 2 and 5 show a large spread. There are also other participants, such as Subject 22, who move very little on average in all their tracings. Comparing the two types of representations (original and re-synthesized stimuli), we see a generally larger diversity of movement for the regular melodies than for the synthesized ones. However, the difference between synthesized and original melodies was not statistically significant.

Figure 7. Distribution of the average quantity of motion for each participant. Left/red distributions are of the synthesized stimuli, while the right/green distributions are of the normal recordings.

5.5. Social Box

Another general finding from the data that is not directly related to the question at hand, but that is still relevant for understanding the distribution of data, is what we call a shared ‘social box’ among the subjects. Figure 6 shows that the maximum tracing heights of the male subjects were higher than those of the female subjects. This is as expected, but a plot of the ‘tracing volume’ (the spatial distribution in all dimensions) reveals that a comparably small volume was used to represent most of the melodies (Figure 8). Qualitative observation of the recordings reveals that shorter subjects were more comfortable stretching their hands out, while taller participants tended to use a more restricted space relative to their height. This happened without any physical constraints on their movements, and without any instructions regarding the volume to be covered by the tracings.

Figure 8. A three-dimensional plot of all sound-tracings for all participants reveals a fairly constrained tracing volume, a kind of ‘social box’ defined by the subjects. Each color represents the tracings of a single participant, and the numbers along each axis are millimetres.

It is almost as if the participants wanted to fill up an invisible ‘social box,’ serving as the collective canvas on which everything can be represented. This may be explained by the fact that we share a world that has the same dimensions: doors, cars, ceilings, and walls are common to us all, making us understand our body as a part of the world in a particular way. In the data from this experiment, we explore this by analyzing the range of motion relative to the heights of the participants through linear regression. The scaled movement range in the horizontal plane is represented in Figure 9, which shows that the scaled range decreases steadily as the height of the participants increases. Shorter participants occupy a larger area in the horizontal plane, while taller participants occupy a relatively smaller area. The R2 coefficient of the regression is found to be 0.993, meaning that participant height explains nearly all of the variance in the scaled range.
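Such a regression can be reproduced with a few lines, assuming arrays of participant heights and scaled horizontal ranges; the variable names are hypothetical.

from scipy import stats

# heights: participant heights; ranges: scaled (x, y) movement ranges
fit = stats.linregress(heights, ranges)
print(fit.slope, fit.intercept, fit.rvalue ** 2, fit.pvalue)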

Figure 9. Regression plot of the heights of the participants against scaled (x, y) ranges. There is a clearly decreasing trend for the scaled range of movements in the horizontal plane: the taller a participant, the lower their scaled range.

5.6. An Imagined Canvas

In a two-dimensional tracing task, such as with pen on paper, the ‘canvas’ of the tracing is both finite and visible all the time. Such a canvas is experienced also in tasks performed with a graphical tablet, even if the line of the tracing is not visible. In the current experiment, however, the canvas is three-dimensional and invisible, and it has to be imagined by the participant. Participants who trace by moving just one hand at a time seem to be using the metaphor of a needle sketching on moving paper, much like an analogue ECG (electrocardiogram) machine. Depending on the size of the tracing, the participants would have to rotate or translate their bodies to move within this imagined canvas. We observe different strategies for how they reach beyond the constraints of their kinesphere, the maximum volume one can reach without moving to a new location. They may step sideways, representing a flat canvas placed before them, or rotate, representing a cylindrical canvas around their bodies.

6. Mapping Strategies

Through visual inspection of the recordings, we identify a limited number of strategies used in the sound-tracings. We therefore propose six schemes of representation that encompass most of the variation seen in the hands' movement, as illustrated in Figure 10 and summarized as:

1. One outstretched hand, changing the height of the palm
2. Two hands stretching or compressing an “object”
3. Two hands symmetrically moving away from the center of the body in the horizontal plane
4. Two hands moving together to represent holding and manipulating an object
5. Two hands drawing arcs along an imaginary circle
6. Two hands following each other in a percussive pattern

(1) Dominant hand (2) Changing inter-palm distance (3) Hands as mirror images

(4) Manipulating small objects (5) Drawing arcs along circles (6) Asymmetric percussive action

Figure 10. Motion history images exemplifying the six dominant sound-tracing strategies. The black lines from the hands of the stick figures indicate the motion traces of each tracing.

These qualitatively derived strategies were the starting point for setting up an automatic extraction of features from the motion capture data. The pipeline for this quantitative analysis consists of the following steps (a sketch of the windowing step follows the list):

1. Feature selection: Segment the motion capture data into a six-column feature vector containing the (x,y,z) coordinates of the right palm and the left palm, respectively.

2. Calculation of quantity of motion (QoM): Calculate the average of the vector magnitude for each sample.

3. Segmentation: Trim data using a sliding window of 1100 samples in size. This corresponds to 5.5 s, to accommodate the average duration of 4.5 s of the melodic phrases. The hop size for the windows is 10 samples, to obtain a large set of windowed segments. The segments that have the maximum mean values are then separated out to get a set of sound-tracings.

4. Feature analysis: Calculate the features in Table 4 for each segment.

5. Thresholding: Minimize the six numerical criteria by thresholding the segments based on two times the standard deviation of each of the computed features.

6. Labeling and separation: Obtain tracings that can be classified as dominantly belonging to one of the six strategy types.
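A sketch of the windowing in step 3 could look as follows, assuming a 200 Hz quantity-of-motion curve for one recording; whereas the full pipeline extracts one maximum-activity segment per stimulus, this sketch returns a single best window.

import numpy as np

def best_segment(qom, win=1100, hop=10):
    # slide a 5.5 s window (1100 samples at 200 Hz) over the QoM curve
    # and keep the window with the highest mean activity
    starts = np.arange(0, len(qom) - win + 1, hop)
    means = np.array([qom[s:s + win].mean() for s in starts])
    best = starts[np.argmax(means)]
    return best, best + win  # sample indices of the extracted tracing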

Table 4. Quantitative motion capture features that match the qualitatively selected strategies. QoM, quantity of motion.

# Strategy Distinguishing Features Description Mean SD

1  Dominant hand as needle         Right-hand QoM much greater than left-hand QoM             QoM(LHY) ≪ QoM(RHY)         0.50  0.06
2  Changing inter-palm distance    Root-mean-squared difference of left and right hands in x  RMS(LHX) − RMS(RHX)         0.64  0.12
3  Lateral symmetry between hands  Nearly constant difference between left and right hands    RHX − LHX = C               0.34  0.11
4  Manipulating a small object     Right and left hands follow similar trajectories           RH(x,y,z) = LH(x,y,z) + C   0.72  0.07
5  Drawing arcs along circles      Fit of (x,y,z) of left and right hands to a sphere         x^2 + y^2 + z^2 = r^2       0.17  0.04
6  Percussive asymmetry            Dynamic time warp of (x,y,z) of left and right hands       dtw(RH(x,y,z), LH(x,y,z))   0.56  0.07
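Of the features in Table 4, the dynamic time warp used for Strategy 6 is the least standard; a plain dynamic-programming version is sketched below (which particular DTW implementation was used is not specified here).

import numpy as np

def dtw_distance(a, b):
    # a, b: (frames, 3) hand trajectories; returns the DTW alignment cost
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]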

After running the segmentation and labeling on the complete data set, we performed a t-test to determine whether there is a significant difference between the labeled samples and the other samples. The results, summarized in Table 5, show that the selected features demonstrate the dominance of one particular strategy for many tracings. All except Feature 4 (manipulating a small ‘object’) show significant results compared to all other tracing samples for automatic annotation of hand strategies. While this feature cannot be extracted reliably with the aforementioned heuristic, the simple Euclidean distance between the two hands proves effective for explaining this strategy.

Table 5. Significance testing for each feature against the rest of the features.

Strategy # p-Value

Strategy 1 vs. rest  0.003
Strategy 2 vs. rest  0.011
Strategy 3 vs. rest  0.005
Strategy 4 vs. rest  0.487
Strategy 5 vs. rest  0.003
Strategy 6 vs. rest  0.006

In Figure 11, we see that hand distance might be an effective way to compare different hand strategies. Strategy 2 performs best in testing for separability. The hand distance for Strategy 4, for example, is significantly lower than for the rest. This is because this tracing style represents people who use the metaphor of an imaginary object to represent music. This imaginary object seldom changes its physical properties: its length, breadth and general size are usually maintained.

Taking demographics into account, we see that the distribution of the female subjects' tracings for vocalizes has a much wider peak than for the rest of the genres. We also observe that women use a wider range of hand strategies than men (Figure 11). Furthermore, Strategy 5 (drawing arcs) is performed entirely by women. The representation of music as objects is also more prominent in women, as is the use of asymmetrical percussive motion. Comparing the same distribution of genders for genres, we do not find a difference in overall movement or general body use between the genders. If anything, the ‘social box effect’ makes the height differences between genres smaller than they are.

In Figure 12, we visualize the use of these hand strategies for every melody by all the participants. Strategy 2 is used in 206 tracings, whereas Strategy 5 is used in only 8 tracings. Strategies 1, 3, 4 and 6 are used 182, 180, 161 and 57 times, respectively. Through the heat map in Figure 12, we can also find some outliers for the strategies that are used more infrequently. For example, we see that Melodies 4, 13 and 16 show especially dominant use of some hand strategies.

Figure 11. Hand distance as a feature to discriminate between tracing strategies.

Figure 12. Heat map of representation of hand strategy per melody.

7. Discussion

In this study, we have analyzed people's tracings of melodies from four musical genres. Although much of the literature points to correlations between melodic pitch and vertical movement, our findings show a much more complex picture. For example, relative pitch height appears to be much more important than absolute pitch height. People seem to think about vocal melodies as actions, rather than interpreting the pitches purely in one dimension (vertical) over time. The analysis of contour features from the literature shows that, when melodies are traced through an allocentric representation of the listening body, the notions of pitch height representations matter much less than previously thought.

Therefore, contour features cannot be extracted merely by cross-modal comparisons of two data sets. We propose that other strategies can be used for contour representations, but this is something that will have to be developed further in future research.

According to the theory of gestural affordances of musical sound [65], several gestural representations can exist for the same sound, but there is a limit to how much they can be manipulated. Our data support this idea of a number of possible and overlapping action strategies. Several spatial and visual metaphors are used by a wide range of people. One interesting finding is that there are gender differences between the representations of the different sound-tracing strategies. Women seem to show a greater diversity of strategies in general, and they also use object-like representations more often than men.

We expected musical genre to have some impact on the results. For example, given that the Western vocalizes are sung with a higher pitch range than the rest of the genres in this dataset (Figure 5), it is interesting to note that, on average, people do represent vocalize tracings spatially higher than the rest of the genres. We also found that the melodies with the most vibrato (melodies 14 and 16 in Figure 5) are represented with the largest changes of acceleration in the motion capture data. This implies that although the pitch deviation in this case is not so large, the perception of a moving melody is much stronger by comparison to other melodies that have larger changes in pitch. It could be argued that melodies 14 and 16 contain motivic repetition that causes this pattern. However, repeating motifs are just as much a part of melodies 6 and 8 (joik). The values for these melodies hold for tracings of both the original and the synthesized phrases, so the effect of the vowels used in these melodies can also be ruled out. As seen in Figure 12, some melodies stand out for certain hand strategies. Melody 4 (Hindustani) is curiously highly represented as a small object. Melody 12 is overwhelmingly represented by symmetrical movements of both hands, while Melodies 8 and 9 are overwhelmingly represented by using one hand as the tracing needle.

We find it particularly interesting that subjects picked up on the idea of using small objects as a representation technique in their tracings. The use of small objects to represent melodies is well documented in Hindustani music [52,66–68]. However, the subjects' familiarity scores with Hindustani music were quite low, so familiarity cannot explain this interesting choice of representation in our study. Looking at the melodic features of melody 4, for example, it descends steadily in intervals until it ascends again and comes down the same intervals. This may be argued to resemble an object that smoothly slips down a slope, and could be a probable reason for the overwhelming object representation of this particular melody. In future studies, it would be interesting to see whether we can recreate this mapping with other melodies, or model melodies in terms of naturally occurring melodic shapes born out of physical forces interacting with each other.

It is worth noting that there are several limitations to the current experimental methodology and analysis. Any laboratory study of this kind presents subjects with an unfamiliar and non-ecological environment. The results are also to a large extent influenced by the general habitus of body use. Experience in dance, sign language, or musical traditions with simultaneous musical gestures (such as conducting) all play a part in the interpretation of music as motion. Despite these limitations, we do see a considerable amount of consistency between subjects.

8. Conclusions

The present study shows that there are consistencies in people's sound-tracings of the melodic phrases used in the experiment. Our main findings can be summarized as:

• There is a clear arch shape when looking at the averages of the motion capture data, regardless of the general shape of the melody itself. This may support the motor constraint hypothesis that has been used to explain the similar arch-like shape of sung melodies.

• The subjects chose between different strategies in their sound-tracings. We have qualitatively identified six such strategies and have created a set of heuristics to quantify them and test their reliability.

• There is a clear gender difference for some of the strategies. This was most evident for Strategy 4 (representing small objects), which women performed more than men.

• The ‘obscure’ strategy of representing melodies in terms of a small object, as is typical in Hindustani music, was also found in participants who had little or no exposure to this musical genre.

• The data show a tendency of moving within a shared ‘social box’. This may be thought of as an invisible space that people constrain their movements to, even without any exposure to the other participants' tracings. In future studies, it would be interesting to explore how constant such a space is, for example by comparing multiple recordings of the same participants over a period of time.

In future studies, we want to investigate all of these findings in greater detail. We are particularly interested in taking the rest of the body's motion into account. It would also be relevant to use the results from such studies in the creation of interactive systems, ‘reversing’ the process, that is, using tracing in the air as a method to retrieve melodies from a database. This could open up some exciting end-user applications and also be used as a tool for music performance.

Supplementary Materials: The following are available online at http://www.mdpi.com/2076-3417/8/1/135/s1. The supplementary archive consists of segmented motion tracings of 26 participants, annotated with participant number, melody traced, and hand strategy used. The melodies are numbered from 1 to 16, and the pitch data and sound stimuli are provided separately as well. More information and code can be provided upon request.

Acknowledgments: This work was partially supported by the Research Council of Norway through its Centres of Excellence scheme, Project Number 262762.

Author Contributions: T.K. and A.R.J. conceived and designed the experiments; T.K. performed the experiments and analyzed the data; both authors contributed analysis tools and wrote the paper.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Kennedy, M.; Rutherford-Johnson, T.; Kennedy, J. The Oxford Dictionary of Music, 6th ed.; Oxford University Press: Oxford, UK, 2013. Available online: http://www.oxfordreference.com/view/10.1093/acref/9780199578108.001.0001/acref-9780199578108 (accessed on 16 January 2018).
2. Godøy, R.I.; Haga, E.; Jensenius, A.R. Playing “Air Instruments”: Mimicry of Sound-Producing Gestures by Novices and Experts; International Gesture Workshop; Springer: Berlin/Heidelberg, Germany, 2005; pp. 256–267.
3. Kendon, A. Gesture: Visible Action as Utterance; Cambridge University Press: Cambridge, UK, 2004.
4. Clayton, M.; Sager, R.; Will, U. In time with the music: The concept of entrainment and its significance for ethnomusicology. In European Meetings in Ethnomusicology; ESEM Counterpoint 1; Romanian Society for Ethnomusicology: Bucharest, Romania, 2005; Volume 11, pp. 1–82.
5. Zbikowski, L.M. Musical gesture and musical grammar: A cognitive approach. In New Perspectives on Music and Gesture; Ashgate Publishing Ltd.: Farnham, UK, 2011; pp. 83–98.
6. Lakoff, G.; Johnson, M. Conceptual metaphor in everyday language. J. Philos. 1980, 77, 453–486.
7. Zbikowski, L.M. Metaphor and music theory: Reflections from cognitive science. Music Theory Online 1998, 4, 1–8.
8. Shayan, S.; Ozturk, O.; Sicoli, M.A. The thickness of pitch: Crossmodal metaphors in Farsi, Turkish, and Zapotec. Sens. Soc. 2011, 6, 96–105.
9. Eitan, Z.; Schupak, A.; Gotler, A.; Marks, L.E. Lower pitch is larger, yet falling pitches shrink. Exp. Psychol. 2014, 61, 273–284.
10. Eitan, Z.; Timmers, R. Beethoven's last piano sonata and those who follow crocodiles: Cross-domain mappings of auditory pitch in a musical context. Cognition 2010, 114, 405–422.
11. Rusconi, E.; Kwan, B.; Giordano, B.L.; Umilta, C.; Butterworth, B. Spatial representation of pitch height: The SMARC effect. Cognition 2006, 99, 113–129.
12. Huron, D. The melodic arch in Western folksongs. Comput. Musicol. 1996, 10, 3–23.

13. Fedorenko, E.; Patel, A.; Casasanto, D.; Winawer, J.; Gibson, E. Structural integration in language and music: Evidence for a shared system. Mem. Cognit. 2009, 37, 1–9.
14. Patel, A.D. Music, Language, and the Brain; Oxford University Press: New York, NY, USA, 2010.
15. Adami, A.G.; Mihaescu, R.; Reynolds, D.A.; Godfrey, J.J. Modeling prosodic dynamics for speaker recognition. In Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'03), Hong Kong, China, 6–10 April 2003; Volume 4.
16. Dowling, W.J. Scale and contour: Two components of a theory of memory for melodies. Psychol. Rev. 1978, 85, 341–354.
17. Dowling, W.J. Recognition of melodic transformations: Inversion, retrograde, and retrograde inversion. Percept. Psychophys. 1972, 12, 417–421.
18. Schindler, A.; Herdener, M.; Bartels, A. Coding of melodic gestalt in human auditory cortex. Cereb. Cortex 2012, 23, 2987–2993.
19. Trehub, S.E.; Becker, J.; Morley, I. Cross-cultural perspectives on music and musicality. Philos. Trans. R. Soc. Lond. B Biol. Sci. 2015, 370, 20140096.
20. Trehub, S.E.; Bull, D.; Thorpe, L.A. Infants' perception of melodies: The role of melodic contour. Child Dev. 1984, 55, 821–830.
21. Trehub, S.E.; Thorpe, L.A.; Morrongiello, B.A. Infants' perception of melodies: Changes in a single tone. Infant Behav. Dev. 1985, 8, 213–223.
22. Morrongiello, B.A.; Trehub, S.E.; Thorpe, L.A.; Capodilupo, S. Children's perception of melodies: The role of contour, frequency, and rate of presentation. J. Exp. Child Psychol. 1985, 40, 279–292.
23. Adams, C.R. Melodic contour typology. Ethnomusicology 1976, 20, 179–215.
24. Tierney, A.T.; Russo, F.A.; Patel, A.D. The motor origins of human and avian song structure. Proc. Natl. Acad. Sci. USA 2011, 108, 15510–15515.
25. Díaz, M.A.R.; García, C.A.R.; Robles, L.C.A.; Altamirano, J.E.X.; Mendoza, A.V. Automatic infant cry analysis for the identification of qualitative features to help opportune diagnosis. Biomed. Signal Process. Control 2012, 7, 43–49.
26. Parsons, D. The Directory of Tunes and Musical Themes; S. Brown: Cambridge, UK, 1975.
27. Quinn, I. The combinatorial model of pitch contour. Music Percept. Interdiscip. J. 1999, 16, 439–456.
28. Narmour, E. The Analysis and Cognition of Melodic Complexity: The Implication-Realization Model; University of Chicago Press: Chicago, IL, USA, 1992.
29. Marvin, E.W. A Generalized Theory of Musical Contour: Its Application to Melodic and Rhythmic Analysis of Non-Tonal Music and Its Perceptual and Pedagogical Implications. Ph.D. Thesis, University of Rochester, Rochester, NY, USA, 1988.
30. Schmuckler, M.A. Testing models of melodic contour similarity. Music Percept. Interdiscip. J. 1999, 16, 295–326.
31. Schmuckler, M.A. Melodic contour similarity using folk melodies. Music Percept. Interdiscip. J. 2010, 28, 169–194.
32. Schmuckler, M.A. Expectation in music: Investigation of melodic and harmonic processes. Music Percept. Interdiscip. J. 1989, 7, 109–149.
33. Eerola, T.; Himberg, T.; Toiviainen, P.; Louhivuori, J. Perceived complexity of Western and African folk melodies by Western and African listeners. Psychol. Music 2006, 34, 337–371.
34. Eerola, T.; Bregman, M. Melodic and contextual similarity of folk song phrases. Musicae Sci. 2007, 11, 211–233.
35. Eerola, T. Are the emotions expressed in music genre-specific? An audio-based evaluation of datasets spanning classical, film, pop and mixed genres. J. New Music Res. 2011, 40, 349–366.
36. Salamon, J.; Peeters, G.; Röbel, A. Statistical characterisation of melodic pitch contours and its application for melody extraction. In Proceedings of the ISMIR, Porto, Portugal, 8–12 October 2012; pp. 187–192.
37. Bittner, R.M.; Salamon, J.; Bosch, J.J.; Bello, J.P. Pitch contours as a mid-level representation for music informatics. In Proceedings of the Audio Engineering Society Conference: 2017 AES International Conference on Semantic Audio, Erlangen, Germany, 22–24 June 2017.
38. Rao, P.; Ross, J.C.; Ganguli, K.K.; Pandit, V.; Ishwar, V.; Bellur, A.; Murthy, H.A. Classification of melodic motifs in raga music with time-series matching. J. New Music Res. 2014, 43, 115–131.
39. Gómez, E.; Bonada, J. Towards computer-assisted flamenco transcription: An experimental comparison of automatic transcription algorithms as applied to a cappella singing. Comput. Music J. 2013, 37, 73–90.

40. Walker, P.; Bremner, J.G.; Mason, U.; Spring, J.; Mattock, K.; Slater, A.; Johnson, S.P. Preverbal infants are sensitive to cross-sensory correspondences: Much ado about the null results of Lewkowicz and Minar (2014). Psychol. Sci. 2014, 25, 835–836.
41. Timmers, R.; Li, S. Representation of pitch in horizontal space and its dependence on musical and instrumental experience. Psychomusicol. Music Mind Brain 2016, 26, 139–148.
42. Leman, M. Embodied Music Cognition and Mediation Technology; MIT Press: Cambridge, MA, USA, 2008.
43. Godøy, R.I. Motor-mimetic music cognition. Leonardo 2003, 36, 317–319.
44. Jensenius, A.R. Action-Sound: Developing Methods and Tools to Study Music-Related Body Movement. Ph.D. Thesis, University of Oslo, Oslo, Norway, 2007.
45. Jensenius, A.R.; Kvifte, T.; Godøy, R.I. Towards a gesture description interchange format. In Proceedings of the 2006 Conference on New Interfaces for Musical Expression, IRCAM—Centre Pompidou, Paris, France, 4–8 June 2006; pp. 176–179.
46. Gritten, A.; King, E. Music and Gesture; Ashgate Publishing, Ltd.: Hampshire, UK, 2006.
47. Gritten, A.; King, E. New Perspectives on Music and Gesture; Ashgate Publishing, Ltd.: Surrey, UK, 2011.
48. Molnar-Szakacs, I.; Overy, K. Music and mirror neurons: From motion to 'e'motion. Soc. Cognit. Affect. Neurosci. 2006, 1, 235–241.
49. Sievers, B.; Polansky, L.; Casey, M.; Wheatley, T. Music and movement share a dynamic structure that supports universal expressions of emotion. Proc. Natl. Acad. Sci. USA 2013, 110, 70–75.
50. Buteau, C.; Mazzola, G. From contour similarity to motivic topologies. Musicae Sci. 2000, 4, 125–149.
51. Clayton, M.; Leante, L. Embodiment in music performance. In Experience and Meaning in Music Performance; Oxford University Press: New York, NY, USA, 2013; pp. 188–207.
52. Rahaim, M. Musicking Bodies: Gesture and Voice in Hindustani Music; Wesleyan University Press: Middletown, CT, USA, 2012.
53. Huron, D.; Shanahan, D. Eyebrow movements and vocal pitch height: Evidence consistent with an ethological signal. J. Acoust. Soc. Am. 2013, 133, 2947–2952.
54. Savage, P.E.; Tierney, A.T.; Patel, A.D. Global music recordings support the motor constraint hypothesis for human and avian song contour. Music Percept. Interdiscip. J. 2017, 34, 327–334.
55. Godøy, R.I.; Haga, E.; Jensenius, A.R. Exploring Music-Related Gestures by Sound-Tracing: A Preliminary Study. 2006. Available online: https://www.duo.uio.no/bitstream/handle/10852/26899/Godxy_2006b.pdf?sequence=1&isAllowed=y (accessed on 16 January 2018).
56. Glette, K.H.; Jensenius, A.R.; Godøy, R.I. Extracting Action-Sound Features From a Sound-Tracing Study. 2010. Available online: https://www.duo.uio.no/bitstream/handle/10852/8848/Glette_2010.pdf?sequence=1&isAllowed=y (accessed on 16 January 2018).
57. Küssner, M.B. Music and shape. Lit. Linguist. Comput. 2013, 28, 472–479.
58. Roy, U.; Kelkar, T.; Indurkhya, B. TrAP: An interactive system to generate valid raga phrases from sound-tracings. In Proceedings of the NIME, London, UK, 30 June–3 July 2014; pp. 243–246.
59. Kelkar, T. Applications of Gesture and Spatial Cognition in Hindustani Vocal Music. Ph.D. Thesis, International Institute of Information Technology, Hyderabad, India, 2015.
60. Nymoen, K.; Godøy, R.I.; Jensenius, A.R.; Torresen, J. Analyzing correspondence between sound objects and body motion. ACM Trans. Appl. Percept. 2013, 10, 9.
61. Nymoen, K.; Caramiaux, B.; Kozak, M.; Torresen, J. Analyzing sound-tracings: A multimodal approach to music information retrieval. In Proceedings of the 1st International ACM Workshop on Music Information Retrieval with User-Centered and Multimodal Strategies (MIRUM '11), Scottsdale, AZ, USA, 30 November 2011; ACM: New York, NY, USA, 2011; pp. 39–44.
62. Ollen, J.E. A Criterion-Related Validity Test of Selected Indicators of Musical Sophistication Using Expert Ratings. Ph.D. Thesis, The Ohio State University, Columbus, OH, USA, 2006.
63. Nymoen, K.; Torresen, J.; Godøy, R.; Jensenius, A. A statistical approach to analyzing sound-tracings. In Speech, Sound and Music Processing: Embracing Research in India; Springer: Berlin/Heidelberg, Germany, 2012; pp. 120–145.
64. Schmuckler, M.A. Components of melodic processing. In Oxford Handbook of Music Psychology; Oxford University Press: Oxford, UK, 2009; p. 93.
65. Godøy, R.I. Gestural affordances of musical sound. In Musical Gestures: Sound, Movement, and Meaning; Routledge: New York, NY, USA, 2010; pp. 103–125.

66. Paschalidou, S.; Clayton, M.; Eerola, T. Effort in interactions with imaginary objects in Hindustani vocal music—Towards enhancing gesture-controlled virtual instruments. In Proceedings of the 2017 International Symposium on Musical Acoustics, Montreal, QC, Canada, 18–22 June 2017.
67. Paschalidou, S.; Clayton, M. Towards a sound-gesture analysis in Hindustani Dhrupad vocal music: Effort and raga space. In Proceedings of the ICMEM, University of Sheffield, Sheffield, UK, 23–25 March 2015; Volume 23, p. 25.
68. Pearson, L. Gesture in Karnatak Music: Pedagogy and Musical Structure in South India. Ph.D. Thesis, Durham University, Durham, UK, 2016.

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).


Paper IV

Evaluating a Collection of Sound-Tracing Data of Melodic Phrases

Tejaswinee Kelkar, Udit Roy, Alexander Refsum Jensenius. Published in Proceedings of the 19th International Society for Music Information Retrieval Conference, Paris, France, September 2018, pp. 74–81.


EVALUATING A COLLECTION OF SOUND-TRACING DATA OF MELODIC PHRASES

Tejaswinee Kelkar, RITMO, Dept. of Musicology, University of Oslo, [email protected]
Udit Roy, Independent Researcher, [email protected]
Alexander Refsum Jensenius, RITMO, Dept. of Musicology, University of Oslo, [email protected]

ABSTRACT

Melodic contour, the ‘shape’ of a melody, is a common way to visualize and remember a musical piece. The purpose of this paper is to explore the building blocks of a future ‘gesture-based’ melody retrieval system. We present a dataset containing 16 melodic phrases from four musical styles and with a large range of contour variability. This is accompanied by full-body motion capture data of 26 participants performing sound-tracing to the melodies. The dataset is analyzed using canonical correlation analysis (CCA), and its neural network variant (Deep CCA), to understand how melodic contours and sound tracings relate to each other. The analyses reveal non-linear relationships between sound and motion. The link between pitch and verticality does not appear strong enough for complex melodies. We also find that descending melodic contours have the least correlation with tracings.

1. INTRODUCTION

Can hand movement be used to retrieve melodies? In this paper we use data from a ‘sound-tracing’ experiment (Figure 1) containing motion capture data to describe music–motion cross-relationships, with the aim of developing a retrieval system. Details about the experiment and how motion metaphors come to play a role in the representations are presented in [19]. While our earlier analysis was focused on the use of the body and imagining metaphors for tracings [17, 18], in this paper, we will focus on musical characteristics and study music–motion correlations. The tracings present a unique opportunity for cross-modal retrieval, because a direct correspondence between tracing and melodic contour presents an inherent ‘ground-truth.’
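As a concrete illustration of the kind of correlation analysis named in the abstract, the following minimal sketch applies linear CCA to paired melody and tracing feature matrices using scikit-learn. The shapes and random data are placeholders; the actual pipeline, including the Deep CCA variant, is not reproduced here.

import numpy as np
from sklearn.cross_decomposition import CCA

# X: per-tracing melodic contour features, Y: matching motion features
# (hypothetical shapes: 416 melody/tracing pairs, 20 and 30 features)
rng = np.random.default_rng(0)
X = rng.standard_normal((416, 20))
Y = rng.standard_normal((416, 30))

cca = CCA(n_components=2)
X_c, Y_c = cca.fit_transform(X, Y)

# canonical correlation of each projected component pair
for k in range(2):
    r = np.corrcoef(X_c[:, k], Y_c[:, k])[0, 1]
    print(f"component {k}: r = {r:.2f}")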

Recent research in neuroscience and psychology has shown that action plays an important role in perception. In phonology and linguistics, the co-articulation of action and sound is also well understood. Theories from embodied music cognition [22] have been critical to this exploration of multimodal correspondences.

© Tejaswinee Kelkar, Udit Roy, Alexander Refsum Jensenius. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Tejaswinee Kelkar, Udit Roy, Alexander Refsum Jensenius. “Evaluating a Collection of Sound-Tracing Data of Melodic Phrases”, 19th International Society for Music Information Retrieval Conference, Paris, France, 2018.

Figure 1. An example of post-processed motion capture data from a sound-tracing study of melodic phrases.

Contour perception is a coarse-level musical ability that we acquire early during childhood [30, 33, 34]. Research suggests that our memory for contour is enhanced when melodies are tonal, and when tonal accent points of melodies co-occur with strong beats [16], making melodic memory a salient feature in musical perception. More generally, it is easier for people to remember the general shape of a melody than precise intervals [14], especially if they are not musical experts. Coarse representations of melodic contour, such as drawing or moving the hands in the air, may be an intuitive way to capture musical moments at short time scales [9, 25].

1.1 Research Questions

The inspiration for our work mainly comes from several projects on melodic content retrieval using intuitive and multi-modal representations of musical data. The oldest example of this is the 1975 project titled ‘Directory of Tunes and Musical Themes,’ where the author uses a simplified contour notation method, involving letters for denoting contour directions, to create a dictionary of musical themes in which one may look up a tune they remember [29]. This model is adopted for melodic contour retrieval in Musipedia.com [15]. Another system is proposed in the recent project SoundTracer, in which a user’s motion of their mobile phone is used to retrieve tunes from a music archive [21]. A critical difference between these approaches is how they handle mappings between contour information and musical information, especially differences between time-scales and time-representations. Most of these methods do not have ground-truth models of contours, and instead use one of several mapping strategies, each with its own assumptions.
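To make this letter-based idea concrete, the following minimal Python function (our own illustration, not code from [29] or Musipedia) reduces a note sequence to a contour string over the symbols ‘u’ (up), ‘d’ (down), and ‘r’ (repeat); a dictionary of themes keyed by such strings then acts as a lookup directory.

def contour_code(pitches):
    """Reduce a note sequence (e.g. MIDI note numbers) to a contour string."""
    letters = []
    for prev, curr in zip(pitches, pitches[1:]):
        if curr > prev:
            letters.append("u")
        elif curr < prev:
            letters.append("d")
        else:
            letters.append("r")
    return "".join(letters)

print(contour_code([60, 60, 67, 67, 69, 69, 67]))  # -> "rururd"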

Godøy et al. have argued for using motion-based, graphical, verbal, and other representations of motion data in music retrieval systems [10]. Liem et al. make a case for using multimodal user-centered strategies as a way to navigate the discrepancy between audio similarity and music similarity [23], with the former referring to more mathematical features, and the latter to more perceptual features. We proceed with this as the point of departure for describing our dataset and its characteristics, approaching the goal of making a system for classifying sound-tracings of melodic phrases with the following specific questions:

1. Are the mappings between melodic contour and motion linearly related?

2. Can we confirm previous findings regarding correlation between pitch and the vertical dimension?

3. What categories of melodic contour are most correlated for sound-tracing queries?

2. RELATED WORK

Understanding the close relationship between music and motion is vital to understanding the subjective experiences of performers and listeners [7, 11, 12]. Many empirical experiments aimed at investigating music–motion correspondences deal with stimulus data that is constructed to explicitly observe certain mappings, for example pitched and non-pitched sound, the vertical dimension and pitch, or player expertise [5, 20, 27]. This means that the music examples themselves are sorted into types of sound (or types of motion). We are more interested in observing how a variety of these mapping relationships change in the content of melodic phrases. For this we use multiple labeling strategies, as explained in Section 3.4. Another contribution of this work is the use of musical styles from various parts of the world, including those that contain microtonal inflections.

2.1 Multi-modal retrieval

Multi-modal retrieval is the paradigm of information retrieval used to handle different types of data together. The objective is to learn a set of mapping functions that project the different modalities into a common metric space, so that relevant information in one modality can be retrieved through a query in another. This paradigm is often used for retrieving images from text and text from images. Canonical correlation analysis (CCA) is a common tool for investigating linear relationships between two sets of variables. The review paper by Wang et al. on cross-modal retrieval [35] analyzes several implementations and models. CCA has also previously been used to show cross-relationships between music and brain imaging [3].

A previous paper analyzing tracings to pitched and non-pitched sounds also used CCA to understand music–motion relationships [25]; the authors describe inherent non-linearity in the mappings, despite finding intrinsic sound–action relationships. This work was extended in [26], in which CCA was used to interpret how different features correlate with each other. Pitch and vertical motion have linear relationships in this analysis, although it is important to note that the sound samples used for this study were short and synthetic.

The biggest reservations in analyzing music–motion data through CCA are that non-linearity cannot be represented, and that the method depends heavily on time synchronization: the temporal evolution of motion and sound is assumed to remain linear over time [6]. To get around this, kernel-based methods can be used to introduce non-linearity. Ohkushi et al. present a paper that uses kernel-based CCA methods to analyze motion and music features together, using video sequences from classical ballet and optical-flow-based clustering. Bozkurt et al. present a CCA-based system to analyze and generate speech and arm motion for prosody-driven synthesis of the ‘beat gesture’ [4], which is used for emphasizing prosodically salient points in speech. We explore our dataset through CCA due to the previous successes of this family of methods. We also analyze the same data using Deep CCA, a neural-network approximation of CCA, to better understand the non-linear mappings.

2.2 Canonical Correlation Analysis

CCA is a statistical method that finds linear combinations of two sets of variables $X = (x_1, x_2, \ldots, x_n)$ and $Y = (y_1, y_2, \ldots, y_m)$, with $n$ and $m$ independent variables, as weight vectors $a$ and $b$ such that the correlation $\rho = \mathrm{corr}(a^{T}X, b^{T}Y)$ of the transformed variables is maximized:

$$a', b' = \operatorname*{argmax}_{a,b} \; \mathrm{corr}(a^{T}X, b^{T}Y).$$

We can then find a second set of coefficients that maximizes the correlation of the variables $X' = a^{T}X$ and $Y' = b^{T}Y$, with the additional constraint that $(X, X')$ and $(Y, Y')$ remain uncorrelated. This process can be repeated up to $d = \min(m, n)$ dimensions.

CCA can be extended to include non-linearity by using a neural network to transform the $X$ and $Y$ variables, as in Deep CCA [2]. Given network parameters $\theta_1$ and $\theta_2$, the objective is to maximize the correlation $\mathrm{corr}(f(X; \theta_1), f(Y; \theta_2))$. The network is trained by following the gradient of the correlation objective as estimated from the training data.
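As a concrete illustration of the linear case, the sketch below computes canonical correlations with scikit-learn on synthetic data; the library choice, array shapes, and toy data are our assumptions for illustration, not details taken from the paper.

import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)

# Toy stand-ins for the two views: motion features (n samples x 9)
# and melody features (n samples x 4), echoing Tables 3 and 2.
X = rng.standard_normal((200, 9))
Y = 0.5 * X[:, :4] + 0.1 * rng.standard_normal((200, 4))  # partly correlated

cca = CCA(n_components=4)            # d = min(m, n) = 4 here
Xc, Yc = cca.fit_transform(X, Y)

# Canonical correlation of the first transformed dimension,
# i.e. rho = corr(a'X, b'Y) for the optimal weight vectors a, b.
rho = np.corrcoef(Xc[:, 0], Yc[:, 0])[0, 1]
print(f"first canonical correlation: {rho:.2f}")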


3. EXPERIMENT DESCRIPTION

3.1 Procedure

The participants were instructed to move their hands as if their movement was creating the melody. The use of the term ‘creating,’ instead of ‘representing,’ is purposeful, following earlier studies [26, 27], to access sound production as the tracing intent. The experiment duration was about 10 minutes. All melodies were played at a comfortable listening level through a Genelec 8020 speaker placed 3 m in front of the subjects. Each session consisted of an introduction, two example sequences, 32 trials, and a conclusion. Each melody was played twice with a 2 s pause in between. During the first presentation, the participants were asked to listen to the stimulus, while during the second presentation, they were asked to trace the melody. All instructions and required guidelines were recorded and played back through the speaker. Motion was tracked using 8 infrared cameras from Qualisys (7 Oqus 300 and 1 Oqus 410). We then post-processed the data in Qualisys Track Manager (QTM), first identifying and labeling each marker for each participant. Thereafter, we created a dataset containing left- and right-hand coordinates for all participants.

Six participants in the study had to be excluded due to too many marker dropouts, giving us a final dataset containing 26 participants tracing 32 melodies: 794 tracings for 16 melodic categories.

3.2 Subjects

The 32 subjects (17 female, 15 male) had a mean age of 31 years (SD = 9 years). They were mainly university students and employees, both with and without musical training. Their musical experience was quantified using the OMSI (Ollen Musical Sophistication Index) questionnaire [28], and they were also asked about their familiarity with the musical genres and their experience with dancing. The mean OMSI score was 694 (SD = 292), indicating that the general musical proficiency in this group was on the higher side. The average familiarity was 4.03 out of a possible 5 points for Western classical music, 3.25 for jazz, 1.87 for Sami joik, and 1.71 for Hindustani music. None of the participants reported having heard any of the melodies played to them. All participants provided written consent before they participated in the study, and they were free to withdraw during the experiment. The study design was approved by the national ethics board (NSD).

3.3 Stimuli

In this study, we decided to use melodic phrases from vocal genres that have a tradition of singing without words. Vocal phrases without words were chosen so as not to introduce lexical meaning as a confounding variable. Leaving out instruments also avoids the problem of subjects having to choose between different musical layers in their sound-tracing.

Figure 2. Pitch plots of all the 16 melodic phrases used as experiment stimuli, from each genre. The x axis represents time in seconds, and the y axis represents notes. The extracted pitches were re-synthesized to create a total of 32 melodic phrases used in the experiment.

The final stimulus set consists of four different musical genres and four stimuli for each genre. The musical genres selected are: (1) Hindustani music, (2) Sami joik, (3) jazz scat singing, (4) Western classical vocalise. The melodic fragments are phrases taken from real recordings, to retain melodies within their original musical context. As can be seen in the pitch plots in Figure 2, the melodies are of varying durations, with an average of 4.5 s (SD = 1.5 s). The Hindustani and joik phrases are sung by male vocalists, whereas the scat and vocalise phrases are sung by female vocalists. This is reflected in the pitch range of each phrase, as seen in Figure 2.

Figure 3. Contour typologies discussed previously in melodic contour analysis: Seeger (xx, xy, xyy, xyx), Schaeffer (impulsive, iterative, sustained), Varna (ascending, descending, stationary, varying), Hood (arch, bow, tooth, diagonal), and Adams (repetition, recurrence). This figure is representative, made by the authors.

Melodic contours are overwhelmingly written about in terms of pitch, and so we decided to create a ‘clean’ pitch-only representation of each melody. This was done by running the sound files through an autocorrelation algorithm to create phrases that accurately resemble the pitch content, but without the vocal, timbral, and vowel content of the melodic stimulus. These 16 re-synthesized sounds were added to the stimulus set, thus obtaining a total of 32 sound stimuli.
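The paper does not detail the resynthesis step beyond the autocorrelation-based pitch extraction, but one plausible sketch, assuming an f0 contour (one value per audio sample) has already been extracted, renders the contour as a pure tone by accumulating instantaneous phase:

import numpy as np

def synthesize_pitch_contour(f0, sr=44100, amp=0.8):
    """Render an f0 contour (Hz per sample, 0 = unvoiced) as a sine tone."""
    f0 = np.asarray(f0, dtype=float)
    phase = 2.0 * np.pi * np.cumsum(f0) / sr   # accumulated instantaneous phase
    y = amp * np.sin(phase)
    y[f0 <= 0] = 0.0                           # silence unvoiced regions
    return y

# Example: a 2 s glide from 220 Hz to 440 Hz.
sr = 44100
y = synthesize_pitch_contour(np.linspace(220, 440, 2 * sr), sr)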


ID  Label           Description
1   All             All 16 melodies
2   IJSV            4 genres
3   ADSC            Ascending, Descending, Steady, or Combined
4   OrigVSyn        Original vs. synthesized
5   VibNonVib       Vibrato vs. no vibrato
6   MotifNonMotif   Motif repetition present vs. not

Table 1. Multiple labellings for melodic categories: we represent the 16 melodies using 5 different label sets. This helps us analyze which features are best related to which contour classes, genres, or melodic properties.

3.4 Contour Typology Descriptions

We base the selection of melodic excerpts on the descriptions of melodic contour classes seen in Figure 3. The reference typologies are based on the work of Seeger [32], Hood [13], Schaeffer [8], Adams [1], and the Hindustani classical Varna system. Through these typologies, we hope to cover commonly understood contour shapes and make sure that the dataset contains as many of them as possible.

3.4.1 Multiple labeling

To represent the different contour types and categories that these melodies span, we create multiple labels that explain the differences. This enables us to understand how the sound tracings actually map to the different possible categories, and makes it easier to see patterns in the data. We describe these labels in Table 1. Multiple labels allow us to see which categories the data describe, and which features or combinations of features can help retrieve which labels. Some of these labels are categories, while some are one-versus-rest. Category labels include individual melodies, genres, and contour categories, while one-versus-rest correlations are computed for whether vibrato or motivic repetition exists in the melody, and whether the melodic sample is re-synthesized or original.
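The label sets can be thought of as parallel label arrays over the 32 stimuli. The sketch below shows the structure only, under the assumption that resynthesized versions follow the originals in stimulus order; the hand-assigned label sets (ADSC, VibNonVib, MotifNonMotif) are omitted because their per-melody values are set by inspection.

# Structure of the label sets over the 32 stimuli (16 melodies, each in
# original and resynthesized form). Stimulus ordering is hypothetical.
n_mel, n_stim = 16, 32
melody_of = [s % n_mel for s in range(n_stim)]   # stimulus -> melody index

label_sets = {
    "All":      [melody_of[s] for s in range(n_stim)],       # 16 classes
    "IJSV":     [melody_of[s] // 4 for s in range(n_stim)],  # 4 genre classes
    "OrigVSyn": [int(s >= n_mel) for s in range(n_stim)],    # one-versus-rest
}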

4. DATASET CREATION

4.1 Preprocessing of Motion Data

We segment each phrase traced by the participants, label participant and melody numbers, and extract the data for the left and right hand markers, since the instructions asked people to trace using their hands. For this analysis, we are more interested in contour features and shape information than in time-scales. We therefore time-normalize our data so that every melodic sample and every motion tracing has the same length. This makes it easier to find correlations between music and motion data using different features.
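A minimal sketch of such time normalization, assuming linear interpolation to a fixed length (the target of 500 samples is an arbitrary illustrative choice):

import numpy as np

def time_normalize(signal, n_samples=500):
    """Resample a 1-D signal to a fixed length by linear interpolation."""
    signal = np.asarray(signal, dtype=float)
    old_t = np.linspace(0.0, 1.0, len(signal))
    new_t = np.linspace(0.0, 1.0, n_samples)
    return np.interp(new_t, old_t, signal)

# e.g. z = time_normalize(z_position); f0 = time_normalize(pitch_contour)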

Figure 4. Feature distribution of melodies for each genre. We make sure that a wide range of variability in the features described in Table 2 is present in the dataset.

Feature            Calculated by
1  Pitch           Autocorrelation function using PRAAT
2  Loudness        RMS value of the sound using Librosa
3  Brightness      Spectral centroid using Librosa
4  Number of notes Number of notes per melody

Table 2. Melody features extracted for analysis, and details of how they are extracted.

5. ANALYSIS

5.1 Music

Since we are mainly interested in melodic correlations, the most important step in describing the melodies is to extract pitch. For this, we use the autocorrelation algorithm available in the PRAAT phonetics program. We use Librosa v0.5.1 [24] to compute the RMS energy (loudness), and the brightness as the spectral centroid. We transcribe the melodies to get the number of notes per melody. The distribution of these features for each genre in the stimulus set can be seen in Figure 4. We have tried to be true to the musical styles used in this study, most of which do not have written notation as an inherent part of their pedagogy.
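A sketch of this feature extraction follows. Note the assumptions: the paper used PRAAT for pitch and Librosa v0.5.1 for the spectral features, whereas this sketch uses a recent Librosa release in which pyin stands in for PRAAT’s autocorrelation method; the file name is hypothetical, and the number of notes was transcribed by hand.

import librosa

y, sr = librosa.load("melody01.wav", sr=None)    # hypothetical file name

# Pitch contour (Hz; NaN in unvoiced regions).
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)

loudness = librosa.feature.rms(y=y)[0]                         # RMS energy
brightness = librosa.feature.spectral_centroid(y=y, sr=sr)[0]  # spectral centroid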

5.2 Motion

For the tracings, we calculate 9 features that describe various characteristics of motion. We record only the X and Z axes, as maximum motion is found along these directions. The derivatives of motion (velocity, acceleration, jerk) and the quantity of motion (QoM), a cumulative velocity quantity, are calculated. Distance between hands, cumulative distance, and symmetry features are calculated as indicators of contour-supporting features, as found in previous studies.


Feature                          Description
1  X-coordinate (X)              Axis corresponding to the direction straight ahead of the participant
2  Z-coordinate (Z)              Axis corresponding to the upwards direction
3  Velocity (V)                  First derivative of vertical position
4  Acceleration (A)              Second derivative of vertical position
5  Quantity of Motion            Sum of absolute velocities for all markers
6  Distance between Hands        Sample-wise Euclidean distance between hand markers
7  Jerk                          Third derivative of vertical position
8  Cumulative Distance Traveled  Euclidean distance traveled per sample per hand
9  Symmetry                      Difference between the left and right hand in terms of vertical position and horizontal velocity

Table 3. Motion features used for analysis. Features 1–5 are for the dominant hand, while 6–9 are features for both hands.
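Several of these features follow directly from the marker positions; a minimal numpy sketch is given below, in which the 240 Hz frame rate and the array shapes are our assumptions rather than details from the paper.

import numpy as np

def motion_features(pos, dt=1.0 / 240):
    """Derivatives and quantity of motion from positions of shape
    (n_samples, n_markers, 3); the 240 Hz rate is an assumption."""
    vel = np.gradient(pos, dt, axis=0)
    acc = np.gradient(vel, dt, axis=0)
    jerk = np.gradient(acc, dt, axis=0)
    speed = np.linalg.norm(vel, axis=2)   # per-marker speed
    qom = speed.sum(axis=1)               # quantity of motion per sample
    return vel, acc, jerk, qom

def hand_distance(left, right):
    """Sample-wise Euclidean distance between hand markers, shape (n, 3)."""
    return np.linalg.norm(left - right, axis=1)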

5.3 Joint Analysis

In this section we present the analysis of our dataset with these two feature sets. We analyze the tracings for each melody, and also utilize the multiple label sets, to discover patterns in the dataset that are relevant for a retrieval application.

5.3.1 Dynamic Time Warping

Dynamic time warping (DTW) is a method to align sequences of different lengths using substitution, insertion, and deletion costs. It is a non-metric method that gives us the distance between two sequences after alignment.

Vertical motion has previously been shown to correlate with pitch for simple sounds, and some form of non-alignment has also been observed between the motion and pitch signals. We perform the same analysis on our data: we compute the correlation between pitch and motion in the Z axis, before and after alignment with DTW, for the 16 melodies, and plot their mean and variance in Figure 5.
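The paper does not specify which DTW implementation was used; the sketch below is a minimal quadratic-time version with an absolute-difference cost, followed by the before/after correlation comparison.

import numpy as np

def dtw_path(x, y):
    """Align two 1-D sequences; return the warping path as (i, j) pairs."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    path, i, j = [], n, m                 # backtrack from the end
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def correlation_after_dtw(pitch, z):
    """Pearson correlation of pitch and vertical motion after alignment."""
    path = dtw_path(pitch, z)
    p = np.array([pitch[i] for i, _ in path])
    q = np.array([z[j] for _, j in path])
    return np.corrcoef(p, q)[0, 1]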

5.3.2 Longest Run-lengths

While observing the dataset, we find that the longest ascending and descending sequences in the melodies are most often reliably represented in the motions, although the variance in stationary notes and ornaments is likely to be much higher. To exploit this feature in the tracings, we use ‘longest run-lengths’ as a measure: we find multiple subsequences following a pattern, which can possess discriminative qualities.

Figure 5. Correlations of pitch with raw data (red) vs. after DTW alignment (blue). Although DTW alignment improves the correlation, we observe that the correlation is still low, suggesting that vertical motion and pitch height are not that strongly associated.

For our analysis, we use the ascending and descending patterns, finding the subsequences of the feature sequence that are purely ascending or descending. We then rank the subsequences and build a feature vector from the lengths of the top N results. This step is particularly advantageous when comparing features from motion and music sequences, as it captures the overall presence of the pattern in a sequence while remaining invariant to the mis-alignment or lag between the sequences from different modalities. As an example, we may select the Z-axis motion of the dominant hand and the melody pitch as our sequences and retrieve the top 3 ascending subsequence lengths. To make the features robust, we low-pass filter the sequences as a preprocessing step.
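A sketch of the run-length feature, assuming a simple moving-average filter as the low-pass step (the paper does not specify the filter used):

import numpy as np

def top_run_lengths(seq, n=4, direction="ascend", smooth=5):
    """Lengths of the n longest strictly ascending (or descending) runs
    in a sequence, after moving-average smoothing."""
    seq = np.convolve(seq, np.ones(smooth) / smooth, mode="valid")
    target = 1 if direction == "ascend" else -1
    runs, count = [], 0
    for step in np.sign(np.diff(seq)):
        if step == target:
            count += 1
        elif count:
            runs.append(count)
            count = 0
    if count:
        runs.append(count)
    runs.sort(reverse=True)
    return (runs + [0] * n)[:n]           # pad if fewer than n runs exist

# Feature vectors for CCA, e.g.:
# top_run_lengths(z_motion), top_run_lengths(pitch_contour)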

We analyze our dataset by computing these features for a few combinations of motion and music features, for both ascending and descending patterns. Thereafter, we perform CCA and show the resulting correlation of the first transformed dimension in Table 4. We utilize the various label categories generated for the melodies, and show the impact of the features on the labels from each category in Tables 4 and 5. We select the top four run lengths as our feature for each music–motion feature sequence. For the Deep CCA analysis, we use a two-layered network (the same for both motion and music features) with 10 and 4 neurons. A final round of linear CCA is also performed on the network output.
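For intuition, the sketch below shows a deliberately simplified, single-component variant of the Deep CCA objective, written in PyTorch (an assumed framework, not named in the paper): two small networks are trained to maximize the Pearson correlation of their one-dimensional outputs. The full method [2] optimizes several canonical dimensions jointly under whitening constraints, and the paper’s networks output 4 units followed by a final linear CCA.

import torch
import torch.nn as nn

def corr_loss(u, v, eps=1e-8):
    """Negative Pearson correlation between two 1-D projections."""
    u = u - u.mean()
    v = v - v.mean()
    return -(u * v).sum() / (u.norm() * v.norm() + eps)

# Two-layer networks per view; sizes are illustrative.
f = nn.Sequential(nn.Linear(9, 10), nn.Sigmoid(), nn.Linear(10, 1))
g = nn.Sequential(nn.Linear(4, 10), nn.Sigmoid(), nn.Linear(10, 1))
opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-3)

X = torch.randn(200, 9)   # toy motion features
Y = torch.randn(200, 4)   # toy melody features
for _ in range(500):      # follow the gradient of the correlation objective
    opt.zero_grad()
    loss = corr_loss(f(X).squeeze(1), g(Y).squeeze(1))
    loss.backward()
    opt.step()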

6. RESULTS AND DISCUSSION

Figure 5 shows the correlations between vertical motion and pitch for each melody, with raw data and after DTW alignment. Overall, the correlation improves after DTW alignment, suggesting phase lags and phase differences between the timing of melodic peaks and onsets and those of the motion. We see no significant differences between genres, although the improvement in correlation pre and post DTW is smallest for the vocalise examples. This could be because of the continuous vibrato in these examples, causing people to use more ‘shaky’ representations, which are the most consistent between participants.


Motion   Music   All              ADSC                                         IJSV
                 CCA   Deep CCA   CCA                   Deep CCA               CCA                    Deep CCA
Ascend pattern
Z        Pitch   0.19  0.23       0.25 0.16 0.09 0.05   0.24 0.17 0.12 0.13    0.16 -0.13 0.01 0.37   0.19 0.21 0.08 0.36
Z + V    Pitch   0.21  0.27       0.26 0.09 0.15 0.10   0.30 0.03 0.05 0.17    0.22 -0.13 -0.01 0.35  0.24 0.25 0.15 0.34
All      All     0.33  0.44       0.31 0.14 0.19 0.29   0.44 0.29 0.01 0.36    0.30 0.28 0.23 0.42    0.38 0.43 0.27 0.52
Descend pattern
Z        Pitch   0.18  0.21       0.16 -0.11 0.15 0.20  0.17 0.19 0.09 0.19    0.22 0.21 -0.04 0.23   0.22 0.18 0.08 0.28
Z + V    Pitch   0.21  0.31       0.23 0.03 0.14 0.22   0.28 0.28 0.30 0.32    0.26 0.23 0.10 0.24    0.42 0.18 0.34 0.17
All      All     0.35  0.44       0.39 0.12 0.20 0.25   0.38 0.02 0.37 0.37    0.35 0.25 0.12 0.36    0.40 0.22 0.14 0.52

Table 4. Correlations for all samples in the dataset and the two major categorizations of music labels, using ascend and descend patterns as explained in Section 5.3.2, and features from Tables 3 and 2.

Motion   Music   MotifNonMotif          OrigVSyn               VibNonVib
                 CCA        Deep CCA    CCA        Deep CCA    CCA        Deep CCA
Ascend pattern
Z        Pitch   0.05 0.23  0.13 0.26   0.19 0.19  0.22 0.25   0.33 0.07  0.33 0.13
Z + V    Pitch   0.10 0.24  0.17 0.31   0.19 0.22  0.24 0.31   0.33 0.09  0.32 0.20
All      All     0.29 0.34  0.36 0.47   0.30 0.35  0.42 0.45   0.38 0.29  0.49 0.40
Descend pattern
Z        Pitch   0.20 0.17  0.19 0.21   0.20 0.16  0.23 0.18   0.20 0.17  0.24 0.18
Z + V    Pitch   0.22 0.22  0.32 0.29   0.24 0.20  0.35 0.26   0.22 0.22  0.14 0.34
All      All     0.25 0.40  0.37 0.45   0.38 0.33  0.45 0.44   0.33 0.35  0.54 0.35

Table 5. Correlations for two-class categories, using ascend and descend patterns as explained in Section 5.3.2, with features from Tables 3 and 2.

The linear mappings of pitch and vertical motion are limited, making the dataset challenging. This also means that the associations between pitch and vertical motion described in previous studies are not that clear for this stimulus set, especially as we use musical samples that are controlled for neither isochrony nor equal temperament.

Thereafter, we conduct the CCA and Deep CCA analyses seen in Tables 4 and 5. Overall, Deep CCA performs better than its linear counterpart. We find better correlation with all features from Table 3, as opposed to just using vertical motion and velocity. With ascending and descending longest run-lengths, we are able to achieve similar results for correlating all melodies with their respective tracings. However, descending contour classification does not have similar success. There is more general agreement on contour for some melodies than for others, with purely descending melodies having particularly low correlation. There is some evidence that descending intervals are harder to identify than ascending intervals [31], which could explain the low level of agreement amongst people for descending melodies in this study. Differences between ascending and descending contours require further study.

When using the genre labels (IJSV) for correlation, we find that the scat samples show the least correlation, and the least improvement. Speculatively, this could be related to the high number of spoken syllables in this style, even though the syllables are not words. Deep CCA also gives an overall correlation of 0.54 for recognizing melodies containing vibrato from the dataset, an indication that sonic textures are well represented in such a dataset.

With all melody and all motion features, we find an overall correlation of 0.44 with Deep CCA, for both the longest-ascend and longest-descend features. This supports the view that non-linearity is inherent to tracings.

7. CONCLUSIONS AND FUTURE WORK

Interest in cross-modal systems is growing in the context of multi-modal analysis. Previous studies in this area have used shorter time scales or synthetically generated isochronous music samples. The strength of this particular study is in using musical excerpts as they are performed, and in the fact that the performed tracings are not iconic or symbolic, but spontaneous. This makes the dataset a step closer to understanding contour perception in melodies. We hope that the dataset will prove useful for pattern mining, as it presents novel multimodal possibilities for the community and could be used for user-centric retrieval interfaces.

In the future, we wish to create a system that synthesizes melody–motion pairs by training a network on this dataset, and to conduct a user evaluation study in which users evaluate system-generated music–motion pairs in a forced-choice paradigm.

8. ACKNOWLEDGMENTS

Partially supported by the Research Council of Norway through its Centres of Excellence scheme (262762 & 250698), and the Nordic Sound and Music Computing Network funded by the Nordic Research Council.


9. REFERENCES

[1] Charles R. Adams. Melodic contour typology. Ethnomusicology, pages 179–215, 1976.

[2] Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. Deep canonical correlation analysis. In International Conference on Machine Learning, pages 1247–1255, 2013.

[3] Nick Gang, Blair Kaneshiro, Jonathan Berger, and Jacek P. Dmochowski. Decoding neurally relevant musical features using canonical correlation analysis. In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017.

[4] Elif Bozkurt, Yucel Yemez, and Engin Erzin. Multimodal analysis of speech and arm motion for prosody-driven synthesis of beat gestures. Speech Communication, 85:29–42, 2016.

[5] Baptiste Caramiaux, Frederic Bevilacqua, and Norbert Schnell. Towards a gesture-sound cross-modal analysis. In International Gesture Workshop, pages 158–170. Springer, 2009.

[6] Baptiste Caramiaux and Atau Tanaka. Machine learning of musical gestures. In NIME, pages 513–518, 2013.

[7] Martin Clayton and Laura Leante. Embodiment in music performance. 2013.

[8] Rolf Inge Godøy. Images of sonic objects. Organised Sound, 15(1):54–62, 2010.

[9] Rolf Inge Godøy, Egil Haga, and Alexander Refsum Jensenius. Exploring music-related gestures by sound-tracing: A preliminary study. 2006.

[10] Rolf Inge Godøy and Alexander Refsum Jensenius. Body movement in music information retrieval. In 10th International Society for Music Information Retrieval Conference, 2009.

[11] Anthony Gritten and Elaine King. Music and Gesture. Ashgate Publishing, Ltd., 2006.

[12] Anthony Gritten and Elaine King. New Perspectives on Music and Gesture. Ashgate Publishing, Ltd., 2011.

[13] Mantle Hood. The Ethnomusicologist, volume 140. Kent State University Press, 1982.

[14] David Huron. The melodic arch in Western folksongs. Computing in Musicology, 10:3–23, 1996.

[15] K. Irwin. Musipedia: The open music encyclopedia. Reference Reviews, 22(4):45–46, 2008.

[16] Mari Riess Jones and Peter Q. Pfordresher. Tracking musical patterns using joint accent structure. Canadian Journal of Experimental Psychology/Revue canadienne de psychologie experimentale, 51(4):271, 1997.

[17] Tejaswinee Kelkar and Alexander Refsum Jensenius. Exploring melody and motion features in sound-tracings. In Proceedings of the SMC Conferences, pages 98–103. Aalto University, 2017.

[18] Tejaswinee Kelkar and Alexander Refsum Jensenius. Representation strategies in two-handed melodic sound-tracing. In Proceedings of the 4th International Conference on Movement Computing, page 11. ACM, 2017.

[19] Tejaswinee Kelkar and Alexander Refsum Jensenius. Analyzing free-hand sound-tracings of melodic phrases. Applied Sciences, 8(1):135, 2018.

[20] M. Kussner. Creating shapes: Musicians' and non-musicians' visual representations of sound. In Proceedings of the 4th Int. Conf. of Students of Systematic Musicology, U. Seifert and J. Wewers, Eds. epOs-Music, Osnabruck (Forthcoming), 2012.

[21] Olivier Lartillot. SoundTracer, 2018.

[22] Marc Leman. Embodied Music Cognition and Mediation Technology. MIT Press, 2008.

[23] Cynthia Liem, Meinard Muller, Douglas Eck, George Tzanetakis, and Alan Hanjalic. The need for music information retrieval with user-centered and multimodal strategies. In Proceedings of the 1st International ACM Workshop on Music Information Retrieval with User-centered and Multimodal Strategies, pages 1–6. ACM, 2011.

[24] Brian McFee, Colin Raffel, Dawen Liang, Daniel P. W. Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in Python. In Proceedings of the 14th Python in Science Conference, 2015.

[25] Kristian Nymoen, Baptiste Caramiaux, Mariusz Kozak, and Jim Torresen. Analyzing sound tracings: A multimodal approach to music information retrieval. In Proceedings of the 1st International ACM Workshop on Music Information Retrieval with User-centered and Multimodal Strategies, MIRUM '11, pages 39–44, New York, NY, USA, 2011. ACM.

[26] Kristian Nymoen, Rolf Inge Godøy, Alexander Refsum Jensenius, and Jim Torresen. Analyzing correspondence between sound objects and body motion. ACM Trans. Appl. Percept., 10(2):9:1–9:22, June 2013.

[27] Kristian Nymoen, Jim Torresen, Rolf Godøy, and Alexander Refsum Jensenius. A statistical approach to analyzing sound tracings. Speech, Sound and Music Processing: Embracing Research in India, pages 120–145, 2012.

[28] Joy E. Ollen. A criterion-related validity test of selected indicators of musical sophistication using expert ratings. PhD thesis, The Ohio State University, 2006.

[29] Denys Parsons. The Directory of Tunes and Musical Themes. S. Brown, Cambridge, England, 1975.


[30] Aniruddh D. Patel. Music, Language, and the Brain. Oxford University Press, 2010.

[31] Art Samplaski. Interval and interval class similarity: Results of a confusion study. Psychomusicology: A Journal of Research in Music Cognition, 19(1):59, 2005.

[32] Charles Seeger. On the moods of a music-logic. Journal of the American Musicological Society, 13(1/3):224–261, 1960.

[33] Sandra E. Trehub, Judith Becker, and Iain Morley. Cross-cultural perspectives on music and musicality. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 370(1664):20140096, 2015.

[34] Sandra E. Trehub, Dale Bull, and Leigh A. Thorpe. Infants' perception of melodies: The role of melodic contour. Child Development, pages 821–830, 1984.

[35] Kaiye Wang, Qiyue Yin, Wei Wang, Shu Wu, and Liang Wang. A comprehensive survey on cross-modal retrieval. arXiv preprint arXiv:1607.06215, 2016.


Appendices


Appendix A

Details of the Experiments

A.1 Marker Labeling

For the 21 markers placed on the body, the following are the codes:

1. RTOE: Right foot marker

2. LTOE: Left foot marker

3. RKNE: Right knee marker

4. LKNE: Left knee marker

5. RHEAD: Right head marker

6. LHEAD: Left head marker

7. STRN: Sternum marker

8. RSHO: Right shoulder marker

9. LSHO: Left shoulder marker

10. RBAK: Right back marker (asymmetric)

11. RWRA: Right wrist inside marker

12. RWRB: Right wrist outside marker

13. LWRA: Left wrist inside marker

14. LWRB: Left wrist outside marker

15. RH: Right palm marker

16. LH: Left palm marker

17. RELB: Right elbow marker

18. LELB: Left elbow marker

19. LowBAK: Tailbone marker

20. RHIP: Right hip marker

21. LHIP: Left hip marker


Figure A.1: The 16 melodies used as the stimulus set for all the experiments in the thesis come from four different music cultures and contain no words.

A.2 List of Melodic Stimuli

1. North Indian singing: All four melodies are sung by Ginde (of Washington Ethnomusicology Archives, 1991). The individual features are mentioned below.

• Melody 1: stationary contour profile, lowest note of the stimulus set
• Melody 2: ascending contour profile
• Melody 3: stationary contour profile
• Melody 4: motivic repetition

2. Joik: The individual features are mentioned below.

• Melody 5: Per Henderek Haetta (Quarja), antecedent
• Melody 6: Per Henderek Haetta (Quarja), consequent
• Melody 7: Nils N. Eira (Track 49), audible vocal break
• Melody 8: Inga Susanne Haetta (Markel Joavna Piera), motivic repetition

3. Jazz scat: All four melodies are sung by Fitzgerald, as described in Chapter 5. The individual features for the choices are mentioned below.


• Melody 9: descending contour profile
• Melody 10: no clear contour direction, rhythmic play
• Melody 11: descending contour profile
• Melody 12: ascending contour profile

4. Vocalise: All four melodies are sung by June Anderson, as described in Chapter 5. The individual features are mentioned below.

• Melody 13: ascending contour profile
• Melody 14: extreme vibrato of a whole tone
• Melody 15: stationary contour profile
• Melody 16: motivic repetition, highest sustained note of the stimulus set

A.3 File Formats for Symbolic Music Notation

Some symbolic representation languages that inform melodic data sets are as follows:

1. MIDI (Musical Instrument Digital Interface): This is a standard interchange format for music data that carries event messages specifying pitch, duration, velocity, vibrato, and panning, as well as clock signals to set the tempo. MIDI messages have made it possible for digital musical instruments to communicate with each other (see the sketch after this list).

2. MusicXML: This is an XML-based representation of music notation for the web.

3. Humdrum syntax: Humdrum is a protocol created by David Huron in the 1980s that is now widely used as a set of command-line tools for music analysis.

4. Pitch class: a set of pitches described as numbers, abstracting away octave information.

5. Music Encoding Initiative (MEI): MEI is a standard for encoding symbolic musical data.

6. Notation Interchange File Format (NIFF): This is a file format intended to let music notation be read and exchanged seamlessly across various types of notation software.
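As a small illustration of MIDI's event-message model (see item 1 above), the sketch below writes a single middle C using the third-party mido library, which is not used in the thesis and is assumed here only for brevity. Note how the note's duration is implicit in the delta time between the note-on and note-off events.

import mido

mid = mido.MidiFile()
track = mido.MidiTrack()
mid.tracks.append(track)

# Middle C (note 60) at velocity 64, held for 480 ticks.
track.append(mido.Message("note_on", note=60, velocity=64, time=0))
track.append(mido.Message("note_off", note=60, velocity=64, time=480))
mid.save("example.mid")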


Appendix B

List of implemented functions and features

The datasets and code can be found at

http://tejaswineek.github.io

As a part of the code developed for this thesis, the following functions have been implemented for mocap data:

B.1 Dependencies

The Python toolbox, currently in progress, has dependencies on numpy, scipy, matplotlib, and pandas.

B.2 Data Types

The datasets are a collection of TSV files. Each file encodes the following information: [Participant ID, Stimulus ID, Strategy ID].
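A file can then be loaded with pandas, for example as follows (the file name is hypothetical, following the encoding above):

import pandas as pd

df = pd.read_csv("P03_S12_T1.tsv", sep="\t")   # hypothetical file name
print(df.head())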

B.3 Functions

B.3.1 Motion Features

1. Quantity of Motion

2. Range

3. Hand Distance

4. Upsampling

5. Downsampling

6. Velocity

7. Acceleration

8. Jerk

9. Normalize

We are able to plot the data in the following ways:


1. 2D plotting

2. 3D plotting

Additional functions for analysis:

1. Spline Interpolation

2. Movement Plane

3. Peak Detection
