


Learn2Dance: Learning Statistical Music-to-Dance Mappings for Choreography Synthesis

Ferda Ofli*, Member, IEEE, Engin Erzin, Senior Member, IEEE, Yucel Yemez, Member, IEEE, and A. Murat Tekalp, Fellow, IEEE

Abstract—We propose a novel framework for learning many-to-many statistical mappings from musical measures to dance figures towards generating plausible music-driven dance choreographies. We obtain music-to-dance mappings through use of four statistical models: i) musical measure models, representing a many-to-one relation, each of which associates different melody patterns to a given dance figure via a hidden Markov model (HMM); ii) an exchangeable figures model, which captures the diversity in a dance performance through a one-to-many relation, extracted by unsupervised clustering of musical measure segments based on melodic similarity; iii) a figure transition model, which captures the intrinsic dependencies of dance figure sequences via an n-gram model; and iv) dance figure models, capturing the variations in the way particular dance figures are performed, by modeling the motion trajectory of each dance figure via an HMM. Based on the first three of these statistical mappings, we define a discrete HMM and synthesize alternative dance figure sequences by employing a modified Viterbi algorithm. The motion parameters of the dance figures in the synthesized choreography are then computed using the dance figure models. Finally, the generated motion parameters are animated synchronously with the musical audio using a 3D character model. Objective and subjective evaluation results demonstrate that the proposed framework is able to produce compelling music-driven choreographies.

Index Terms—Multimodal dance modeling, music-to-dance mapping, music-driven dance performance synthesis and animation, automatic dance choreography creation, musical measure clustering

I. INTRODUCTION

CHOREOGRAPHY is the art of arranging dance movements for performance. Choreographers tailor sequences of body movements to music in order to embody or express ideas and emotions in the form of a dance performance. Therefore, dance is closely bound to music in its structural course, artistic expression, and interpretation. Specifically, the rhythm and expression of body movements in a dance performance are in synchrony with those of the music, and hence, the metric orders in the course of music and dance structure coincide, as Reynolds states in [1]. In order to successfully establish the contextual bond as well as the structural synchrony between dance motion and the accompanying music, choreographers tend to thoughtfully design dance motion sequences for a given piece of music by utilizing a repertoire of choreographies. Based on this common practice of choreographers, our goal in this study is to build a framework for automatic creation of dance choreographies in synchrony with the accompanying music, as if they were arranged by a choreographer, through learning many-to-many statistical mappings from music to dance. We note that the term choreography generally refers to spatial formation (circle, line, square, couples, etc.), plastic aspects of movement (types of steps, gestures, posture, grasps, etc.), and progression in space (floor patterns), whereas in this study, we use the term choreography in the sense of composition, i.e., the arrangement of the dance motion sequence.

Copyright (c) 2010 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].

F. Ofli is with the Tele-Immersion Group, Electrical Engineering and Computer Sciences Department, College of Engineering, University of California at Berkeley, Berkeley, CA, 94720. E-mail: [email protected].

E. Erzin, Y. Yemez, and A. M. Tekalp are with the Multimedia, Vision and Graphics Laboratory, College of Engineering, Koc University, Sariyer, Istanbul, 34450, Turkey. E-mail: {eerzin,yyemez,mtekalp}@ku.edu.tr.

A. Related Work

Music-driven dance animation schemes require, as a first step, structural analysis of the accompanying music signal, which includes beat and tempo tracking, measure analysis, and rhythm and melody detection. There exists extensive research in the literature on structural music analysis. Gao and Lee [2], for instance, propose an adaptive learning approach to analyze music tempo and beat based on maximum a posteriori (MAP) estimation. Ellis [3] describes a dynamic programming solution for beat tracking by finding the best-scoring set of beat times that reflect the estimated global tempo of the music. An extensive evaluation of audio beat tracking and music tempo extraction algorithms, which were included in MIREX'06, can be found in [4]. There are also some recent studies on the open problem of automatic musical meter detection [5], [6]. In the last decade, chromatic scale features have become popular in musical audio analysis, especially in music information retrieval, since they were introduced by Fujishima [7]. Lee and Slaney [8] describe a method for automatic chord recognition from audio using hidden Markov models (HMMs) through supervised learning over chroma features. Ellis and Poliner [9] propose a cross-correlation based cover song identification system with chroma features and dynamic programming beat tracking. In a very recent work, Kim et al. [10] calculate second-order statistics to form dynamic chroma feature vectors in modeling harmony structures for classical music opus identification.

Human body motion analysis/synthesis, as a unimodal problem, has also been extensively studied in the literature in many different contexts. Bregler et al. [11], for example, describe a body motion recognition approach that incorporates low-level probabilistic constraints extracted from image sequences of articulated gestures into high-level manifold and HMM-based representations. In order to synthesize data-driven body motion, Arikan and Forsyth [12] and Kovar et al. [13] propose motion graphs representing allowable transitions between poses, to identify a sequence of smoothly transiting motion segments. Li et al. [14] segment body motions into textons, each of which is modeled by a linear dynamic system, in order to synthesize human body motion in a manner statistically similar to the original motion capture data by considering the likelihood of switching from one texton to the next. Brand and Hertzmann [15] study the motion "style" transfer problem, which involves intensive motion feature analysis and learning motion patterns via HMMs from a highly varied set of motion capture sequences. Min et al. [16] present a generative human motion model for synthesis of personalized human motion styles by constructing a multilinear motion model that provides an explicit parametrized representation of human motion in terms of "style" and "identity" factors. Ruiz and Vachon [17] specifically work on dance body motion, and analyze dance figures as a chain of simple steps using HMMs for automatic recognition of basic movements in contemporary dance.

A parallel track of literature can be found in the domain of speech-driven gesture synthesis and animation. The earliest works in this domain study speaker lip animation [18], [19]. In [18], Bregler et al. use morphing of mouth regions to re-sync the existing footage to a new soundtrack. Chen shows in [19] that lip reading a speaker yields higher speech recognition rates and provides better synchronization of speech with lip movements for more natural lip animations. Later, a large body of work extends in the direction of synthesizing facial expressions along with lip movements to create more natural face animations [20]–[22]. Most of these studies adopt variations of hidden Markov models to represent the relationship between speech and facial gestures. The most recent studies in this domain aim at synthesizing not only facial gestures but also head, hand, and other body gestures for creating more realistic speaker animations [23]–[25]. For instance, Sargin et al. develop a framework for joint analysis of prosody and head gestures using parallel-branch HMM structures to synthesize prosody-driven head gesture animations [23]. Levine et al. introduce gesture controllers for animating the body language of avatars controlled online with the prosody of the input speech by training a specialized conditional random field [25].

Designing a music-driven automatic dance animation system, on the other hand, is a relatively more recent problem involving several open research challenges. There is actually little work in the literature on multimodal dance analysis and synthesis, and most of the existing studies focus solely on the aspect of synchronization between a musical piece and the corresponding dance animation. Cardle et al. [26], for instance, synchronize motion to music by locally modifying motion parameters using perceptual music cues, whereas Lee and Lee [27] employ dynamic programming to modify the timing of both music and motion via time-scaling the music and time-warping the motion. Synchronization-based methods, however, cannot (and actually do not aim to) generate new dance motion sequences. In this sense, the works in [28]–[31] present more elaborate dance analysis and synthesis schemes, all of which follow basically the same similarity-based framework: they first investigate rhythmical and/or emotional similarities between the audio segments of a given input music signal and the available dance motion segments, and then, based on these similarities, synthesize an optimal motion sequence on a motion transition graph using dynamic programming.

In our earlier work [32], we addressed the statistical learning problem in a multimodal dance analysis scheme by building a correlation model between music and dance. The correlation model was based upon the confusion matrix of co-occurring motion and music patterns extracted via unsupervised temporal segmentation, and hence was not complex enough to handle realistic scenarios. Later, in [33], we described an automatic music-driven dance animation scheme based on supervised modeling of music and dance figures. However, the considered dance scenario was very simplistic: a dance performance was assumed to have only a single dance figure to be synchronized with the musical beat. In the current paper, we propose a complete framework, based on [34], for modeling, analysis, annotation, and synthesis of multimodal dance performances, which can handle complex and realistic scenarios. Specifically, we focus on learning statistical mappings, which are in general many-to-many, between musical measure patterns and dance figure patterns for music-driven dance choreography animation.

B. Contributions

An open challenge in music-driven dance animation is due to the fact that, for most dance categories, such as ballroom and folk dances, the relationship between music and dance primitives usually exhibits a many-to-many mapping pattern. As discussed in the related work section, the previous methods proposed in the literature for music-driven dance animation, whether similarity-based [28]–[30] or synchronization-based [26], [27], do not address this challenge. They are all deterministic methods and do not involve any true dance learning process (other than building motion transition graphs); therefore they cannot capture the many-to-many relationship existing between music and dance, and always produce a single optimal motion sequence given the same input music signal. In this paper we address this open challenge by modeling the many-to-many characteristics via a statistical framework. In this respect, our primary contributions are i) choreography analysis: automatic learning of many-to-many mapping patterns from a collection of dance performances in a multimodal statistical framework, and ii) choreography synthesis: automatic synthesis of alternative dance choreographies that are coherent with a given music signal, using these many-to-many mapping patterns.

For choreography analysis, we introduce two statistical models: one capturing a many-to-one and the other capturing a one-to-many mapping from musical primitives to dance primitives. The former model learns different melody patterns associated with each dance primitive. The latter model learns the group of candidate dance primitives that can be replaced with one another without causing an artifact in the choreography. To further consolidate the coherence and the quality of the synthesized dance choreographies, we introduce a third model to capture the intrinsic dependencies of dance primitives and to preserve the implicit structure existing in the continuum of dance motion. Combining the aforementioned three models, we present a modified Viterbi algorithm to generate coherent and enriched sequences of dance primitives for plausible choreography synthesis.

The organization of the paper is as follows: Section II first gives an overview of our music-driven dance animation system and then briefly describes the feature extraction modules. We present the proposed multimodal choreography analysis and synthesis framework, hence our primary contribution, in Section III. The problem of character animation for visualization of synthesized dance choreographies is addressed in Section IV. Section V presents the experiments and results, and finally, Section VI gives concluding remarks and discusses possible applications of the proposed framework.

II. SYSTEM OVERVIEW AND FEATURE EXTRACTION

Our music-driven dance animation scheme is musical-measure based; hence we regard musical measures as the music primitives. A measure is the smallest compositional unit of music that corresponds to a time segment, which is defined as the number of beats in a given duration. We define a dance figure as the dance motion trajectory corresponding to a single measure segment. Dance figures are taken as the dance primitives.

The overall system, as depicted in Fig. 1, comprises three parts: analysis, synthesis, and animation. Audiovisual data preparation and feature extraction modules are common to both the analysis and synthesis parts. An audiovisual dance database can be pictured as a collection of measures and dance figures aligned in two parallel streams: a music stream and a dance stream. Fig. 2 illustrates a sample music-dance stream extracted from our folk dance database. In the data preparation module, the input music stream is segmented by an expert into its units, i.e., musical measures. We use m_t to denote the measure segment at frame t. Measure segment boundaries are then used by the expert to define the motion units, i.e., dance figures. We use d_t to denote the dance figure segment corresponding to the measure at frame t. The expert also assigns each dance figure d_t a figure label l_j to indicate the type of the dance motion. The collection of l_j forms the set of candidate dance figures, i.e., L = {l_j | j = 1, ..., N}, where N is the number of distinct dance figure labels that exist in the audiovisual dance database. The resulting sequence of dance figure labels is regarded as the original (reference) choreography, i.e., r = {r_t}_{t=1}^{T}, where r_t ∈ L and T is the number of musical measure segments. The feature extraction modules compute the dance motion features F_{d_t} and music chroma features F_{m_t} for each d_t and m_t, respectively.

Fig. 1. Block diagram of the overall multimodal dance performance analysis-synthesis framework.

Fig. 2. The audiovisual dance database is a collection of dance figure-musical measure pairs. Recall that a measure is the smallest compositional unit of music that corresponds to a time segment, which is defined as the number of beats in a given duration. On the other hand, we define a dance figure as the dance motion trajectory corresponding to a single measure segment. Hence, by definition, the boundaries of the dance figure segments coincide with the boundaries of the musical measure segments, which is in conformity with Reynolds' work [1].

We assume that the relation between music and dance primitives in a dance performance has a many-to-many mapping pattern. That is, a particular dance primitive (dance figure) can be accompanied by different music primitives (measures) in a dance performance. Conversely, a particular musical measure can correspond to different dance figures. Our choreography analysis and synthesis framework respects the many-to-many nature of the relationship between music and dance primitives by learning two separate statistical models: one capturing a many-to-one and the other capturing a one-to-many mapping from musical measures to dance figures. The former model learns different melody patterns associated with each dance figure. For this purpose, music chroma features F_{m_t} are used to train a hidden Markov model h_j^m for each dance figure label l_j to create the set of musical measure models H^m. The latter model, i.e., the model capturing a one-to-many relation from musical measures to dance figures, learns the group of candidate dance figures that can be replaced with one another without causing an artifact in the dance performance (choreography). We call such a model the exchangeable figures model X. Music chroma features F_{m_t} are used to cluster measure segments m_t according to the harmonic similarity between different measure segments. Based on these measure clusters, we determine the groups of dance figures that are accompanied by musical measures with similar harmonic content. We then create the exchangeable figures model X based on such dance figure groups. While the former model is designed to keep the underlying correlations between musical measures and dance figures as intact as possible, the latter model is useful for allowing acceptable (or desirable) variations in the dance choreography by offering various possibilities in the choice of dance figures that reflect the diversity in a dance performance (choreography). To further consolidate the coherence and quality of the synthesized dance choreography, we introduce a third model, i.e., the figure transition model F, to capture the intrinsic dependencies of the dance figures and to preserve the implicit structure existing in the continuum of dance motion. The figure transition model basically appraises the figure-to-figure transition relations by computing n-gram probabilities from the audiovisual dance database. The choreography synthesis makes use of these three models, namely H^m, F, and X, to determine the output dance figure sequence r̂ (i.e., the synthesized choreography), taking as input music chroma features extracted from a test music signal. Here, r̂ = {r̂_t}_{t=1}^{T}, where r̂_t ∈ L and T is the number of musical measure segments. Specifically, the choreography synthesis module employs modified Viterbi decoding on a discrete HMM, which is constructed from the musical measure models H^m and the figure transition model F, to determine the sequence of dance figures r̂ subject to the exchangeable figures model X.

In the analysis part of the animation model, the dance motion features F_{d_t} are used to train a hidden Markov model h_j^d for each dance figure label l_j to construct the set of dance figure models H^d. Eventually, the body posture parameters corresponding to each dance figure in the synthesized choreography r̂ are generated using the dance figure models H^d to animate a 3D character.

A. Music Feature Extraction

Unlike speech, music consists of a sequence of tones whose frequencies are precisely defined. Moreover, a musical melody is a rhythmical succession of single tones in different patterns. In this study, we model the melodic pattern in each measure segment with tone-related features using temporal statistical models, i.e., HMMs. We extract tone-related chroma features to characterize the melodic/harmonic content of music. In order to represent the chroma scale, we project the entire spectrum onto 12 bins corresponding to the 12 distinct semitones of the musical octave. Theoretically, the frequency of the k-th note in the n-th octave is defined as

f_{nk} = f_{00} \, 2^{\,n + k/12},

where the pitch of the C0 note is f_{00} = 16.35 Hz based on Shepard's helix model in [35], and n, k ∈ Z, k = 0, ..., 11. In this study, we extract the chroma features of 60 semitones for n = 4, ..., 8 (over 5 octaves, from the C4 note to the B8 note).

We extract chroma features similarly to the well-known mel-frequency cepstral coefficient (MFCC) computation [36], by applying cepstral analysis to the semitone spectral energies. Hence, our chroma features capture information on fluctuations of the semitone spectral energies. We center triangular energy windows at the locations of the semitone frequencies f_{nk} at different octaves, for k = 0, ..., 11 and n = 4, ..., 8. Then, we compute the first 12 DCT coefficients of the logarithmic semitone spectral energy vector, which constitute the chromatic scale cepstral coefficient (CSCC) feature set, f^m[n], for the music frame n. We also compute the first and second time derivatives of these 12 CSCC features, using the following regression formula:

\Delta f^m[n] = \frac{\sum_{r=-2}^{2} r \, f^m[n+r]}{\sum_{r=-2}^{2} r^2}. \qquad (1)

The music feature vector, f^m_\Delta[n], is then formed by including the first and second time derivatives:

f^m_\Delta[n] = [\, f^m[n]^T \;\; \Delta f^m[n]^T \;\; \Delta^2 f^m[n]^T \,]^T. \qquad (2)

Each F_{m_t}, therefore, corresponds to the sequence of music feature vectors f^m_\Delta[n] that fall into the measure segment m_t. Specifically, F_{m_t} is a matrix of CSCC features of the form

F_{m_t} = [\, f^m_\Delta[1] \;\; f^m_\Delta[2] \;\; \ldots \;\; f^m_\Delta[N_{m_t}] \,], \qquad (3)

where N_{m_t} is the number of audio frames in measure segment m_t.
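For concreteness, the following Python sketch illustrates a CSCC-style feature computation along the lines described above. It is not the authors' implementation: the frame length, hop size, half-semitone window bandwidth, and edge padding for the derivative computation are illustrative assumptions.

```python
import numpy as np
from scipy.fftpack import dct

def semitone_frequencies(f00=16.35, octaves=range(4, 9), notes=range(12)):
    """Frequencies f_nk = f00 * 2^(n + k/12) for the 60 semitones from C4 to B8."""
    return np.array([f00 * 2.0 ** (n + k / 12.0) for n in octaves for k in notes])

def triangular_bank(freqs, n_fft, sr):
    """Triangular energy windows centered at the semitone frequencies
    (half a semitone on each side is an assumed bandwidth)."""
    fft_freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    bank = np.zeros((len(freqs), len(fft_freqs)))
    for i, fc in enumerate(freqs):
        lo, hi = fc * 2 ** (-1 / 24.0), fc * 2 ** (1 / 24.0)
        rise = (fft_freqs - lo) / (fc - lo)
        fall = (hi - fft_freqs) / (hi - fc)
        bank[i] = np.clip(np.minimum(rise, fall), 0.0, None)
    return bank

def cscc_features(audio, sr, n_fft=2048, hop=512, n_coeff=12):
    """First 12 DCT coefficients of the log semitone spectral energies per frame."""
    bank = triangular_bank(semitone_frequencies(), n_fft, sr)
    frames = []
    for start in range(0, len(audio) - n_fft, hop):
        spec = np.abs(np.fft.rfft(audio[start:start + n_fft] * np.hanning(n_fft))) ** 2
        log_energy = np.log(bank @ spec + 1e-10)
        frames.append(dct(log_energy, norm='ortho')[:n_coeff])
    return np.array(frames)                      # shape: (num_frames, 12)

def add_deltas(feats):
    """Append first and second time derivatives computed with the regression formula (1)."""
    denom = sum(r * r for r in range(-2, 3))     # equals 10
    pad = np.pad(feats, ((2, 2), (0, 0)), mode='edge')
    delta = sum(r * pad[2 + r: len(feats) + 2 + r] for r in range(-2, 3)) / denom
    pad2 = np.pad(delta, ((2, 2), (0, 0)), mode='edge')
    delta2 = sum(r * pad2[2 + r: len(delta) + 2 + r] for r in range(-2, 3)) / denom
    return np.hstack([feats, delta, delta2])     # 36-dimensional f^m_Delta[n] per frame
```

Stacking the rows of `add_deltas(...)` that fall within one measure segment yields a matrix playing the role of F_{m_t} in (3).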

B. Motion Feature Extraction

We acquire multiview recordings of a dancing actor for each dance figure in the audiovisual dance database using 8 synchronized cameras. We then employ a motion capture technique for tracking the 3D positions of the joints of the body based on the markers' 2D projections on each camera's image plane, using the color information of the markers [33]. The resulting set of 3D points is used to fit a skeleton structure to the 3D motion capture data. Through this skeleton structure, we solve the necessary inverse kinematics equations and accurately calculate the set of Euler angles for each joint in its local frame, as well as the global translation and rotation of the skeleton structure, for the motion trajectory defined by the input 3D motion capture data. We prefer joint angles as our dance motion features due to their widespread usage in the human body motion analysis-synthesis and 3D character animation literature. We compute 66 angular values associated with 27 key joints of the body as well as 6 values for the global rotation and translation of the body, which leads to a dance motion feature vector f^d[n] of dimension 72 for each dance motion frame n. However, angular features are generally discontinuous at boundary values due to their 2π-periodic nature, and this causes a problem in training statistical models to capture the temporal dynamics of a sequence of angular features. Therefore, instead of using the static set of Euler angles f^d[n], we use their first and second differences, computed with the following difference equation,

\Delta f^d[n] = \frac{1}{2}\left( f^d[n+1] - f^d[n-1] \right), \qquad (4)

where the resulting discontinuities are eliminated by the following conditional update:

\Delta f^d[n] = \begin{cases} \Delta f^d[n] - 180 & \text{if } \Delta f^d[n] > 90, \\ \Delta f^d[n] + 180 & \text{if } \Delta f^d[n] < -90, \\ \Delta f^d[n] & \text{otherwise.} \end{cases} \qquad (5)

Then the dynamic motion feature vector is formed as f^d_\Delta[n] = [\, \Delta f^d[n]^T \;\; \Delta^2 f^d[n]^T \,]^T of dimension 144. Hence, each F_{d_t} is a sequence of motion feature vectors f^d_\Delta[n] that fall into dance figure segment d_t, used while training temporal models of the motion trajectories associated with each dance figure label l_j. That is, F_{d_t} is a matrix of body motion feature values of the form

F_{d_t} = [\, f^d_\Delta[1] \;\; f^d_\Delta[2] \;\; \ldots \;\; f^d_\Delta[N_{d_t}] \,], \qquad (6)

where N_{d_t} is the number of dance motion frames within dance motion segment d_t. We also calculate the mean trajectory for each dance figure label l_j, namely µ_j, by calculating for each motion feature an average value over all instances (realizations) of the dance figures labeled as l_j. These mean trajectories µ_j are required later in choreography animation, since each dance figure model h_j^d captures only the temporal dynamics of the first and second differences of the Euler angles of the key joints associated with the dance figure label l_j.

III. CHOREOGRAPHY MODELING

In this section we present the proposed choreography analysis-synthesis framework. We first describe our choreography analysis procedure, which involves statistical modeling of the music-to-choreography mapping. We then explain how this statistical modeling is used for choreography synthesis.

A. Multimodal Choreography Analysis

We perform choreography analysis through three statistical models, which together define a many-to-many mapping from musical measures to dance figures. These choreography models are: (i) musical measure models H^m, which capture many-to-one mappings from musical measures to dance figures using HMMs; (ii) the exchangeable figures model X, which captures one-to-many mappings from musical measures to dance figures, and hence represents the subjective nature of the dance choreography with possibilities in the choice of dance figures and in their organization; and (iii) the figure transition model F, which captures the intrinsic dependencies of dance figures. We note that these three choreography models constitute the "choreography analysis" block in Fig. 1.

1) Musical Measure Models (H^m): In a dance performance, musical measures that correspond to the same dance figure may exhibit variations and are usually a collection of different melodic patterns. That is, different melodic patterns can accompany the same dance figure, displaying a many-to-one mapping relation from musical measures to dance figures. We capture this many-to-one mapping by employing hidden Markov models (HMMs) to identify and model the melodic patterns corresponding to each dance figure. Specifically, we train an HMM h_j^m over the collection of measures co-occurring with the dance figure l_j, using the musical measure CSCC features F_{m_t}. Hence, we train an HMM for each dance figure in the dance performance. We define left-to-right HMM structures h_j^m with a^j_{ik} = 0 for k ∉ {i, i+1, i+2}, where a^j_{ik} is the transition probability from state q^j_i to state q^j_k. The transitions from state q^j_i to q^j_{i+2} account for the differences in measure durations. Emission distributions of the chroma-based music features are modeled by Gaussian mixture density functions with diagonal covariance matrices in each state of h_j^m. The use of Gaussian mixture densities in h_j^m enables us to capture the different melodic patterns that correspond to a particular dance figure. We denote the collection of musical measure models as H^m, i.e., H^m = {h_j^m | j = 1, ..., N}. The musical measure models H^m provide a tool to capture the many-to-one part of the many-to-many musical measure to dance figure mapping problem.
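As an illustration, a measure model h_j^m could be trained as sketched below using the hmmlearn toolkit; the toolkit choice, the number of states, and the number of mixture components are assumptions, since the paper does not name an implementation.

```python
import numpy as np
from hmmlearn.hmm import GMMHMM   # assumed third-party library; any HMM toolkit would do

def left_to_right_transmat(n_states, max_skip=2):
    """Band-structured transition matrix: a_ik = 0 unless k is in {i, i+1, ..., i+max_skip}."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        allowed = list(range(i, min(i + max_skip + 1, n_states)))
        A[i, allowed] = 1.0 / len(allowed)
    return A

def train_measure_model(feature_matrices, n_states=5, n_mix=3):
    """Train h_j^m on all measure segments co-occurring with figure l_j.
    feature_matrices: list of (num_frames x feature_dim) CSCC arrays F_{m_t}.
    State and mixture counts are illustrative, not taken from the paper."""
    model = GMMHMM(n_components=n_states, n_mix=n_mix,
                   covariance_type="diag", init_params="mcw", params="stmcw")
    model.startprob_ = np.eye(n_states)[0]              # start in the first state
    model.transmat_ = left_to_right_transmat(n_states)  # Baum-Welch keeps the structural zeros
    X = np.vstack(feature_matrices)
    lengths = [len(F) for F in feature_matrices]
    return model.fit(X, lengths)
```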

2) Exchangeable Figures Model (X): In a dance performance, it is possible that several distinct dance figures can be performed equally well along with a particular musical measure pattern, exhibiting a one-to-many mapping relation from musical measures to dance figures [1]. To represent this one-to-many mapping relation, we introduce the notion of exchangeable figure groups, each containing a collection of dance figures that can be replaced with one another without causing an artifact in a dance performance. To learn exchangeable figure groups, we cluster the measure segments in each musical piece with respect to their melodic similarities. The melodic similarity s_{ij} between two different measure segments m_i and m_j is computed as the local match score obtained from dynamic time warping (DTW) [37] of the chroma-based feature matrices F_{m_i} and F_{m_j}, corresponding to m_i and m_j, respectively, in the musical piece S_k. Then, based on the melodic similarity scores between pairs of musical measure segments in S_k, we form an affinity matrix Y_k = (y^k_{ij})_{i,j=1,...,N}, where y^k_{ij} = exp(−s_{ij}) if i ≠ j, and y^k_{ii} = 0. Finally, we apply the spectral clustering algorithm described in [38] over Y_k to cluster the measure segments in S_k. The spectral clustering algorithm in [38] assumes that the number of clusters is known a priori and employs the k-means clustering algorithm [39]. Since we do not know the number of clusters a priori, we measure the "quality" of the partition in the resulting clusters using an internal index, silhouettes [40], to determine the appropriate number of clusters. The silhouette value for each point is a measure of how similar that point is to the points in its own cluster compared to the points in the other clusters, and ranges from -1 to +1. Averaging over all the silhouette values, we compute the overall quality of the clustering for a range of cluster numbers and pick the one that results in the highest silhouette value.

We perform separate clustering for each musical piece S_k in order to increase the accuracy of musical measure clustering, since similar measure patterns are likely to occur within the same musical piece rather than spread among different musical pieces. Once we obtain clusters of measures in all musical pieces, we can then use all of the measure clusters to determine the exchangeable figure group G_j for each dance figure l_j by collecting the dance figure labels that co-appear with l_j in any of the resulting clusters. Note that a particular dance figure can appear in more than one musical piece (see also Fig. 5). Based on the exchangeable figure groups G_j, we define the exchangeable figures model x_j as an indicator random variable:

x_j(i) = I(l_i) = \begin{cases} 1, & \text{if } l_i \in G_j, \\ 0, & \text{otherwise,} \end{cases} \qquad (7)

where G_j is the exchangeable figure group associated with the dance figure l_j. The collection of x_j for all dance figure labels in L gives us the exchangeable figures model X.
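The measure clustering step can be sketched as follows. Here a plain, length-normalized DTW alignment cost stands in for the local match score of [37], and scikit-learn's spectral clustering and silhouette routines stand in for the algorithms of [38] and [40]; these substitutions and the cluster-number range are assumptions.

```python
import numpy as np
from sklearn.cluster import SpectralClustering      # stand-in for the algorithm of [38]
from sklearn.metrics import silhouette_score        # stand-in for the silhouette index of [40]

def dtw_cost(Fa, Fb):
    """Plain DTW alignment cost between two chroma feature matrices (frames x dims)."""
    d = np.linalg.norm(Fa[:, None, :] - Fb[None, :, :], axis=-1)
    D = np.full((len(Fa) + 1, len(Fb) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(Fa) + 1):
        for j in range(1, len(Fb) + 1):
            D[i, j] = d[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[-1, -1] / (len(Fa) + len(Fb))           # length normalization is an assumption

def cluster_measures(chroma_matrices, k_range=range(2, 8)):
    """Cluster the measure segments of one musical piece S_k by melodic similarity and
    pick the number of clusters with the highest mean silhouette value."""
    n = len(chroma_matrices)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            S[i, j] = S[j, i] = dtw_cost(chroma_matrices[i], chroma_matrices[j])
    Y = np.exp(-S)
    np.fill_diagonal(Y, 0.0)                         # y_ii = 0 as in the text
    best = None
    for k in k_range:
        labels = SpectralClustering(n_clusters=k, affinity="precomputed").fit_predict(Y)
        score = silhouette_score(S, labels, metric="precomputed")
        if best is None or score > best[0]:
            best = (score, labels)
    return best[1]
```

The exchangeable group G_j would then be assembled by collecting, across all pieces, the figure labels whose measures land in the same cluster as measures accompanying l_j.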

The notion of exchangeable figures is the key to reflecting the subjective nature of the dance choreography, with possibilities in the choice of dance figures and their organization, throughout the choreography estimation process. The use of the exchangeable figures model allows us to create different artistic dance performance content each time we estimate a dance choreography.

3) Figure Transition Model (F): The figure transition model is built to capture the intrinsic dependencies of the dance figure sequences within the context of dance choreographies. The intrinsic dependencies of the choreography are defined with figure-to-figure transition probabilities. The figure-to-figure transition probability distributions are modeled with n-gram language models, where the probability of the dance figure l_j at d_t given the dance figure sequence l_{i_1}, l_{i_2}, ..., l_{i_{n-1}} at d_{t-1}, d_{t-2}, ..., d_{t-n+1}, i.e., P(d_t = l_j | d_{t-1} = l_{i_1}, ..., d_{t-n+1} = l_{i_{n-1}}), defines the n-gram dance language model. This model provides a number of rules that specify the structure of a dance choreography. For instance, a dance figure that never appears after a particular sequence of n − 1 dance figures in the training video does not appear after that sequence in the synthesized choreography either. With the help of the n-gram dance language model, we can also enforce that a dance figure always follows a particular sequence of n − 1 dance figures if this is the case in the training video.
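A bigram instance of the figure transition model F could be estimated as in the sketch below; the optional smoothing term is an assumption, since the paper keeps unseen transitions at zero probability.

```python
from collections import Counter, defaultdict

def bigram_transition_model(figure_sequences, labels, smoothing=0.0):
    """Estimate P(d_t = l_j | d_{t-1} = l_i) from the annotated figure sequences.
    figure_sequences: list of label sequences, one per dance performance.
    smoothing: optional add-constant term; 0 keeps unseen transitions at zero."""
    counts = defaultdict(Counter)
    for seq in figure_sequences:
        for prev, cur in zip(seq[:-1], seq[1:]):
            counts[prev][cur] += 1
    P = {}
    for li in labels:
        total = sum(counts[li].values()) + smoothing * len(labels)
        P[li] = {lj: (counts[li][lj] + smoothing) / total if total > 0 else 0.0
                 for lj in labels}
    return P
```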

B. Multimodal Choreography Synthesis

We formulate the choreography synthesis problem as estimating a dance figure sequence from a sequence of musical measures. The core of the choreography synthesis is defined as a Viterbi decoding process on a discrete HMM, which is constructed from the musical measure models H^m and the figure transition model F. Furthermore, the exchangeable figures model X is utilized to introduce acceptable variations into the Viterbi decoding process to enrich the synthesized choreography. Based on this Viterbi decoding process, we define three separate choreography synthesis scenarios: i) single best path, ii) likely path, and iii) exchangeable path, which are explained in detail in the following subsections. We note that all these choreography synthesis scenarios are multimodal, that is, they depend on the joint statistical models of music and choreography, and they have different refinements and contributions to enrich the choreography synthesis.

Besides these three synthesis scenarios, we investigate two other reference (baseline) choreography synthesis scenarios: one using only the musical measure models H^m to map each measure segment in the test musical piece to a dance figure label (which we refer to as the acoustic-only choreography), and another using only the figure transition model F (which we refer to as the figure-only choreography). The acoustic-only choreography corresponds to a synthesis scenario in which only the correlations between musical measures and dance figure labels are taken into account, but the correlations between consecutive figures are ignored. In contrast to the acoustic-only choreography, the figure-only choreography scenario predicts the dance figure for the next measure segment only according to figure-to-figure transition probabilities, which are modeled as bigram probabilities of F, discarding the correlations between musical measures and dance figures. Note that the figure-only synthesis can be regarded as a synchronization-based technique as discussed in Section I-A, such as the ones proposed in [26], [27]. In figure-only synthesis, the dance figure sequence is generated randomly, respecting only the figure-to-figure transition probabilities to ensure visual continuity of the resulting character animation. The dance figure sequences resulting from these two baseline scenarios constitute reference choreographies that help us comparatively assess the contributions of our choreography analysis-synthesis framework.

1) Single best path synthesis: We construct a discrete HMM, C, using the musical measure models H^m and the figure transition model F. In the figure transition model F, the figure-to-figure transition probability distributions are computed with bigram models. This choice of model is due to the scale of the choreography database that we use in training; a much larger database could be used to model higher-order n-gram dance language models. The discrete HMM, C = (A, B, π), is defined with the following parameters:

• T is the number of time frames (measure segments). For each time frame (measure), the choreography synthesis process outputs exactly one dance figure label. Recall that we denote the individual dance figures as d_t and individual measures as m_t for t = 1, ..., T.

• N is the number of distinct dance figure labels l_j, where j = 1, ..., N. Dance figure labels are the outputs of the process being modeled.

• A = {a_{ij}} is the dance figure transition probability distribution with elements defined as

a_{ij} = P(d_t = l_j \,|\, d_{t-1} = l_i), \quad i, j = 1, \ldots, N, \qquad (8)

where the elements a_{ij} are the bigram probabilities from the figure transition model F and they satisfy \sum_{j=1}^{N} a_{ij} = 1.

• B = {b_t(j)} is the dance figure emission distribution for measure m_t. Elements of B are defined using the musical measure models H^m as

b_t(j) = P(F_{m_t} \,|\, h_j^m), \quad j = 1, \ldots, N; \; t = 1, \ldots, T. \qquad (9)

• π = {π_i} is the initial dance figure distribution, where

\pi_i = P(d_1 = l_i), \quad i = 1, \ldots, N. \qquad (10)

The discrete HMM C constructs a lattice structure, say M, as given in Fig. 3. The proposed choreography synthesis can be formulated as finding a path through the lattice M.

Fig. 3. Lattice structure M of the discrete HMM C.

Assuming a uniform initial figure distribution π, the single best path synthesis scenario decodes the Viterbi path along the lattice M to estimate the synthesized figure sequence r̂. The Viterbi algorithm for finding the single best path can be summarized as follows:

i. Initialization:

\varphi_j(1) = b_1(j), \quad \psi_j(1) = 0, \quad j = 1, \ldots, N. \qquad (11)

ii. Recursion: For t = 2, ..., T,

\varphi_j(t) = \max_i \{ a_{ij} \, \varphi_i(t-1) \} \, b_t(j), \quad j = 1, \ldots, N, \qquad (12)

\psi_j(t) = \arg\max_i \{ a_{ij} \, \varphi_i(t-1) \}, \quad j = 1, \ldots, N. \qquad (13)

iii. Termination:

\hat{r}_T = \arg\max_i \{ \varphi_i(T) \}. \qquad (14)

iv. Path (dance figure sequence) backtracking:

\hat{r}_t = \psi_{\hat{r}_{t+1}}(t+1), \quad t = T-1, T-2, \ldots, 1. \qquad (15)

Here φ_j(t) represents the partial likelihood score of performing the dance figure l_j at frame t, and ψ_j(t) is used to keep track of the best path for retrieving the dance figure sequence. The path r̂ = {r̂_t}_{t=1}^{T} decodes the resulting dance figure label sequence as the desired output choreography. Note that the resulting dance choreography r̂ is unique for the single best path synthesis scenario, since it is the Viterbi path along the lattice M.
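The decoding in (11)-(15) is standard Viterbi decoding over the lattice M; the sketch below works in the log domain for numerical stability, which is a presentational choice rather than part of the original formulation.

```python
import numpy as np

def viterbi_choreography(log_b, log_A):
    """Single-best-path decoding, eqs. (11)-(15), in the log domain.
    log_b: (T x N) log acoustic scores log P(F_{m_t} | h_j^m); log_A: (N x N) log bigrams."""
    T, N = log_b.shape
    phi = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    phi[0] = log_b[0]                                  # uniform initial figure distribution
    for t in range(1, T):
        scores = phi[t - 1][:, None] + log_A           # scores[i, j] = phi_i(t-1) + log a_ij
        psi[t] = scores.argmax(axis=0)
        phi[t] = scores.max(axis=0) + log_b[t]
    path = np.zeros(T, dtype=int)
    path[-1] = phi[-1].argmax()
    for t in range(T - 2, -1, -1):                     # backtracking, eq. (15)
        path[t] = psi[t + 1, path[t + 1]]
    return path                                        # indices of the dance figure labels r_hat
```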

2) Likely path synthesis: In the second synthesis scenario, we find a likely path along M in which we follow one of the likely partial paths in lieu of following the partial path that has the highest partial likelihood score at each time frame. The likely path synthesis is expected to create variations on the single best path synthesis, which results in an enriched set of synthesized choreography sequences with high likelihood scores.

We modify the recursion step of the Viterbi algorithm to define the likely path synthesis:

Recursion: For t = 2, ..., T,

i_1 = \arg\max_{i} \{ a_{ij} \, \varphi_i(t-1) \}, \qquad i_2 = \arg\max_{i \neq i_1} \{ a_{ij} \, \varphi_i(t-1) \}, \qquad (16)

\psi_j(t) = U(i_1, i_2), \quad j = 1, \ldots, N, \qquad (17)

\varphi_j(t) = \varphi_{\psi_j(t)}(t-1) \, a_{\psi_j(t) j} \, b_t(j), \quad j = 1, \ldots, N, \qquad (18)

where i_1 and i_2 are the figure indices with the top two partial path scores and U(i_1, i_2) returns one of its arguments at random with uniform probability. Note that i_1 corresponds to ψ_j(t) in (13). The likely path scenario is expected to synthesize different dance choreographies since it propagates randomly among the top two ranking transitions at each time frame. This intricately introduces variation into the choreography synthesis process.
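A single recursion step of this variant might look as follows, reusing the log-domain quantities of the previous sketch; the uniform coin flip between the top two predecessors mirrors U(i_1, i_2) in (17).

```python
import numpy as np

def likely_path_recursion(phi_prev, log_A, log_b_t, rng=np.random.default_rng()):
    """One recursion step of the likely-path variant, eqs. (16)-(18): for every figure j,
    choose uniformly between the two best predecessor figures instead of the single best."""
    scores = phi_prev[:, None] + log_A                 # scores[i, j] = phi_i(t-1) + log a_ij
    order = np.argsort(scores, axis=0)                 # predecessors sorted per column j
    i1, i2 = order[-1], order[-2]                      # top-two predecessor indices per j
    psi_t = np.where(rng.random(len(i1)) < 0.5, i1, i2)
    phi_t = scores[psi_t, np.arange(len(i1))] + log_b_t
    return phi_t, psi_t
```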

3) Exchangeable path synthesis: In this scenario, we find an exchangeable path by letting the exchangeable figures model X replace and update the single best path. Unlike the likely path synthesis, the exchangeable path scenario introduces random variations to the single best path that respect the one-to-many mappings from musical measures to dance figures as defined in X.

The exchangeable path synthesis is implemented with the following procedure:

(i) Compute the single best path synthesis for a given musical measure sequence and set the measure segment index t = 1.

(ii) The figure l_{i_1} at measure segment t is replaced with another figure l_{i^*} from its exchangeable figure group G_{i_1},

i^* = U'(\{i\} \,|\, l_i \in G_{i_1}), \qquad (19)

where U'(·) returns one of its arguments at random according to the distribution of the acoustic scores P(F_{m_t} | h_i^m) of the dance figures l_i ∈ G_{i_1}.

(iii) The rest of the figure sequence, r̂_{t+1}, ..., r̂_T, is updated by determining a new single best path using the Viterbi algorithm.

(iv) Steps (ii) and (iii) are repeated for measure segments t = 2, ..., T.

The exchangeable path synthesis yields an alternative path by modifying the single best path in the context of the exchangeable figures model. Its key difference from the likely path is that the collection of candidate dance figures that can replace a particular dance figure in the choreography, say l_j, is constrained to the dance figures for which the exchangeable figures model x_j yields 1. Hence, it is expected to introduce more acceptable variations into the synthesized choreography than the likely path.

IV. ANIMATION MODELING

In this section, we address character animation of dance figures to visualize and evaluate the proposed choreography analysis and synthesis framework. First, we define a dance figure model, which captures the variations in the way particular dance figures are performed, by modeling the motion trajectory of a dance figure via an HMM. Then, we define a character animation system, which generates a sequence of dance motion features from a given sequence of dance figures, so as to animate a 3D character model.

A. Dance Figure Models (H^d)

The way a dancer performs a particular dance figure may exhibit variations in time in a dance performance. Therefore, it is important to model the temporal statistics of each dance figure to capture the variations in the dance performance. Note that these models will also capture the personalized dance figure patterns of a dancer. We use the set of motion features F_{d_t} to train an HMM, h_j^d, for each dance figure label l_j to capture the dynamic behavior of the dancing body. Since a dance figure typically contains a well-defined sequence of body movements, we employ a left-to-right HMM structure (i.e., a^j_{ik} = 0 for k ∉ {i, i+1}, where a^j_{ik} is the transition probability from state q^j_i to state q^j_k in h_j^d) to model each dance figure. Emission distributions of the motion parameters are modeled by a Gaussian density function with a full covariance matrix in each state of h_j^d. We denote the collection of dance figure models as H^d, i.e., H^d = {h_j^d | j = 1, ..., N}.

B. Character Animation

The synthesized choreography r̂ (i.e., {r̂_t}_{t=1}^{T}) specifies the label sequence of dance figures to be performed with each measure segment, whose duration is known beforehand in the proposed framework. The body posture parameters corresponding to each dance figure in the synthesized choreography {r̂_t}_{t=1}^{T} are then generated such that they fit the statistical dance figure models H^d.

To generate body posture parameters using the dance figure model h_j^d for the dance figure l_j, we first determine the number of dance motion frames L required for the given segment duration. Next, we distribute the required number of motion frames L among the states of the dance figure model h_j^d according to the expected state occupancy duration:

o^j_i = \frac{1}{1 - a^j_{ii}}, \quad 1 \leq i \leq P, \qquad (20)

where o^j_i is the expected duration in state q^j_i, a^j_{ii} is the self-state-transition probability of state q^j_i (assuming a^j_{ii} \neq 1), and P is the number of states in h_j^d.

In order to avoid generation of noisy parameters, we first increase the time resolution of the dance motion by oversampling the dance motion model. That is, we generate parameters for a multiple of L, say KL, where K is an integer scale factor. Then, we generate the body motion parameters along the states of h_j^d according to the distribution of the KL motion frames to these states, using the corresponding Gaussian distribution at each state. To reverse the effect of oversampling, we perform downsampling by K, which eventually yields smoother state transitions, and hence more realistic parameter generation that avoids motion jerkiness.
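The parameter generation step can be sketched as below. The proportional allocation of frames to states, the oversampling factor K = 4, and decimation as the downsampling step are simplifying assumptions; the mean trajectory µ_j still has to be added afterwards, as explained next.

```python
import numpy as np

def synthesize_figure_trajectory(L, state_means, state_covs, self_trans, K=4,
                                 rng=np.random.default_rng()):
    """Generate body posture parameter deltas for one dance figure: distribute K*L frames
    over the HMM states in proportion to the expected occupancies o_i = 1 / (1 - a_ii),
    draw each frame from the state's full-covariance Gaussian, then downsample by K."""
    occupancy = 1.0 / (1.0 - np.asarray(self_trans))   # expected duration per state, eq. (20)
    frames_per_state = np.round(K * L * occupancy / occupancy.sum()).astype(int)
    samples = []
    for mean, cov, n in zip(state_means, state_covs, frames_per_state):
        samples.append(rng.multivariate_normal(mean, cov, size=max(n, 1)))
    dense = np.vstack(samples)                         # oversampled trajectory (about K*L frames)
    idx = np.linspace(0, len(dense) - 1, L).round().astype(int)
    return dense[idx]                                  # L frames of delta Euler angles
```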

The dance figure models h_j^d are trained over the first and second differences of the Euler angles of the joints, which are defined in Section II-B. Therefore, to obtain the final set of body posture parameters for a dance figure l_j, we simply need to sum the generated first differences with the mean trajectory associated with l_j, i.e., µ_j. Each plot in Fig. 4 depicts a synthesized trajectory against two sample trajectories for one of the motion features from a dance figure in the database, along with the mean trajectory associated with the same motion feature from the same dance figure. The lengths of the horizontal solid lines represent the expected state durations (in terms of the number of frames), and the corresponding angular values represent the means of the Gaussian distributions associated with the particular motion feature in each state of the HMM structure trained for the particular dance figure. In these plots, the two sample trajectories exemplify the temporal variations between different realizations of the same dance figure. Deriving from the trained dance figure HMMs, the synthesized dance figure trajectories mimic the underlying temporal dynamics of a given dance figure. Hence, modeling the temporal variations using HMMs for each dance figure allows us to synthesize more realistic and personalized dance motion trajectories.

Fig. 4. The plots compare a synthesized trajectory with two sample trajectories, as well as with the mean trajectory, for different motion features from different dance figures in the database (the panels include, e.g., Spine2.Zrotation for l_2, RightToeBase.Xrotation for l_5, LeftUpLegRoll.Xrotation for l_23, RightArm.Zrotation for l_26, RightUpLeg.Zrotation for l_27, and LeftUpLeg.Zrotation for l_28). The expected state durations, all associated with the HMM structure trained for the same dance figures, are also displayed with horizontal solid lines. The corresponding angular values of each horizontal solid line are the means of the Gaussian distributions associated with the particular motion feature in each state of the HMM structure trained for the given dance figure.

After repeating the described procedure for each dance figure in the synthesized choreography, the body posture parameters at the dance figure boundaries are smoothed via cubic interpolation within a ∆-neighborhood of each dance figure boundary in order to generate smoother figure-to-figure transitions.

We note that the use of HMMs for dance figure synthesis provides us with the ability to introduce random variations in the synthesized body motion patterns for each dance figure. These variations make the synthesis results look more natural, due to the fact that humans perform slightly varying dance figures at different times for the same dance performance.

V. EXPERIMENTS AND RESULTS

We investigate the effectiveness of our choreography analysis and synthesis framework using the Turkish folk dance Kasik.¹ The Kasik database consists of 20 dance performances with 20 different musical pieces, with a total duration of 36 minutes. There are 31 different dance figures (i.e., N = 31) and a total of 1258 musical measure segments (i.e., T = 1258). Fig. 5 shows the distribution of dance figures over the different musical pieces, where each column represents a dance figure label l_j and each row represents a musical piece S_k. Hence, entries with different colors in a column indicate that the same figure can be performed with different melodic patterns, whereas entries with different colors in a row indicate that different dance figures can be performed with the same melodic pattern. Therefore, Fig. 5 can be seen as a means to provide evidence for our basic assumption that there is a many-to-many relationship between dance figures and musical measures.

¹Kasik means spoon in English. The dance is named so since the dancers clap spoons while dancing.

Fig. 5. The distribution of dance figures to musical pieces is visualized using an image plot. Columns of the plot represent the dance figures (l_j) whereas the rows represent the musical pieces (S_k) in the database. Note that l and S are dropped in the figure for clarity of the representation. Consequently, each cell in the plot indicates how many times a particular dance figure is performed with the corresponding musical piece. Cells with different gray values in a column imply that the same dance figure can be accompanied by different musical pieces. Similarly, cells with different gray values in a row imply that different dance figures can be performed with the same musical piece.

We follow a 5-fold cross-validation procedure in the experimental evaluations. We train the musical measure models with four fifths of the musical audio data in the analysis part and use these musical measure models in the process of choreography estimation for the remaining one fifth of the musical audio data in the synthesis part. We repeat this procedure five times, each time using different parts of the musical audio data for training and testing. In this way, we synthesize a new dance choreography for the entire musical audio data.

A. Objective Evaluation Results

We define the following four assessment levels to evaluateeach dance figure label rt in the synthesized figure sequencer, compared to the respective figure label rt in the originaldance choreography r, assigned by the expert:

• L0 (Exact-match): r̂t is marked as L0 if r̂t matches rt.

¹Kasik means spoon in English. The dance is named so since the dancers clap spoons while dancing.

[Fig. 6 image: 31 × 31 matrix over dance figure pairs (figures 1–31 on both axes); each cell is colored by its assessment level L0–L3.]

Fig. 6. The matrix plot demonstrates the assessment levels associated with any pair of dance figures in the database. Assessment levels are indicated with different colors. The pairs of dance figures that fall into assessment level L1 in a row correspond to the group of exchangeable dance figures for that particular dance figure. For instance, by looking at the first row, one can say that the dance figure l1 is exchangeable with the dance figures l2, l5, l7, l14 and l15. In other words, {l2, l5, l7, l14, l15} is the group of exchangeable figures for l1, i.e., G1.

• L1 (X-match): r̂t is marked as L1 if r̂t does not match rt, but it is in rt's exchangeable figure group Grt; i.e., r̂t ∈ Grt.

• L2 (Song-match): r̂t is marked as L2 if r̂t neither matches rt nor is in Grt, but r̂t and rt are performed within the same musical piece; i.e., r̂t, rt ∈ Sk.

• L3 (No-match): r̂t is marked as L3 if it is not marked as one of L0 through L2.

Fig. 6 displays all assessment levels associated with any possible pairing of dance figures in a single matrix. Note that the matrix in Fig. 6 defines a distance metric on dance figure pairs by mapping the four assessment levels L0 through L3 into penalty scores from 0 to 3, respectively. Hence, in this distance metric, low penalty scores indicate desirable choreography synthesis results. Average penalty scores are reported to measure the "goodness" (coherence) of the resulting dance choreography.
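For concreteness, the assessment levels and the resulting average penalty score can be computed directly from the synthesized and original label sequences as sketched below. The container names are illustrative, and the L2 test is one reading of the song-match criterion (the two labels co-occur with at least one common musical piece); the paper does not spell out these implementation details.

```python
def assessment_level(r_hat, r, exchangeable, pieces_of):
    """Penalty (0..3) of a synthesized label r_hat against the original label r.

    exchangeable : dict, figure label -> set of exchangeable figure labels (group G).
    pieces_of    : dict, figure label -> set of musical pieces the figure occurs with.
    """
    if r_hat == r:
        return 0                                            # L0: exact match
    if r_hat in exchangeable.get(r, set()):
        return 1                                            # L1: exchangeable figure (X-match)
    if pieces_of.get(r_hat, set()) & pieces_of.get(r, set()):
        return 2                                            # L2: both figures occur with the same piece
    return 3                                                # L3: no match

def average_penalty_score(synth, orig, exchangeable, pieces_of):
    """Mean penalty over a synthesized choreography against the original one."""
    scores = [assessment_level(rh, ro, exchangeable, pieces_of)
              for rh, ro in zip(synth, orig)]
    return sum(scores) / len(scores)
```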

Recall that we propose three alternative choreography synthesis scenarios together with the two other reference choreography synthesis techniques, as discussed in Section III-B. The average penalty scores of these five choreography synthesis scenarios are given in Table I. Furthermore, the distribution of the number of figures that fall into each assessment level for all synthesis scenarios is given in Fig. 7. The average penalty scores for the reference acoustic-only and figure-only choreography synthesis scenarios are 0.82 and 2.07, respectively. The high average penalty score of the figure-only choreography is mainly due to the randomness in this synthesis technique.



TABLE I
THE AVERAGE PENALTY SCORES (APS) OF VARIOUS CHOREOGRAPHY SYNTHESIS SCENARIOS

Synthesis Scenario    APS
Single Best Path      0.56
Likely Path           0.91
Exchangeable Path     0.63
Acoustic-Only         0.82
Figure-Only           2.07

[Fig. 7 bar chart: percentage of dance figures (0–60%) falling into each assessment level L0–L3, shown for the best path, likely path, exchangeable path, acoustic-only, and figure-only scenarios.]

Fig. 7. The percentage of figures that fall into each assessment level for the five different synthesis scenarios.

Hence, the unimodal learning phase of the figure-only choreography synthesis, which takes into account only the figure-to-figure transition probabilities, does not carry sufficient information to automatically generate dance choreographies which are similar to the training Kasik database. On the other hand, the acoustic-only choreography, which employs the musical measure models Hm for synthesis, attains a lower average penalty score. We also note that the average dance figure recognition rate using the musical measure models Hm through five-fold cross-validation is obtained as 49.05%. This indicates that the musical measure models learn the correlation between measures and figures. However, the acoustic-only choreography synthesis, which depends only on the correlation of measures and figures, fails to sustain motion continuity at figure transition boundaries. This creates visually unacceptable character animation for the acoustic-only choreography.

The proposed single best path synthesis scenario refines the acoustic-only choreography with the inclusion of the figure transition model F, which defines a bigram model for figure-to-figure transitions. The average penalty score of the single best path synthesis scenario is 0.56, which is the smallest average penalty score among all scenarios. This is expected, since the single best path synthesis generates the optimal Viterbi path along the multimodal lattice structure M. The likely path synthesis introduces variation into the single best path synthesis. The average penalty score of the likely path synthesis increases to 0.91. This increase is an indication of the variation introduced to the optimal Viterbi path. The exchangeable path synthesis refines the likely path synthesis by introducing more acceptable random variations to the single best path synthesis. Recall that random variations of the exchangeable path synthesis depend on the exchangeable figures model X. The average penalty score of the exchangeable path synthesis is 0.63, which is an improvement compared to the likely path synthesis.

In Fig. 7, we observe that among all the assessment levels, the levels L0 and L1 are indicators of the diversity of alternative dance figure choreographies, rather than being error indicators, whereas the assessment levels L2 and L3 indicate an error in the dance choreography synthesis process. In this context, we observe that only 31% of the figure-only choreography and only 83% of the acoustic-only choreography fall into the first three assessment levels. On the other hand, using the mapping obtained by our framework increases this ratio to 94% and 92% for the single best path and the exchangeable path synthesis scenarios, respectively. The percentage drops to 80% for the likely path synthesis scenario, yet it is still a high percentage of the entire dance sequence.

B. Subjective Evaluation Results

We performed a subjective A/B comparison test using the music-driven dance animations to measure the opinions of the audience on the coherence of the synthesized dance choreographies with the accompanying music. During the test, the subjects were asked to indicate their preference for each given A/B test pair of synthesized dance animation segments on a scale of (-2, -1, 0, 1, 2), where the scale corresponds to strongly prefer A, prefer A, no preference, prefer B, and strongly prefer B, respectively. We compared dance animation segments from five different choreographies; namely, the original, single best path, likely path, exchangeable path and figure-only choreographies. We therefore had ten possible pairings of different dance choreographies, e.g., original vs. single best path, or likely path vs. figure-only, etc. For each possible pair of choreographies, we used short audio segments from three different musical pieces from the Kasik database to synthesize three A/B pairs of dance animation video clips for the respective dance choreographies. This yielded a total of thirty A/B pairs of dance animation segments. We also included one A/B pair of dance animation segments pairing each choreography with itself, i.e., original vs. original, etc., in order to test whether the subjects were careful enough and showed almost no preference over the five such possible self-pairings of the aforementioned choreographies. As a result, we extracted 35 short segments from the audiovisual database, where each segment was approximately 15 seconds long. We picked at most two non-overlapping segments from each musical piece in order to make a full coverage of the audiovisual database in the subjective A/B comparison test.

The subjective tests were performed with 18 subjects. The average preference scores for all comparison sets are presented in Table II. Note that the rows and the columns of Table II respectively correspond to A and B of the A/B pairs; scores greater than zero indicate a preference toward B.



TABLE II
THE SUBJECTIVE A/B PAIR COMPARISON TEST RESULTS

A \ B                     O     SBP    LP     XP     FO
Original (O)             0.1   -0.6    0.1    0.7   -0.2
Single Best Path (SBP)          0.2    0.2    0.7   -0.2
Likely Path (LP)                       0.2   -0.3   -0.7
Exchangeable Path (XP)                        0.1   -0.7
Figure-Only (FO)                                     0.0

The first observation is that the animations for the original choreography and for the choreographies resulting from the proposed three synthesis scenarios (i.e., single best path, likely path and exchangeable path choreographies) are preferred over the animations for the figure-only choreography. We also note that the likely path and exchangeable path choreography animations are strongly preferred against the figure-only choreography animation. This observation is evidence of the fact that the audience is generally appealed by variations in the dance choreography as long as the overall choreography is coherent with the accompanying music. Hence, we observe the likely path and the exchangeable path synthesis to be the most preferable scenarios in the subjective tests, and they manage to create alternative choreographies that are coherent and appealing to the audience.

C. Discussion

The objective evaluations carried out using the assessment levels in Section V-A indicate that the most successful choreography synthesis scenarios are the single best path and the exchangeable path scenarios. We note that the exchangeable path synthesis as well as the likely path can be seen as variations or refinements of the single best path synthesis approach. On the other hand, according to the subjective evaluations presented in Section V-B, the most preferable scenarios are the likely path and the exchangeable path. Hence, the exchangeable path synthesis, being among the top two according to both objective and subjective evaluations, can be regarded as our best synthesis approach. Note also that the exchangeable path synthesis includes all the statistical models defined in Section III.

The other two scenarios (acoustic-only and figure-only) are mainly used to demonstrate the effectiveness of the musical measure models and the figure transition model; hence, they both serve as reference synthesis methods. The acoustic-only choreography, however, depends only on the correlations between measures and figures, and therefore fails to sustain motion continuity at figure-to-figure boundaries, which is indispensable to create realistic animations. Hence, we have chosen the figure-only synthesis as the baseline method to compare with our best result, i.e., with the exchangeable path synthesis, and prepared a demo video (submitted as supplemental material) that compares side by side the animations resulting from these two synthesis scenarios for two musical pieces in the Kasik database.

We have also prepared another video which demonstrates the likely path, the exchangeable path and the single best path synthesis results with respect to sample original dance performances. The demo video starts with a long excerpt from the original and the synthesized choreographies driven by a musical piece that is available in the Kasik database. The long excerpt is followed by several short excerpts from the original and the synthesized choreographies driven by the musical pieces that are available in the Kasik database. The demo is concluded with two long excerpts from the synthesized choreographies driven by two musical pieces that are not available in the Kasik database. Both demo videos are also available online [41].

VI. CONCLUSIONS

We have described a novel framework for music-driven dance choreography synthesis and animation. For this purpose, we construct a many-to-many statistical mapping from musical measures to dance figures based on the correlations between dance figures and musical measures as well as the correlations between successive dance figures in terms of figure-to-figure transition probabilities. We then use this mapping to synthesize a music-driven sequence of dance figure labels via a constraint-based dynamic programming procedure. With the help of the exchangeable figures notion, the proposed framework is able to yield a variety of different dance figure sequences. These output sequences of dance figures can be considered as alternative dance choreographies that are in synchrony with the driving music signal. The subjective evaluation tests indicate that the resulting music-driven dance choreographies are plausible and compelling to the audience. To further evaluate the synthesis results, we have also devised an objective assessment scheme that measures the "goodness" of a synthesized dance choreography with respect to the original choreography.

Although we have demonstrated our framework on a folk dance database, the proposed music-driven dance animation method can also be applied to other dance genres such as ballroom dances, Latin dances and hip-hop, as long as the dance performance is musical measure-based (i.e., the metric orders in the course of music and dance structure coincide [1]), and the dance database contains a sufficient amount of data to train the statistical models employed in our framework. One possible source of problems in our current framework might be dance genres which do not have a well-defined set of dance figures. Hip-hop, which is in fact a highly measure-based dance genre, is a good example of this. In such cases, figure annotation may become a very tedious task due to possibly very large variations in the way a particular movement (supposedly a dance figure) is performed. More importantly, if the dance genre does not have a well-defined set of dance figures, the 3D motion capture data needed to train the dance figure models (described in Section IV-A) must be carefully prepared, since otherwise joining up individual figures smoothly during character animation can be very difficult and the continuity of the synthesized dance motion may not be guaranteed.

The performance of the proposed framework strongly depends on the quality, the complexity and the size of the audiovisual dance database. For instance, higher-order n-gram models can be integrated in the presence of sufficient training data, to better exploit the intrinsic dependencies of the dance figures. Currently, we employ only bigrams (with n = 2) to model the intrinsic dependencies of the dance figures. This choice of model is mainly due to the scale of the choreography database that we use in training.



We also note that, in our experiments, the bigram statistics proved to be sufficient for the current state of the framework and for the particular Turkish folk dance genre, i.e., Kasik.
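For reference, a bigram figure transition model of this kind can be estimated by simple counting over the annotated figure sequences in the training set; the add-one smoothing used below is an illustrative choice and is not claimed to be the paper's estimator.

```python
from collections import defaultdict

def estimate_bigrams(sequences, n_figures):
    """Bigram probabilities P(l_j | l_i) from figure-label sequences, with add-one smoothing.

    sequences : list of figure-label sequences (integers in 0..n_figures-1),
                one sequence per training dance performance.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for prev, cur in zip(seq[:-1], seq[1:]):
            counts[prev][cur] += 1
    probs = {}
    for i in range(n_figures):
        total = sum(counts[i].values()) + n_figures          # add-one smoothing
        probs[i] = {j: (counts[i][j] + 1) / total for j in range(n_figures)}
    return probs
```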

Our music-driven dance animation scheme currently supports only the single-dancer scenario, but the framework can be extended to handle multiple dancers as well by learning the correlations between the movements of the dancers through the use of additional statistical models, which would, however, increase the complexity of the overall learning process. Having multiple dancers will also increase the complexity of the animation, since the spatial positioning of the dancers relative to each other will then also have to be taken into account.

The proposed framework currently requires expert input and musical transcription prior to the audiovisual feature extraction and modeling tasks. This tedious pre-processing can be eliminated by introducing automatic measure/dance figure segmentation capability into the framework. However, such automatic segmentation techniques are not yet available in the literature, and they seem to remain open research areas in the near future. In this study, HMM structures are used to model the dance motion trajectories, since they can represent variations among different realizations of dance figures in personalized dance performances. However, one can consider other methods such as style machines that will also represent stylistic variations associated with dance figures.

We define a dance figure as the dance motion trajectory corresponding to a single measure segment. The choice of measures as elementary music primitives simplifies the task of statistical modeling and allows us to use the powerful HMM framework for music-to-dance mapping. However, this choice can also be seen as a limiting assumption that ignores higher levels of semantics and correlations which might exist in a musical piece, such as chorus and verses. This current limitation of our framework could be addressed by using, for example, hierarchical statistical modeling tools and/or higher-order n-grams (with n > 2). Yet, modeling higher levels of semantics remains an open challenge for further research.

The proposed framework can trigger interdisciplinary studies with the collaboration of dance artists, choreographers and computer scientists. Certainly, it has the potential of creating appealing applications, such as fast evaluation of dance choreographies, dance tutoring, entertainment, and, more importantly, digital preservation of folk dance heritage by safeguarding irreplaceable information that tends to perish. As a final remark, we think that the proposed framework can be modified to be used for other multimodal applications such as speech-driven facial expression or body gesture synthesis and animation.

ACKNOWLEDGMENT

This work has been supported by TUBITAK under project EEEAG-106E201 and the COST 2102 action.

REFERENCES

[1] W. C. Reynolds, "Foundations for the analysis of the structure and form of folk dance: A syllabus," Yearbook of the International Folk Music Council, vol. 6, pp. 115–135, 1974. [Online]. Available: http://www.jstor.org/stable/767728

[2] S. Gao and C.-H. Lee, "An adaptive learning approach to music tempo and beat analysis," Acoustics, Speech, and Signal Processing, Proc. IEEE Int. Conf. on, vol. 4, pp. 237–240, 2004.

[3] D. P. W. Ellis, "Beat tracking by dynamic programming," Journal of New Music Research, vol. 36, no. 1, pp. 51–60, 2007.

[4] M. F. McKinney, D. Moelants, M. E. P. Davies, and A. Klapuri, "Evaluation of audio beat tracking and music tempo extraction algorithms," Journal of New Music Research, vol. 36, no. 1, pp. 1–16, 2007.

[5] A. Klapuri, A. Eronen, and J. Astola, "Analysis of the meter of acoustic musical signals," Audio, Speech, and Language Processing, IEEE Transactions on, vol. 14, no. 1, pp. 342–355, Jan. 2006.

[6] M. Gainza, "Automatic musical meter detection," in Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on, 2009, pp. 329–332.

[7] T. Fujishima, "Realtime chord recognition of musical sound: A system using Common Lisp Music," in Proceedings of the International Computer Music Conference, 1999, pp. 464–467.

[8] K. Lee and M. Slaney, "Automatic chord recognition from audio using a supervised HMM trained with audio-from-symbolic data," in AMCMM '06: Proceedings of the 1st ACM Workshop on Audio and Music Computing Multimedia. New York, NY, USA: ACM, 2006, pp. 11–20.

[9] D. Ellis and G. Poliner, "Identifying 'cover songs' with chroma features and dynamic programming beat tracking," in Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, vol. 4, 2007, pp. IV-1429–IV-1432.

[10] S. Kim, P. Georgiou, and S. Narayanan, "A robust harmony structure modeling scheme for classical music opus identification," in Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on, 2009, pp. 1961–1964.

[11] C. Bregler, S. M. Omohundro, M. Covell, M. Slaney, S. Ahmad, D. A. Forsyth, and J. A. Feldman, "Probabilistic models of verbal and body gestures," in Computer Vision in Man-Machine Interfaces. Cambridge University Press, 1998, pp. 267–290.

[12] O. Arikan and D. A. Forsyth, "Interactive motion generation from examples," ACM Trans. Graph., vol. 21, no. 3, pp. 483–490, 2002.

[13] L. Kovar, M. Gleicher, and F. Pighin, "Motion graphs," ACM Trans. Graph., vol. 21, no. 3, pp. 473–482, 2002.

[14] Y. Li, T. Wang, and H.-Y. Shum, "Motion texture: a two-level statistical model for character motion synthesis," ACM Trans. Graph., vol. 21, no. 3, pp. 465–472, 2002.

[15] M. Brand and A. Hertzmann, "Style machines," in SIGGRAPH '00: Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques. New York, NY, USA: ACM Press/Addison-Wesley Publishing Co., 2000, pp. 183–192.

[16] J. Min, H. Liu, and J. Chai, "Synthesis and editing of personalized stylistic human motion," in I3D '10: Proceedings of the 2010 ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games. New York, NY, USA: ACM, 2010, pp. 39–46.

[17] A. Ruiz and B. Vachon, "Three learning systems in the reconnaissance of basic movements in contemporary dance," World Automation Congress, 2002. Proceedings of the 5th Biannual, vol. 13, pp. 189–194, 2002.

[18] C. Bregler, M. Covell, and M. Slaney, "Video rewrite: driving visual speech with audio," in Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques, ser. SIGGRAPH '97. New York, NY, USA: ACM Press/Addison-Wesley Publishing Co., 1997, pp. 353–360. [Online]. Available: http://dx.doi.org/10.1145/258734.258880

[19] T. Chen, "Audiovisual speech processing," IEEE Signal Processing Magazine, vol. 18, no. 1, pp. 9–21, 2001.

[20] M. Brand, "Voice puppetry," in Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, ser. SIGGRAPH '99. New York, NY, USA: ACM Press/Addison-Wesley Publishing Co., 1999, pp. 21–28. [Online]. Available: http://dx.doi.org/10.1145/311535.311537

[21] Y. Li and H.-Y. Shum, "Learning dynamic audio-visual mapping with input-output hidden Markov models," Multimedia, IEEE Transactions on, vol. 8, no. 3, pp. 542–549, June 2006.

[22] J. Xue, J. Borgstrom, J. Jiang, L. Bernstein, and A. Alwan, "Acoustically-driven talking face synthesis using dynamic Bayesian networks," in Multimedia and Expo, 2006 IEEE International Conference on, July 2006, pp. 1165–1168.

[23] M. E. Sargin, Y. Yemez, E. Erzin, and A. M. Tekalp, "Analysis of head gesture and prosody patterns for prosody-driven head-gesture animation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, pp. 1330–1345, Aug. 2008. [Online]. Available: http://dl.acm.org/citation.cfm?id=1383054.1383274



[24] M. Neff, M. Kipp, I. Albrecht, and H.-P. Seidel, "Gesture modeling and animation based on a probabilistic re-creation of speaker style," ACM Trans. Graph., vol. 27, pp. 5:1–5:24, Mar. 2008. [Online]. Available: http://doi.acm.org/10.1145/1330511.1330516

[25] S. Levine, P. Krahenbuhl, S. Thrun, and V. Koltun, "Gesture controllers," ACM Trans. Graph., vol. 29, pp. 124:1–124:11, July 2010. [Online]. Available: http://doi.acm.org/10.1145/1778765.1778861

[26] M. Cardle, L. Barthe, S. Brooks, and P. Robinson, "Music-driven motion editing: local motion transformations guided by music analysis," Eurographics UK Conference, Annual, vol. 0, pp. 38–44, 2002.

[27] H.-C. Lee and I.-K. Lee, "Automatic synchronization of background music and motion in computer animation," in Computer Graphics Forum, vol. 24, 2005, pp. 353–361.

[28] T.-H. Kim, S. I. Park, and S. Y. Shin, "Rhythmic-motion synthesis based on motion-beat analysis," ACM Trans. Graph., vol. 22, no. 3, pp. 392–401, 2003.

[29] G. Alankus, A. A. Bayazit, and O. B. Bayazit, "Automated motion synthesis for dancing characters," Comput. Animat. Virtual Worlds, vol. 16, no. 3-4, pp. 259–271, 2005.

[30] T. Shiratori, A. Nakazawa, and K. Ikeuchi, "Dancing-to-music character animation," Computer Graphics Forum, vol. 25, no. 3, pp. 449–458, 2006.

[31] J. W. Kim, H. Fouad, J. L. Sibert, and J. K. Hahn, "Perceptually motivated automatic dance motion generation for music," Comput. Animat. Virtual Worlds, vol. 20, no. 2-3, pp. 375–384, 2009.

[32] F. Ofli, Y. Demir, E. Erzin, Y. Yemez, and A. M. Tekalp, "Multicamera audio-visual analysis of dance figures," Multimedia and Expo, 2007 IEEE Int. Conf. on, pp. 1703–1706, 2007.

[33] F. Ofli, Y. Demir, E. Erzin, Y. Yemez, A. M. Tekalp, K. Balci, I. Kiziloglu, L. Akarun, C. Canton-Ferrer, T. J., E. Bozkurt, and A. Erdem, "An audio-driven dancing avatar," Journal on Multimodal User Interfaces, vol. 2, no. 2, pp. 93–103, Sep. 2008.

[34] F. Ofli, E. Erzin, Y. Yemez, and A. M. Tekalp, "Multi-modal analysis of dance performances for music-driven choreography synthesis," in Acoustics, Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, 2010, pp. 2466–2469.

[35] R. Shepard, "Circularity in judgements of relative pitch," Journal of the Acoustical Society of America, vol. 36, no. 12, 1964.

[36] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," Acoustics, Speech and Signal Processing, IEEE Transactions on, vol. 28, no. 4, pp. 357–366, Aug. 1980.

[37] H. Sakoe and S. Chiba, "Dynamic programming algorithm optimization for spoken word recognition," Acoustics, Speech and Signal Processing, IEEE Transactions on, vol. 26, no. 1, pp. 43–49, Feb. 1978.

[38] A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," in Advances in Neural Information Processing Systems 14. MIT Press, 2001, pp. 849–856.

[39] J. B. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. University of California Press, 1967, pp. 281–297.

[40] P. Rousseeuw, "Silhouettes: a graphical aid to the interpretation and validation of cluster analysis," J. Comput. Appl. Math., vol. 20, no. 1, pp. 53–65, 1987.

[41] F. Ofli, E. Erzin, Y. Yemez, and A. Tekalp, "Music-driven dance choreography synthesis demo," Internet: http://mvgl.ku.edu.tr/Learn2Dance, Aug. 2011.

Ferda Ofli (S'07–M'11) received B.Sc. degrees both in Electrical and Electronics Engineering and Computer Engineering, and the Ph.D. degree in Electrical Engineering from Koc University, Istanbul, Turkey, in 2005 and 2010, respectively. He is currently a postdoctoral researcher in the Tele-Immersion Group of the University of California at Berkeley, Berkeley, CA. His research interests span the areas of multimedia signal processing, computer vision, pattern recognition and machine learning. He received the Graduate Studies Excellence Award in 2010 for outstanding academic achievement at Koc University.

Engin Erzin (S'88–M'96–SM'06) received his Ph.D., M.Sc., and B.Sc. degrees from Bilkent University, Ankara, Turkey, in 1995, 1992 and 1990, respectively, all in Electrical Engineering. During 1995-1996, he was a postdoctoral fellow in the Signal Compression Laboratory, University of California, Santa Barbara. He joined Lucent Technologies in September 1996, and he was with the Consumer Products for one year as a Member of Technical Staff of the Global Wireless Products Group. From 1997 to 2001, he was with the Speech and Audio Technology Group of the Network Wireless Systems. Since January 2001, he has been with Koc University, Istanbul, Turkey. His research interests include speech signal processing, audio-visual signal processing, human-computer interaction and pattern recognition. He is serving as Associate Editor of the IEEE Transactions on Audio, Speech and Language Processing (2010-2013).

Yucel Yemez received the B.Sc. degree from Middle East Technical University, Ankara, in 1989, and the M.Sc. and Ph.D. degrees from Bogazici University, Istanbul, in 1992 and 1997, respectively, all in electrical engineering. From 1997 to 2000, he was a postdoctoral researcher in the Image and Signal Processing Department of Telecom Paris (ENST). Currently, he is an associate professor in the Computer Engineering Department at Koc University, Istanbul. His research is focused on various fields of computer vision and graphics.



A. Murat Tekalp (S'80–M'84–SM'91–F'03) received the M.Sc. and Ph.D. degrees in electrical, computer, and systems engineering from Rensselaer Polytechnic Institute (RPI), Troy, NY, in 1982 and 1984, respectively. He was with Eastman Kodak Company, Rochester, NY, from December 1984 to June 1987, and with the University of Rochester from July 1987 to June 2005, where he was promoted to Distinguished University Professor. Since June 2001, he has been a Professor at Koc University, Istanbul, Turkey. His research interests are in the area of digital image and video processing, including video compression and streaming, motion-compensated video filtering for high resolution, video segmentation, content-based video analysis and summarization, 3DTV/video processing and compression, multicamera surveillance video processing, and protection of digital content. He authored the book Digital Video Processing (Englewood Cliffs, NJ: Prentice-Hall, 1995) and holds seven U.S. patents. His group contributed technology to the ISO/IEC MPEG-4 and MPEG-7 standards. Dr. Tekalp was named Distinguished Lecturer by the IEEE Signal Processing Society in 1998, and was awarded a Fulbright Senior Scholarship in 1999. He received the TUBITAK Science Award (the highest scientific award in Turkey) in 2004. He chaired the IEEE Signal Processing Society Technical Committee on Image and Multidimensional Signal Processing (January 1996 - December 1997). He served as an Associate Editor for the IEEE Transactions on Signal Processing (1990 to 1992), the IEEE Transactions on Image Processing (1994 to 1996), and the Kluwer journal Multidimensional Systems and Signal Processing (1994 to 2002). He was an Area Editor for Graphical Models and Image Processing (1995 to 1998). He was also on the Editorial Board of the Academic Press journal Visual Communication and Image Representation (1995 to 2002). He was appointed as the Special Sessions Chair for the 1995 IEEE International Conference on Image Processing, the Technical Program Co-Chair for IEEE ICASSP 2000 in Istanbul, the General Chair of the IEEE International Conference on Image Processing (ICIP) in Rochester in 2002, and Technical Program Co-Chair of EUSIPCO 2005 in Antalya, Turkey. He is the Founder and First Chairman of the Rochester Chapter of the IEEE Signal Processing Society. He was elected as the Chair of the Rochester Section of IEEE for 1994 to 1995. At present, he is the Editor-in-Chief of the EURASIP journal Signal Processing: Image Communication (Elsevier). He is serving as the Chairman of the Electronics and Informatics Group of the Turkish Science and Technology Foundation (TUBITAK) and as an independent expert to review projects for the European Commission.