


AniDance : Real-Time Dance Motion Synthesize to the Song

Taoran Tang∗, Tsinghua University, Beijing, China, [email protected]

Hanyang Mao∗, Tsinghua University, Beijing, China, [email protected]

Jia Jia†, Tsinghua University, Beijing, China, [email protected]

ABSTRACT
In this paper, we present a demo named AniDance that can synthesize dance motions with melody in real time. When users sing a song or play one on their phone to AniDance, their melody will drive a 3D-space character to dance, creating a lively dance animation. In practice, we construct a music-oriented 3D-space dance motion dataset by capturing real dance performances and use an LSTM-autoencoder to identify the relation between music and dance. Based on these technologies, users can create valid choreographies that are capable of musical expression, which can promote their learning ability and interest in dance and music.

CCS CONCEPTS
• Human-centered computing → HCI theory, concepts and models; • Computing methodologies → Supervised learning by regression;

KEYWORDS
Multisensory interaction, Motion Synthesis, LSTM, Autoencoder, Music-dance dataset, 3D motion capture

ACM Reference format:
Taoran Tang, Hanyang Mao, and Jia Jia. 2018. AniDance : Real-Time Dance Motion Synthesize to the Song. In Proceedings of 2018 ACM Multimedia Conference, Seoul, Republic of Korea, October 22–26, 2018 (MM ’18), 3 pages. https://doi.org/10.1145/3240508.3241388

1 INTRODUCTION
Human activities based on multisensory interaction can enhance the transmission of information and promote the development of the human brain [5]. Music and dance are closely related: dance creation is a multisensory interaction that harmoniously combines the auditory, motor, and visual senses. For example, synthesizing dance motions from music can promote people's learning ability and interest in dance and music. Although many researchers have tried, synthesizing music-oriented dance motions still faces many challenges.

∗These authors contributed equally to this work and should be considered co-first authors.
†Corresponding author.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
MM ’18, October 22–26, 2018, Seoul, Republic of Korea
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-5665-7/18/10 . . . $15.00
https://doi.org/10.1145/3240508.3241388

Figure 1: The AniDance system. The user stands in front of the screen; when he/she sings into the microphone, the characters on the screen dance with the voice.

These challenges are: 1) how to achieve more efficient multisensory interaction; 2) how to choose appropriate dance motions and make artistic enhancements to the choreography according to the music; and 3) a lack of training data.

In this paper, we mainly consider how melody changes affect the way motions are represented. In the real world, when the melody changes, the dancer tends to change not only the motion itself but also the motion's speed and intensity. Traditional dance synthesis algorithms randomly select dance motions from the database; therefore, the synthesized choreographies have little linguistic or emotional meaning. We present AniDance, a real-time dance motion synthesis driven by the singing voice and based on an LSTM-autoencoder, which helps users create lively 3D-space dance animations with their singing voice, as shown in Figure 1.

The contributions of this paper are summarized as follows: 1) We construct a music-oriented 3D-space dance motion dataset using optical motion capture equipment (Vicon), which contains 40 complete dance choreographies across four types of dance. As far as we know, this is the largest music-dance dataset, and we are willing to make it open to facilitate other related research. 2) We propose a music-oriented dance synthesis method to better understand and identify the inner patterns between acoustic features and motion features, so that the emotion of the music is reflected in the synthesized dance. This allows us to learn how people adjust their local joint posture and motion rhythm to express changes in musical emotion, as well as the rules of choosing motions in choreography.

2 DATASET
We asked professional dancers to dance with music and captured their motions using optical motion capture equipment (Vicon). In this way, we constructed a music-oriented 3D-space dance motion dataset containing 40 complete dance choreographies in four types of dance (Waltz, Tango, Cha-Cha, and Rumba).



Figure 2: The structure of the LSTM-autoencoder networks.

The dataset totals 907,200 frames and 94 minutes. As far as we know, this is the largest music-dance dataset, and we are willing to make it open to facilitate other related research.¹ The original data contained 41 joints in 3D space, and we manually picked out the 21 joints that best represent the dancers' motions as the motion features. We apply smoothing and normalization algorithms to our dataset to make the data more regular.
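As a rough illustration of this preprocessing step (the joint indices, smoothing window, root joint, and normalization scheme below are our assumptions, not the exact pipeline used for the dataset), the captured joint trajectories could be prepared as follows:

```python
import numpy as np
from scipy.ndimage import uniform_filter1d

def preprocess(motion, selected, window=5):
    """motion: (n_frames, 41, 3) captured 3D joint positions.
    selected: indices of the 21 joints kept as motion features (hypothetical choice)."""
    joints = motion[:, selected, :]                          # keep the informative joints
    joints = uniform_filter1d(joints, size=window, axis=0)   # temporal smoothing over frames
    root = joints[:, :1, :]                                  # treat the first selected joint as the root (assumption)
    joints = joints - root                                   # root-relative coordinates
    scale = np.abs(joints).max() + 1e-8
    return joints / scale                                    # normalize roughly to [-1, 1]

if __name__ == "__main__":
    demo = np.random.randn(100, 41, 3)                       # stand-in for one captured clip
    feats = preprocess(demo, selected=list(range(21)))
    print(feats.shape)                                       # (100, 21, 3)
```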

3 TECHNOLOGY
The technical part of our application contains four modules: real-time beat detection, acoustic feature extraction, motion prediction, and dance exhibition.

Real-time beat detection. Since the dance motions must step on the beats, generating dance motions in time requires predicting the exact time of the forthcoming beat. Existing research provides offline beat detection [2]. On the basis of this previous study, we built a real-time beat detection tool using several self-adaptive algorithms.

Acoustic feature extraction. An audio analysis library named librosa, proposed by McFee [4], is used for music information retrieval. We use the Mel-Frequency Cepstrum Coefficients (MFCC), constant-Q chromagram, tempogram, and onset strength provided by librosa as the acoustic features of our network.
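A minimal sketch of this feature extraction with librosa follows. The hop length and the way features are stacked per frame are our choices; only the listed feature types and the offline dynamic-programming beat tracker [2] come from the paper, and the real-time, self-adaptive variant is not reproduced here.

```python
import numpy as np
import librosa

def extract_acoustic_features(path, hop_length=512):
    """Per-frame acoustic features and (offline) beat times for one song."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, hop_length=hop_length)            # MFCC (20 x T)
    chroma = librosa.feature.chroma_cqt(y=y, sr=sr, hop_length=hop_length)    # constant-Q chromagram (12 x T)
    tempogram = librosa.feature.tempogram(y=y, sr=sr, hop_length=hop_length)  # tempogram (384 x T)
    onset = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop_length)   # onset strength (T,)

    # Trim to a common frame count and stack into a (T, 417) feature matrix.
    n = min(mfcc.shape[1], chroma.shape[1], tempogram.shape[1], onset.shape[0])
    feats = np.vstack([mfcc[:, :n], chroma[:, :n], tempogram[:, :n], onset[np.newaxis, :n]]).T

    # Offline beat tracking (dynamic-programming tracker of Ellis [2]).
    _, beat_frames = librosa.beat.beat_track(y=y, sr=sr, hop_length=hop_length)
    beat_times = librosa.frames_to_time(beat_frames, sr=sr, hop_length=hop_length)
    return feats, beat_times
```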

Motion prediction. The network we use for prediction is a combination of an autoencoder [1] and LSTM [3]; its structure is shown in Figure 2. We transform the 21 joints of the motions in our dataset into three-dimensional vectors, which are used as motion features, and train the LSTM-autoencoder network with acoustic features as input and motion features as output. In the LSTM-autoencoder network, an autoencoder reduces the dimension of the acoustic features, and a masking layer is used to prevent over-fitting. The lower-dimensional acoustic features are fed into the LSTM network, and the dance motion is then synthesized in the form of positions of the skeleton nodes.
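The paper does not give the framework, layer sizes, or loss function; the following Keras sketch is only one plausible reading of the architecture in Figure 2, with all dimensions, layer counts, and the mean-squared-error loss assumed by us:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

N_ACOUSTIC = 417   # assumed per-frame acoustic size (e.g. the 417-dim stack from the librosa sketch)
N_LATENT = 64      # assumed bottleneck size of the autoencoder
N_JOINTS = 21      # joints kept as motion features (from the paper)

# Encoder part of the autoencoder: reduces per-frame acoustic features.
acoustic_in = layers.Input(shape=(None, N_ACOUSTIC))             # (batch, time, features)
masked = layers.Masking(mask_value=0.0)(acoustic_in)             # masking layer, as mentioned in the paper
encoded = layers.TimeDistributed(layers.Dense(N_LATENT, activation="relu"))(masked)

# Temporal model: LSTM layers map encoded audio to joint positions per frame.
h = layers.LSTM(256, return_sequences=True)(encoded)
h = layers.LSTM(256, return_sequences=True)(h)
motion_out = layers.TimeDistributed(layers.Dense(N_JOINTS * 3))(h)  # (x, y, z) per joint

model = Model(acoustic_in, motion_out)
model.compile(optimizer="adam", loss="mse")   # loss choice is an assumption
model.summary()
```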

Based on the characteristics of our task, we apply several tricks to the network for better performance. Several acoustic features represent essential temporal information of the songs; we pick them out and feed them directly into the LSTM layers.
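One way to realize this bypass is to concatenate the selected temporal features with the encoded spectral features before the LSTM layers. Which features are bypassed and their sizes are our assumptions; the constants reuse the names from the sketch above.

```python
from tensorflow.keras import layers, Model

N_LATENT, N_JOINTS = 64, 21                     # same assumed sizes as in the sketch above
spectral_in = layers.Input(shape=(None, 32))    # assumed: MFCC + chromagram per frame
temporal_in = layers.Input(shape=(None, 385))   # assumed: tempogram + onset strength per frame
enc = layers.TimeDistributed(layers.Dense(N_LATENT, activation="relu"))(spectral_in)
merged = layers.Concatenate(axis=-1)([enc, temporal_in])   # temporal features bypass the encoder
h = layers.LSTM(256, return_sequences=True)(merged)
out = layers.TimeDistributed(layers.Dense(N_JOINTS * 3))(h)
bypass_model = Model([spectral_in, temporal_in], out)
```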

We use quantitative (Euclidean distance) and qualitative (user study) experiments to evaluate the performance of our network; a sketch of the quantitative metric is given below. The network that performs best in both quantitative and qualitative measurements is chosen for use in practice.

Dance exhibition. Guided by the predicted motion features, we implement skeleton animation on cartoon characters for real-time virtual dance exhibition.
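The paper does not state how the Euclidean distance is aggregated over joints and frames; a straightforward per-joint, per-frame average (our assumption) would look like this:

```python
import numpy as np

def mean_joint_distance(pred, truth):
    """Average Euclidean distance between predicted and ground-truth joints.

    pred, truth: (n_frames, n_joints, 3) arrays of 3D joint positions.
    """
    return np.linalg.norm(pred - truth, axis=-1).mean()

# Example: compare a synthesized sequence against the captured one.
pred = np.random.rand(200, 21, 3)
truth = np.random.rand(200, 21, 3)
print(f"mean joint error: {mean_joint_distance(pred, truth):.4f}")
```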

¹ https://github.com/Music-to-dance-motion-synthesis/dataset

Figure 3: The dance motions synthesized by our model according to different changes in the music. In this figure, minor changes in the music affect only subtle adjustments of the dance motions, while apparent changes in the music lead to entirely different motions.

4 DEMONSTRATION
In the demonstration, we use professional recording equipment to capture the user's voice. A microphone, an SDP karaoke effector, and closed-back headphones create an unprecedented experience for the users, and the dance animation is shown on a large screen. Different operating modes are provided in our application: users are free to choose the dance type and the dancer's speed, or even ask the dancers to follow his/her speed of singing.

We verified that AniDance can address the challenges discussed in Section 1. First, we added a short period of drumbeats to a melody. The synthesized dance did change as invoked by the drumbeats, and when the drumbeats ended, the dance returned to its original state. Then we replaced the melody with a completely new one, and the entire segment of choreography also changed. The results are shown in Figure 3, which demonstrates that AniDance is effective and efficient in synthesizing valid choreographies that are also capable of musical expression.

5 CONCLUSIONS
In this paper, we present a demo named AniDance. Based on an LSTM-autoencoder network, AniDance can synthesize dance motions with melody in real time. It takes acoustic features as input and outputs synthesized dance choreographies that follow the music with richer expression and better continuity. Our future work will focus on three major aspects: 1) acquiring more data to continuously enhance our dataset; 2) taking into consideration users' different preferences about the synthesized dances; and 3) leveraging our model to build various applications.

More examples of dance motions synthesized from music are shown in the video of the supplementary material and in Dance with Melody: An LSTM-autoencoder Approach on Music-oriented Dance Synthesis, submitted to the full-paper track of ACM Multimedia 2018.

6 ACKNOWLEDGMENTS
This work is supported by the National Key Research and Development Plan (2016YFB1001200) and the Innovation Method Fund of China (2016IM010200).



REFERENCES
[1] Pierre Baldi. 2012. Autoencoders, unsupervised learning, and deep architectures. In Proceedings of ICML Workshop on Unsupervised and Transfer Learning. 37–49.
[2] Daniel P. W. Ellis. 2007. Beat tracking by dynamic programming. Journal of New Music Research 36, 1 (2007), 51–60.
[3] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[4] Brian McFee, Colin Raffel, Dawen Liang, Daniel P. W. Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. 2015. librosa: Audio and music signal analysis in Python. In Proceedings of the 14th Python in Science Conference. 18–25.
[5] F. H. Rauscher, G. L. Shaw, and K. N. Ky. 1995. Listening to Mozart enhances spatial-temporal reasoning: towards a neurophysiological basis. Neuroscience Letters 185, 1 (1995), 44–47.