
HandSonor: A Customizable Vision-based Control Interface for Musical Expression

Srinath Sridhar
MPI Informatik and Universität des Saarlandes
Campus E1.4, 66123 Saarbrücken, Germany
[email protected]

Figure 1: HandSonor provides users with a customizable vision-based control interface for musical expression. Using multiple cameras, the articulated 3D pose of the hand is estimated. Users can map parametrized hand motion to instrument parameters and produce music just by moving their hand.

Copyright is held by the author/owner(s).
CHI 2013 Extended Abstracts, April 27–May 2, 2013, Paris, France.
ACM 978-1-4503-1952-2/13/04.

Abstract
The availability of electronic audio synthesizers has led to the development of many novel control interfaces for music synthesis. The importance of the human hand as a communication channel makes it a natural candidate for such a control interface. In this paper I present HandSonor, a novel non-contact and fully customizable control interface that uses the motion of the hand for music synthesis. HandSonor uses images from multiple cameras to track the realtime, articulated 3D motion of the hand without using markers or gloves. I frame the problem of transforming hand motion into music as a parameter mapping problem for a range of instruments. I have built a graphical user interface (GUI) to allow users to dynamically select instruments and map the corresponding parameters to the motion of the hand. I present results of hand motion tracking, parameter mapping and realtime audio synthesis which show that users can play music using HandSonor.

Author Keywords
Musical expression; control interface; hand motion capture; computer vision; music synthesis

ACM Classification Keywords
H.5.2 [User Interfaces]: Auditory (non-speech) feedback; I.4.8 [Scene Analysis]: Tracking.


Sidebar: Related Work – Music Control Interfaces

Theremin: A non-contact, continuous-sound musical instrument that is controlled by adjusting the distance of the hands from two antennae. The theremin is suitable only for generating continuous sounds.

The reacTable [4]: A tangible tabletop musical instrument that tracks marked pucks on a table to synthesize sounds. This device is more suited for creating electronic music.

Gesture-based Control: There are many systems that use marker-based optical tracking of the hands to control music [9, 3]. Markers restrict the free movement of the hand.

Sidebar: Related Work – Markerless Motion Tracking

Body Motion Tracking: There are numerous techniques for markerless tracking of the articulated 3D motion of humans. Some use multiple cameras [8] while others use just depth sensors [1, 7].

Hand Motion Tracking: There exist articulated hand motion tracking techniques that use discriminatively learnt image features [2] or image and depth data together [6]. Special depth sensors have also been developed for this purpose (e.g. Intel Gesture Camera, Leap Motion). However, none of these methods or sensors provide a kinematic skeleton at interactive rates as HandSonor does.

Introduction

Electronic audio synthesizers are useful tools that provide musicians and artists with a way to mimic real instruments or even create new timbres. Typically, synthesizers are controlled using conventional instrument interfaces (e.g. electric guitar). Conventional interfaces enforce a particular way of playing the instrument, which has led musicians to explore new physical and software control interfaces for synthesizers.

In HCI literature, many new control interfaces for musical expression, including gesture-based control, have been proposed [4, 9] (also see sidebar). Controlling music using articulated hand motion (gestures) is intuitive for users since almost all musical instruments are controlled by the hand. Moreover, the hand is an important communication channel, with studies showing that the upper bound for the information capacity of the hand is 150 bits per second (bps) [5]. However, most current approaches provide rigid control interfaces that cannot be customized by the user. This makes them no different from conventional instrument interfaces.

In this paper I introduce HandSonor, a novel, fully customizable, non-contact control interface for playing musical instruments using hand motion. HandSonor allows users to completely customize the correspondence between their hand motions and the music synthesized. This provides users with a unique opportunity to explore new forms of musical expression.

There are numerous challenges that need to be overcome in designing interfaces that use hand motion for musical control. The first is fast and robust tracking of the hand with minimum latency. Some systems use marker-based optical tracking [3] but require the user to wear gloves that restrict the free motion of the hands. The second challenge is the design of appropriate mapping schemes between hand motion and music. The theremin (see sidebar) maps distances to the pitch and volume of a sound wave. The final challenge is the selection of appropriate instruments to control. HandSonor addresses many of these challenges. The key contributions and benefits of my approach are listed below.

• A markerless hand motion tracking system that uses images from multiple cameras to estimate the articulated 3D pose of the hand in the form of a kinematic skeleton. This system does not restrict the motion of the hand by requiring users to wear gloves or markers.

• A parameter mapping system for creating correspondences between parameters of the kinematic skeleton and parameters of common musical instruments. Users can customize the control interface using a graphical user interface (GUI) to best suit their style and need.

• A realtime music synthesis system that can model many different kinds of musical instruments. My approach allows users to explore different musical instruments and is not restricted to any one type.

In the following sections I describe the various components of HandSonor in detail. I provide results of hand tracking, parameter mapping and music synthesis. I plan to conduct a comprehensive user evaluation of HandSonor but present results from a pilot study in this paper. In order to save space, a review of related work is provided in the sidebar on the left.

System Description
Figure 2 gives an overview of the HandSonor pipeline. HandSonor operates in two phases – offline and online.


Figure 2: HandSonor Pipeline.

In the offline phase, the parameter mapping system is used to create mapping schemes between the skeleton pose parameters (kinematic chain) and the instrument parameters. In the online phase, the defined mapping schemes are used along with realtime tracking of the articulated 3D hand pose and instrument synthesizers to generate appropriate sounds.
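To make the online phase concrete, the following minimal sketch shows the shape of that loop: track a frame, push the pose parameters through the user-defined mapping scheme, and feed the result to the synthesizer. The names (run_online_phase, tracker.update, mapping_scheme.apply, synthesizer.play) are placeholders for the components described in the following sections, not HandSonor's actual API.

```python
def run_online_phase(cameras, tracker, mapping_scheme, synthesizer):
    """Sketch of the online phase: camera frames -> hand pose -> instrument parameters -> audio.

    All four arguments are stand-ins for components described in the paper;
    only the control flow is meant to be illustrative.
    """
    theta = tracker.initial_pose()             # Theta: 35 skeleton pose parameters
    while cameras.has_frames():
        images = cameras.capture()             # synchronized multi-camera frames
        theta = tracker.update(theta, images)  # articulated 3D hand pose (online tracking)
        psi = mapping_scheme.apply(theta)      # offline-defined mapping: Theta -> Psi
        synthesizer.play(psi)                  # realtime audio synthesis (STK-based)
```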

Figure 3: A hand model consisting of a kinematic chain and 3D SoG.

Hand Motion Tracking System
The hand motion tracking technique I use is based on the full body motion tracking system described by [8]. This is a model-based approach for tracking articulated motion using multiple, calibrated and synchronized cameras. The model consists of a kinematic skeleton and the volume of the body approximated by a sum of 3D Gaussians (3D SoG).

In order to use this method for tracking hand motion, I built a kinematic skeleton of the hand consisting of 30 joints and represented by 35 parameters, Θ = {θ_i}, where 0 ≤ i ≤ 34 (29 joint angles, 3 global rotations and 3 global translations). Each joint angle is limited to a fixed range, θ_i ∈ [l_i^l, l_i^h], taken from anatomical studies of the hand. The hand model also contains 49 3D Gaussians attached to the kinematic skeleton, each with a fixed mean, variance and colour model. For each different hand to be tracked, a person-specific hand model needs to be constructed. One such hand model used in tracking is shown in Figure 3.
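As a concrete illustration of this parametrization, the sketch below stores the 35-dimensional pose vector Θ together with per-parameter limits and clamps candidate poses to them. The limit values are placeholders, not the anatomical ranges used in HandSonor.

```python
import numpy as np

# 3 global translations + 3 global rotations + 29 joint angles = 35 parameters.
N_PARAMS = 35

# Lower/upper bounds l_i^l, l_i^h for every parameter theta_i (placeholder values).
lower = np.full(N_PARAMS, -np.pi)
upper = np.full(N_PARAMS, np.pi)
lower[:3], upper[:3] = -1.0, 1.0          # global translation range in metres (illustrative)

def clamp_to_limits(theta: np.ndarray) -> np.ndarray:
    """Project a candidate pose onto the allowed box [l^l, l^h]."""
    return np.clip(theta, lower, upper)

theta = np.zeros(N_PARAMS)                # neutral pose
theta = clamp_to_limits(theta + 0.1)      # e.g. after an optimization step
```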

The specific details of the tracking procedure are beyond the scope of this paper, but I provide a brief overview here [8]. The frames from each camera are converted into a 2D SoG representation using a quad-tree. The parameters of the hand model are optimized so that a similarity measure between the 2D SoG images and the 3D SoG of the hand model is maximized. An objective function is defined that takes into account the similarity measure and the joint limits. A modified gradient ascent optimization of the objective function allows fast computation of the parameters.
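The per-frame update can be pictured as a short ascent on the similarity measure, warm-started from the previous frame. The sketch below shows that idea only; the paper folds the joint limits into the objective and uses a modified gradient ascent, whereas this simplified version projects onto the limits after each step, and similarity_grad stands in for the SoG similarity gradient, which is not implemented here.

```python
import numpy as np

def track_frame(theta_prev, images, similarity_grad, lower, upper,
                step=0.05, iters=30):
    """One-frame pose update sketching the gradient-ascent scheme described above.

    similarity_grad(theta, images) is assumed to return the gradient of the
    2D-SoG / 3D-SoG similarity measure; it is supplied by the tracker.
    """
    theta = theta_prev.copy()                            # warm-start from previous frame
    for _ in range(iters):
        g = similarity_grad(theta, images)
        theta = np.clip(theta + step * g, lower, upper)  # ascend, stay within joint limits
    return theta
```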

Musical Instrument Modelling and Synthesis
Synthetic models of musical instruments are usually represented by a number of parameters, n_inst, denoted by Ψ = {ψ_j}, where 0 ≤ j < n_inst. Each ψ_j can take continuous or discrete values depending on the type of instrument. For synthesizing the sounds of musical instruments I use the Synthesis Toolkit (STK)¹, which provides synthesizers for common musical instruments.

Parameter Mapping System
The goal of the parameter mapping system is to create a mapping scheme between hand model parameters and instrument parameters.

¹ https://ccrma.stanford.edu/software/stk/


Figure 4: A screenshot of the GUI showing different components.

For the purposes of parameter mapping I classify instrument parameters into two types – continuous and discrete. Continuous instrument parameters take continuous values within a specified range. Discrete instrument parameters take values from a discrete set and can be boolean. Most common instruments can be represented by either continuous (e.g. violin), discrete (e.g. drum), or both continuous and discrete parameters (e.g. guitar).

Figure 5: Mapping functions – linear (red), logarithmic (green) and logistic (blue).

Mapping Continuous Instrument Parameters
Continuous instrument parameters, ψ_j, take continuous values over a fixed range, [c_j^l, c_j^h]. This allows a natural mapping to be defined with the hand skeleton parameters, which are continuous too. Formally, the mapping can be written as

    ψ_j^k = f(θ_i)  ∀ k, j,    (1)

where f(·) denotes the mapping function, k is the instrument index, and ψ_j^k is the j-th parameter of the k-th instrument. Figure 5 shows some of the mapping functions used in my system. Different mapping functions are needed because ψ_j^k for real instruments are distributed unevenly. Some parameters like volume can be controlled linearly, while others like pitch work well with a sigmoid mapping function. Notes in different octaves can be mapped with a logarithmic function, corresponding to the logarithmic increase in note frequencies.
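The mapping functions of Figure 5 can be viewed as scalar transfer functions from a skeleton parameter's range onto an instrument parameter's range. The sketch below shows one hypothetical way to implement linear, logarithmic and logistic variants; the parameter ranges and the steepness constant are illustrative assumptions, not values from the paper.

```python
import numpy as np

def normalize(theta, t_lo, t_hi):
    """Map a skeleton parameter theta from [t_lo, t_hi] onto [0, 1]."""
    return np.clip((theta - t_lo) / (t_hi - t_lo), 0.0, 1.0)

def map_linear(theta, t_rng, c_rng):
    lo, hi = c_rng
    return lo + normalize(theta, *t_rng) * (hi - lo)

def map_logarithmic(theta, t_rng, c_rng):
    """Spaces octaves evenly: equal hand displacement per doubling of frequency."""
    lo, hi = c_rng
    return lo * (hi / lo) ** normalize(theta, *t_rng)   # geometric interpolation

def map_logistic(theta, t_rng, c_rng, steepness=10.0):
    """Sigmoid mapping: flat near the ends of the range, sensitive in the middle."""
    lo, hi = c_rng
    x = normalize(theta, *t_rng)
    s = 1.0 / (1.0 + np.exp(-steepness * (x - 0.5)))
    return lo + s * (hi - lo)

# Example: map a 0.12 m wrist translation onto a 220-880 Hz pitch range.
pitch = map_logarithmic(0.12, t_rng=(0.0, 0.3), c_rng=(220.0, 880.0))
```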

Mapping Discrete Instrument Parameters
One possible way to represent discrete instrument parameters, ψ_j, is to let them take discrete values such as musical notes. For instance, a piano with n_notes notes can be modelled using just one parameter (n_inst = 1) that takes n_notes discrete values. However, the mapping scheme becomes much easier if each note were represented by a boolean parameter. In this case a piano would be modelled by n_inst = n_notes parameters. The mapping can then be written as

    ψ_j^k = 1_T(Θ_s)  ∀ k, j,    (2)

where 1_T is the indicator function, activated when Θ_s, a subset of Θ, lies within a user-defined activation region T. It is important to note here that the mapping is no longer one-to-one but maps one or more skeleton parameters to each instrument parameter. All examples shown in this paper use only translational parameters of the skeleton for mapping with discrete instrument parameters.
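Under this boolean-parameter view, each discrete instrument parameter is driven by an indicator function over an activation region. A minimal sketch of that idea follows; the region layout (one axis-aligned box per piano note along the x axis of a fingertip position) is a hypothetical example, not the regions used in the paper.

```python
import numpy as np

def in_region(theta_subset, region_lo, region_hi):
    """Indicator 1_T: true when the chosen skeleton parameters lie inside box T."""
    p = np.asarray(theta_subset)
    return bool(np.all(p >= region_lo) & np.all(p <= region_hi))

# Hypothetical example: eight piano notes laid out as boxes along the x axis
# of a fingertip translation (values in metres are illustrative only).
note_regions = [((0.05 * i, -0.05, 0.0), (0.05 * (i + 1), 0.05, 0.1)) for i in range(8)]

fingertip = (0.13, 0.01, 0.04)        # e.g. fore ee tx, ty, tz from the tracker
active_notes = [j for j, (lo, hi) in enumerate(note_regions)
                if in_region(fingertip, lo, hi)]
```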

GUI for Mapping
In order for users to easily create new mapping schemes, I provide a GUI. Figure 4 shows a screenshot of it being used. It consists of a central display that shows a 3D model of the hand skeleton. The right pane of the window lists all the parameters of the skeleton model along with a descriptive name. When a particular parameter is chosen from the list, the corresponding joint in the skeleton is highlighted. The left pane lists available instruments and their parameters. Depending on the type of instrument parameter chosen, the bottom panel displays either the mapping functions available to the user (continuous) or an option to specify an activation region (discrete).

Technical Assessment

Figure 6: Three frames from a sequence of hand motion being tracked.

My experimental setup is shown in Figure 1. I used four cameras for tracking hand motion. Figure 6 shows three frames from a video sequence of the hand being tracked. The tracker is robust to translations and articulations, but lighting conditions and background clutter do affect its performance since it uses colour information. In my setup I used a black background to improve the robustness of tracking. The hand motion tracking system runs at an interactive rate of 17 frames per second (fps) on a desktop computer with 8 GB RAM and an Intel Core i3 CPU. The audio synthesizer based on STK takes a negligible amount of time. I measured the latency from the onset of the user's motion to the corresponding audio response to be between 30 and 60 ms. Since the latency in my system is primarily due to the hand tracking system, I intend to use GPU acceleration to reduce the computation time and consequently the latency. I have set up a dedicated webpage² that contains videos of users playing using HandSonor.

As an example for users I created a few sample parametermapping schemes as shown in Table 1.

² www.mpi-inf.mpg.de/~ssridhar/handsonor/

Instrument | Parameter Name | Skeleton Parameter | Mapping Function
Theremin | Pitch | root tx | Logarithmic
Theremin | Volume | root tz | Linear
Piano | Notes 1-8 in Octave-0 | fore ee tx, fore ee ty and fore ee tz | User-defined activation region
Drumkit2 | Cymbal & Bass | fore ee tx, fore ee ty and fore ee tz | User-defined activation region

Table 1: Sample mapping schemes. The names of skeleton parameters are abbreviated based on their finger and joint anatomy.
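As a concrete reading of Table 1, the snippet below expresses the two theremin rows with the map_linear and map_logarithmic sketches from the parameter mapping section above; the translation, frequency and volume ranges are illustrative assumptions, not values from the paper.

```python
# Hypothetical worked example for the theremin rows of Table 1, reusing the
# map_linear / map_logarithmic sketches above. All ranges are illustrative only.
theta_root_tx = 0.18   # global x translation of the hand (metres), from the tracker
theta_root_tz = 0.07   # global z translation of the hand (metres), from the tracker

pitch_hz = map_logarithmic(theta_root_tx, t_rng=(0.0, 0.3), c_rng=(110.0, 880.0))
volume   = map_linear(theta_root_tz, t_rng=(0.0, 0.2), c_rng=(0.0, 1.0))
```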

Pilot Study
In order to understand whether HandSonor meets its objective of letting users customize their music creation experience, I intend to conduct a comprehensive user evaluation. Currently, I have conducted a pilot study with 4 subjects to understand user perception about HandSonor. All subjects were right-handed and three of them had previous experience playing at least one musical instrument. The study was conducted in two sessions. A user-specific right hand model was also created for all of them in a manual process.

Session 1 (Playing Task): The goal of the first session was to assess whether users could play music using pre-existing mapping schemes from Table 1. For each mapping scheme, users were given the task of playing Alle Meine Entchen, a common German children's song. At the end of this session users filled in a questionnaire with Likert items and open-ended questions. I found that users generally preferred to play instruments with discrete parameters over those with continuous parameters.

Session 2 (Exploration Task): The goal of the second session was to evaluate whether users were able to use my parameter mapping system to create new mapping schemes. I gave users a fixed amount of time (30 minutes) to come up with a new mapping scheme and play some music using it. As in the first session, users filled in a questionnaire. All users successfully created new mapping schemes consisting of multiple instruments, but only one was able to play music with it in the given time. Table 2 shows sample multi-instrument combination mappings that users created using HandSonor.

User | Mapping
User 1 | Flute, theremin and drumkit
User 2 | Piano and Theremin
User 3 | Flute and drumkit
User 4 | Piano with new activation regions

Table 2: New instrument combinations created by users. The corresponding skeleton parameters are not shown.

Conclusion and Future Work

In this paper I presented HandSonor, a new customizable, non-contact control interface for musical expression. Results from hand motion tracking show that it meets the requirements for music synthesis in terms of robustness and latency. The pilot study indicates that users like the general idea of using hand motion for synthesizing music. They also found the parameter mapping system intuitive and were able to create new mapping schemes.

There still remain challenges that need to be overcome to improve HandSonor. By combining depth information with images, the hand motion tracking system can be made faster and more robust to lighting conditions and background clutter. As a result of the pilot study, I gained valuable insights from users on improvements that could be made to the parameter mapping system and GUI. More instrument models could be incorporated into the audio synthesis system. Finally, I would also like to conduct a thorough user evaluation of HandSonor.

Acknowledgements
I would like to thank my supervisors Prof. Dr. Christian Theobalt and Dr. Antti Oulasvirta for their guidance. I would also like to thank Anna Feit and Thomas Helten.

References
[1] Baak, A., Müller, M., Bharaj, G., Seidel, H.-P., and Theobalt, C. A data-driven approach for real-time full body pose reconstruction from a depth camera. In IEEE ICCV 2011 (Nov. 2011), 1092–1099.
[2] Ballan, L., Taneja, A., Gall, J., Van Gool, L., and Pollefeys, M. Motion capture of hands in action using discriminative salient points. In ECCV 2012, vol. 7577 (2012), 640–653.
[3] Dobrian, C., and Bevilacqua, F. Gestural control of music: using the Vicon 8 motion capture system. In NIME '03 (2003), 161–163.
[4] Jordà, S., Geiger, G., Alonso, M., and Kaltenbrunner, M. The reacTable: exploring the synergy between live music performance and tabletop tangible interfaces. In Proceedings of Tangible and Embedded Interaction, TEI '07 (2007), 139–146.
[5] Mao, Z.-H., Lee, H.-N., Sclabassi, R., and Sun, M. Information capacity of the thumb and the index finger in communication. 1535–1545.
[6] Oikonomidis, I., Kyriazis, N., and Argyros, A. Efficient model-based 3D tracking of hand articulations using Kinect. In BMVC 2011 (2011), 101.1–101.11.
[7] Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., and Blake, A. Real-time human pose recognition in parts from single depth images. In IEEE CVPR 2011 (June 2011), 1297–1304.
[8] Stoll, C., Hasler, N., Gall, J., Seidel, H., and Theobalt, C. Fast articulated motion tracking using a sums of Gaussians body model. In IEEE ICCV 2011 (Nov. 2011), 951–958.
[9] Wanderley, M., and Depalle, P. Gestural control of sound synthesis. 632–644.