


DISCOVERY OF SYLLABIC PERCUSSION PATTERNS IN TABLA SOLO RECORDINGS

Swapnil Gupta*, Ajay Srinivasamurthy*, Manoj Kumar†, Hema A. Murthy†, Xavier Serra*

*Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain
†DONlab, Indian Institute of Technology Madras, Chennai, India

ABSTRACT

We address the unexplored problem of percussion pattern discovery in Indian art music. Percussion in Indian art music uses onomatopoeic oral mnemonic syllables for the transmission of repertoire and technique. This is utilized for the task of percussion pattern discovery from audio recordings. From a parallel corpus of audio and expert curated scores for 38 tabla solo recordings, we use the scores to build a set of the most frequent syllabic patterns of different lengths. From this set, we manually select a subset of musically representative query patterns. To discover these query patterns in an audio recording, we use syllable-level hidden Markov models (HMMs) to automatically transcribe the recording into a syllable sequence, in which we search for the query pattern instances using a Rough Longest Common Subsequence (RLCS) approach. We show that the use of RLCS makes the approach robust to errors in automatic transcription, significantly improving the pattern recall rate and F-measure. We further propose possible enhancements to improve the results.

1. INTRODUCTION

In many music cultures, music is sometimes transmitted partly through speech, using what are variously called vocables, oral mnemonics, solfège, etc. [12]. In several percussion traditions, the choice of vowels and consonants is such that the syllables closely mimic the underlying acoustic phenomenon they represent. The term acoustic-iconic mnemonic systems, coined by Hughes [12], describes such mnemonic based syllable systems, whose core aspect is the similarity of the phonetic features of the syllables to the acoustic properties of the sounds they represent. A well studied example of such a system is the tabla, where the repertoire and technique are transmitted with the help of a system based on onomatopoeic oral syllables [18]. In this paper, we explore the use of the mnemonic syllable system of tabla for the discovery of percussion patterns. The use of these mnemonics allows us to work with a

© Swapnil Gupta, Ajay Srinivasamurthy, Manoj Kumar, Hema A. Murthy, Xavier Serra. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Swapnil Gupta, Ajay Srinivasamurthy, Manoj Kumar, Hema A. Murthy, Xavier Serra. "Discovery of Syllabic Percussion Patterns in Tabla Solo Recordings", 16th International Society for Music Information Retrieval Conference, 2015.

musically relevant representation that truly reflects the underlying timbre, articulation and dynamics of the patterns played.

Automatic discovery of patterns is a relevant Music Information Retrieval (MIR) task. It has applications in enriched and informed music listening, enhanced appreciation for listeners, in music training, and in aiding musicologists working on such music cultures. We use the onomatopoeic oral mnemonic syllables to represent, transcribe and search for patterns in audio recordings of tabla solos. We first build a set of query patterns from the corpus of scores in our dataset. Given an audio recording, we automatically transcribe it into a sequence of syllables. We then propose a method for searching for the query patterns in the automatically transcribed score using approximate string search. We also propose several extensions to improve the search performance. We first provide a brief introduction to tabla.

1.1 Tabla and its solo performances

Tabla is the main rhythm accompanying instrument in Hindustani music, the art music tradition from North India. It consists of two drums: a left hand bass drum called the bāyān or diggā and a right hand drum called the dāyān that can produce a variety of pitched sounds [15]. To showcase the nuances of the tāl (the rhythmic framework of Hindustani music) as well as the skill of the percussionist, Hindustani music performances feature tabla solos. A tabla solo is intricate and elaborate, with a variety of pre-composed forms used for developing further elaborations. There are specific principles that govern these elaborations [10, p. 42]. Musical forms of tabla such as the ṭhēkā, kāyadā, palaṭā, rēlā, pēśkār and gaṭ are part of the solo performance and have different functional and aesthetic roles within it.

Playing the tabla is taught and learned through the use of onomatopoeic oral mnemonic syllables called bōl, which are vocal syllables corresponding to the different timbres that can be produced on the tabla. However, several bōls correspond to the same stroke played on the tabla, creating a many-bōls-to-one-timbre mapping, which can be exploited to discover acoustically similar patterns. Though the primary function of the bōls is to provide a representation system, a rhythmic vocal recitation of the bōls, which requires great skill, is also inserted into solo performances for musical appreciation.

Tabla has different stylistic schools called gharānās. The repertoires of the major gharānās of tabla differ in aspects such as the use of specific bōls, the dynamics of strokes, ornamentation and rhythmic phrases [4, p. 60]. But there are also many similarities, since the same forms and standard phrases reappear across these repertoires [10, p. 52]. This enables the creation of a library of standard phrases or patterns across compositions of different gharānās.

Sym. | bōls
DA   | D, DA, DAA
NA   | N, NA, TAA, TU
KI   | KA, KAT, KE, KI, KII
DIN  | DI, DIN, DING, KAR, GHEN
GE   | GA, GHE, GE, GHI, GI
KDA  | KDA, KRA, KRI, KRU
TA   | TA, TI, RA
TIT  | CHAP, TIT

Table 1: The bōls used in tabla, their grouping, and the symbol we use for each syllable group in this paper. The symbols DHA, DHE, DHET, DHI, DHIN, RE, TE, TII, TIN, TRA have a one-to-one mapping with a syllable of the same name and hence are not shown in the table.

1.2 Previous Work

Early research related to tabla focused mainly on stroke transcription, as seen in the work of Gillet [9]. Chordia [6] extended this work, adding additional features and classifiers and using a larger and more diverse dataset. The use of tabla syllables in a predictive model for tabla stroke sequences was also demonstrated recently by Chordia et al. [7]. Recent work in transcription has been reported for the mridangam, the percussion accompaniment used in South Indian Carnatic music, by Kuriakose et al. [13] and Anantapadmanabhan et al. [1]. The transcription task has a definite analogy to speech recognition, and we can apply several tools and much knowledge from this well explored research area with many state of the art algorithms and systems [11].

There is significant literature on pattern search and retrieval from percussion solos. Nakano et al. [16] address the problem of drum pattern retrieval using an HMM based approach with onomatopoeia as the representation for drum patterns, retrieving known fixed sequences from a library of drum patterns with snare and bass drums. We use a similar approach, the main difference being that we use a musically well grounded syllabic representation. Recently, Srinivasamurthy et al. [20] demonstrated the use of syllable level HMMs followed by a string edit distance to transcribe and classify percussion patterns in Beijing opera. Tsunoo et al. [21] also demonstrated a music classification task using K-means clustering of bar-long percussive patterns and bass lines extracted using one-pass dynamic programming. While the last two approaches aim at classification of patterns, we address the general task of retrieving patterns from recordings of full length solo compositions.

Transcription is often inaccurate with many errors, and any pattern search on transcribed data needs to use approximate string search algorithms. There are several attempts to deal with search in symbolic sequences [22]. Well explored techniques such as the longest common subsequence (LCS) do not consider the local correlation while searching for a subsequence [14]. To overcome this limitation, Lin et al. [14] proposed a novel Rough Longest Common Subsequence (RLCS) method for music matching. Dutta et al. [8] used a modified version of RLCS for motif spotting in the ālāpanas of Carnatic music. We propose to use a similar approach with minor modifications to suit the symbolic domain specific to our use case. To the best of our knowledge, this is the first work to explore syllabic pattern discovery as applied to tabla solos in Hindustani music.

2. PROBLEM FORMULATION

We formulate the problem of discovery of percussion patterns in tabla solo recordings and present a general framework for the task, while outlining some of the challenges. The approach we explore in this paper is to use syllables to define, transcribe, and eventually search for percussion patterns. We build a fixed set of syllabic query patterns. Given an audio recording, we obtain a time-aligned syllabic transcription using syllable level timbral models. For each of the query patterns in the set, we then perform an approximate search on the output transcription to obtain the locations of the patterns in the audio recording. We describe each of the steps in detail.

We first compile a comprehensive set of syllables in tabla. Although the syllables vary marginally within and across gharānās, several bōls can represent the same stroke on the tabla. To address this, we grouped the full set of 41 syllables into timbrally similar groups, resulting in a reduced set of 18 syllable groups, as shown in Table 1. Though each syllable on its own has a functional role, this timbral grouping is presumed to be sufficient for the discovery of percussion patterns. For the remainder of the paper, we limit ourselves to the reduced set of syllable groups and use them to represent patterns. For convenience, when it is clear from the context, we refer to the syllable groups simply as syllables and denote them by the symbols in Table 1. Further, we use bōls and syllables interchangeably. Let the set of syllables be denoted as S = {S_1, S_2, ..., S_M}, M = 18.
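The grouping of Table 1 can be represented directly in code. The sketch below builds the bōl-to-group mapping from the table's data (the helper names are ours, not from the paper):

```python
# Groups of timbrally similar bols and their group symbols, per Table 1.
BOL_GROUPS = {
    "DA":  ["D", "DA", "DAA"],
    "NA":  ["N", "NA", "TAA", "TU"],
    "KI":  ["KA", "KAT", "KE", "KI", "KII"],
    "DIN": ["DI", "DIN", "DING", "KAR", "GHEN"],
    "GE":  ["GA", "GHE", "GE", "GHI", "GI"],
    "KDA": ["KDA", "KRA", "KRI", "KRU"],
    "TA":  ["TA", "TI", "RA"],
    "TIT": ["CHAP", "TIT"],
}
# The remaining ten syllables map one-to-one onto groups of the same name.
ONE_TO_ONE = ["DHA", "DHE", "DHET", "DHI", "DHIN",
              "RE", "TE", "TII", "TIN", "TRA"]

# Flatten into a single bol -> group symbol lookup (41 bols, 18 groups).
BOL_TO_GROUP = {b: g for g, bols in BOL_GROUPS.items() for b in bols}
BOL_TO_GROUP.update({b: b for b in ONE_TO_ONE})

def reduce_score(bols):
    """Map a score (a sequence of bols) onto the reduced 18-symbol alphabet."""
    return [BOL_TO_GROUP[b] for b in bols]
```

With this reduction, every score and every transcribed sequence is expressed over the same 18-symbol alphabet before any pattern counting or matching.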

A percussion pattern is not well defined and varied definitions can exist. Here, we use a simplistic definition of a pattern as a sequence of syllables: P_k = [s_1, s_2, ..., s_{L_k}], where s_i ∈ S and L_k is the length of P_k. Though for defining patterns it is important to consider the relative and absolute durations of the constituent syllables, as well as the metrical position of the pattern in the tāl, we use this simple definition and leave a more comprehensive one for future work. In this paper, we take a data driven approach to build a set of K query patterns, P = {P_1, P_2, ..., P_K}.

Given an audio recording x[n], it is first transcribed into a sequence of time-aligned syllables, T_x = [(t_1, s_1), (t_2, s_2), ..., (t_{L_x}, s_{L_x})], where t_i is the onset time of syllable s_i. The task of syllabic transcription has a significant analogy to connected word speech recognition using word models: syllables are analogous to words and a percussion pattern to a sentence, a sequence of words. Finally, given a query pattern P_k of length L_k, we search for the pattern in the output syllabic transcription T_x, to retrieve the subsequences p_k^(n) in T_x (n = 1, ..., N_k) that match the query, where

386 Proceedings of the 16th ISMIR Conference, Malaga, Spain, October 26-30, 2015


Figure 1: The block diagram of the approach (training: feature extraction, isolated stroke models, embedded model re-estimation; testing: feature extraction, Viterbi decoding, RLCS search of the transcription against the pattern library).

N_k is the number of retrieved matches for P_k. We use p_k^(n) and the corresponding onset times from T_x to extract the audio segments corresponding to the retrieved syllabic patterns. Syllabic transcription is often not exact and can contain common transcription errors such as insertions, substitutions and deletions; to handle these, we need an approximate search algorithm.

3. DATASET

To evaluate our approach to percussion pattern discovery, we need a parallel corpus with time-aligned scores and audio recordings. These are useful both for building isolated stroke timbre models and for a comprehensive evaluation of the approach. We built a dataset comprising audio recordings, scores and time-aligned syllabic transcriptions of 38 tabla solo compositions of different forms in tīntāl (a metrical cycle of 16 time units). The compositions were obtained from the instructional video DVD Shades Of Tabla by Pandit Arvind Mulgaonkar. Out of the 120 compositions in the DVD, we chose 38 representative compositions spanning all the gharānās of tabla (Ajrada, Benaras, Dilli, Lucknow, Punjab, Farukhabad). The booklet accompanying the DVD provides a syllabic transcription for each composition. We used Tesseract [19], an open source Optical Character Recognition (OCR) engine, to convert the printed scores to a machine readable format. The scores obtained from OCR were manually verified and corrected for errors, adding the vibhāgs (sections) of the tāl to the syllabic transcription. The score for each composition has additional metadata describing the gharānā, the composer and its musical form.

We extracted audio from the DVD video and segmented the audio for each composition from the full audio recording. The audio recordings are stereo, sampled at 44.1 kHz, and have a soft harmonium accompaniment. A time-aligned syllabic transcription for each score and audio file pair was obtained using a spectral flux based onset detector [3] followed by manual correction by the authors. The dataset contains about 17 minutes of audio with over 8200 syllables. The dataset is freely available for research purposes through a central online repository.

ID | Pattern                                                           | L  | Count
1  | DHE, RE, DHE, RE, KI, TA, TA, KI, NA, TA, TA, KI, TA, TA, KI, NA  | 16 | 47
2  | TA, TA, KI, TA, TA, KI, TA, TA, KI, TA, TA, KI, TA, TA, KI, TA    | 16 | 10
3  | TA, KI, TA, TA, KI, TA, TA, KI                                    | 8  | 61
4  | TA, TA, KI, TA, TA, KI                                            | 6  | 214
5  | TA, TA, KI, TA                                                    | 4  | 379
6  | KI, TA, TA, KI                                                    | 4  | 450
7  | TA, TA, KI, NA                                                    | 4  | 167
8  | DHA, GE, TA, TA                                                   | 4  | 97

Table 2: Query patterns, their ID (k), length (L) and the number of instances in the dataset (total instances: 1425)

4. APPROACH

The block diagram in Figure 1 shows the overall approach. It comprises three major steps: building a set of query patterns, transcription, and search. In the following sections, we describe each of these in detail.

4.1 Building a set of query patterns

A data driven approach is taken to create a set of query patterns of lengths L = 4, 6, 8, 16. These lengths were chosen based on the structure of tīntāl for different layas (tempo classes) [4, p. 126]. Using the simple definition of a pattern as a sequence of syllables, we use the scores of the compositions to generate all the length-L patterns that occur in the score collection. We sort them by their frequency of occurrence to get an ordered set of patterns for each stated length. We then manually choose musically representative patterns from this ordered set of most commonly occurring patterns to form the set of query patterns. Table 2 shows the chosen patterns, their lengths and their counts in the dataset, leading to a total of 1425 instances. We want a diverse collection of patterns to test whether the algorithms generalize. Hence we choose patterns whose syllables have varied timbral characteristics: syllables that are harmonic (DHA), syllables played with a flam (DHE, RE) and syllables having bass (GE).
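The data driven candidate generation described above amounts to counting syllable n-grams over the score collection. A minimal sketch (function name ours):

```python
from collections import Counter

def frequent_patterns(scores, length, top_n=10):
    """Count all length-`length` syllable n-grams across a collection of
    scores and return the most frequent ones as candidate query patterns.

    scores: iterable of scores, each a list of syllable-group symbols.
    """
    counts = Counter()
    for score in scores:
        # Slide a window of the requested length over each score.
        for i in range(len(score) - length + 1):
            counts[tuple(score[i:i + length])] += 1
    return counts.most_common(top_n)
```

Running this for each L in {4, 6, 8, 16} yields the ordered candidate lists from which the musically representative query patterns of Table 2 would be picked by hand.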

4.2 Transcription

Some bōls of tabla may be pronounced with a different vowel or consonant depending on the context, without altering the drum stroke [5]. Furthermore, the bōls and the strokes vary across different gharānās, making the transcription of tabla solos challenging. To model the timbral dynamics of syllables, we build an HMM for each syllable (analogous to a word-HMM). We use these HMMs along with a language model to transcribe an input audio solo recording into a sequence of syllables.




The stereo audio is converted to mono, since there is no additional information in the stereo channels. We use MFCC features to model the timbre of the syllables. To capture the temporal dynamics of syllables, we add the velocity and acceleration coefficients of the MFCC. The 13 dimensional MFCC features (including the 0th coefficient) are computed from the audio with a frame size of 23.2 ms and a shift of 5.8 ms. We also explore the effect of energy (as measured by the 0th MFCC coefficient) on transcription performance. Hence we have two sets of features: MFCC_0_D_A, the 39 dimensional feature including the 0th, delta and double-delta coefficients, and MFCC_D_A, the 36 dimensional vector without the 0th coefficient.
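Appending velocity (delta) and acceleration (double-delta) coefficients can be sketched as below. This is an illustration only: HTK computes deltas with a regression window rather than the simple frame-to-frame gradient used here.

```python
import numpy as np

def add_deltas(mfcc):
    """Append velocity and acceleration coefficients to an MFCC matrix.

    mfcc: array of shape (n_frames, 13). Returns (n_frames, 39),
    i.e. the MFCC_0_D_A-style feature layout described in the text.
    """
    delta = np.gradient(mfcc, axis=0)    # frame-to-frame velocity
    delta2 = np.gradient(delta, axis=0)  # acceleration
    return np.hstack([mfcc, delta, delta2])
```

Dropping the first column of the input before calling this would give the 36 dimensional MFCC_D_A variant.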

Using the features extracted from the training audio recordings, we model each syllable S_u using a 7-state left-to-right HMM λ_u, 1 ≤ u ≤ U (= 18), including entry and exit non-emitting states. The emission density of each emitting state is modeled with a three component Gaussian Mixture Model (GMM) to capture the timbral variability of the syllables. We experimented with a higher number of components in the GMMs, but saw little performance improvement. We use the time-aligned syllabic transcriptions and the audio recordings in the parallel corpus to do isolated HMM training for each syllable. We then use these HMMs in an embedded model Baum-Welch re-estimation to get the final syllable HMMs.
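The left-to-right topology constrains each state to either repeat or move one state forward. A sketch of such a transition matrix (emitting states only; the non-emitting entry/exit states and the trained probabilities are omitted, and the uniform 0.5 values are just an initial guess):

```python
import numpy as np

def left_to_right_transmat(n_states=7):
    """Transition matrix for a left-to-right HMM topology: each state
    has a self-loop and a transition to the next state only."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i] = 0.5      # self-loop
        A[i, i + 1] = 0.5  # forward transition
    A[-1, -1] = 1.0        # final state absorbs until the syllable ends
    return A
```

Baum-Welch re-estimation would then update these probabilities (and the GMM emission parameters) from data while preserving the zero entries, i.e. the topology.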

Tabla solos are built hierarchically using short phrases, and hence some bōls tend to follow a given bōl more often than others. In such a scenario, a language model can improve transcription. In addition to a flat language model with uniform unigram and transition probabilities, i.e. p(s_1 = S_u) = 1/U and p(s_{i+1} = S_v | s_i = S_u) = 1/U, with 1 ≤ u, v ≤ U and i being the sequence index, we explore the use of a bigram language model learned from data.
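Estimating such a bigram model from the training scores can be sketched as follows. The add-one smoothing is our choice for the sketch; the paper does not state how unseen transitions are handled.

```python
from collections import Counter, defaultdict

def train_bigram(sequences, syllables):
    """Estimate add-one-smoothed bigram probabilities p(v | u) from
    training syllable sequences. The flat model, by contrast, assigns
    1/U to every transition regardless of the data."""
    U = len(syllables)
    counts = defaultdict(Counter)
    for seq in sequences:
        for u, v in zip(seq, seq[1:]):
            counts[u][v] += 1
    return {u: {v: (counts[u][v] + 1) / (sum(counts[u].values()) + U)
                for v in syllables}
            for u in syllables}
```

For each context syllable u, the smoothed probabilities over all v sum to one, so the model plugs directly into the syllable network used for decoding.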

For testing, we treat the feature sequence extracted from a test audio file as having been generated from a first order time-homogeneous discrete Markov chain, which can consist of any finite length sequence of syllables. From the extracted feature sequence, we use the HMMs λ_u and a syllable network constructed from the language model to do a Viterbi (forced) alignment, which provides the best sequence of syllables and their onsets T_x. All the transcription experiments were done using the HMM Toolkit (HTK) [23].

4.3 Pattern Search

The automatically transcribed output syllable sequence T_x is used to search for the query patterns. Transcription is often inaccurate, both in the sequence of syllables and in the exact onset times of the transcribed syllables. We need to handle both kinds of errors in a pattern search task from audio. In this paper, we primarily focus on the errors in the syllabic transcription. We use the syllable boundaries output by the Viterbi algorithm, without any additional post-processing. We could improve the output syllable boundaries using an onset detector [3], but we leave this to future work.

There are three main kinds of errors in the automatically transcribed syllable sequence: Insertions (I), Deletions (D), and Substitutions (B). Further, the query pattern is to be searched in the whole transcribed composition, where several instances of the query can occur. The Rough Longest Common Subsequence (RLCS) method is a suitable choice for such a case. RLCS is a subsequence search method that searches for roughly matched subsequences while retaining the local similarity [14]. We make further enhancements to RLCS to handle the I, D and B errors in transcription.

We use a modified version of the RLCS approach proposed by Lin et al. [14], with the changes proposed by Dutta et al. [8] to handle substitution errors. We propose a further enhancement to handle insertions and deletions, and explore its use in the current task. We first present a general form of RLCS and then discuss different variants of the algorithm.

Given a query pattern P_k of length L_k and a reference sequence (the transcribed syllable sequence) T_x of length L_x, RLCS uses a dynamic programming approach to compute a score matrix (of size L_x × L_k) between the reference and the query, with a rough length of match. We can use a threshold on the score matrix to obtain the instances of the query occurring in the reference. We can then use the syllable boundaries in the output transcription to retrieve the audio segment corresponding to each match.

For ease of notation, we index the transcribed syllable sequence T_x with i and the query syllable sequence P_k with j. We compute the rough and actual lengths of the subsequence matches similarly to Dutta et al. [8]. At every position (i, j), a syllable is included in the matched subsequence if d(s_i, s_j) < δ, where d(s_i, s_j) is the timbral distance between the syllables at positions i and j in the transcription and the query, respectively, and δ is the threshold distance below which two syllables are said to be equivalent. The matrices of the rough length of match (C) and the actual length of match (C_a) are updated as

C(i, j) = C(i−1, j−1) + (1 − d(s_i, s_j)) · I_d    (1)
C_a(i, j) = C_a(i−1, j−1) + I_d    (2)

where I_d is an indicator function that takes a value of 1 if d(s_i, s_j) < δ, and 0 otherwise. The matrix C thus contains the lengths of rough matches ending at all combinations of syllable positions in the reference and the query. The rough length, together with an appropriate distance measure, handles the substitution errors during transcription. To penalize insertion and deletion errors, we compute a "density" of match using two measures called the Width Across Reference (WAR) and the Width Across Query (WAQ). The WAR (R) and WAQ (Q) matrices are initialized to R_{i,j} = Q_{i,j} = 0 when i·j = 0, and propagated as

R_{i,j} = R_{i−1,j−1} + 1    if d(s_i, s_j) < δ
        = R_{i−1,j} + 1     if d(s_i, s_j) ≥ δ and C_{i−1,j} ≥ C_{i,j−1}
        = R_{i,j−1}         if d(s_i, s_j) ≥ δ and C_{i−1,j} < C_{i,j−1}    (3)

Q_{i,j} = Q_{i−1,j−1} + 1    if d(s_i, s_j) < δ
        = Q_{i−1,j}         if d(s_i, s_j) ≥ δ and C_{i−1,j} ≥ C_{i,j−1}
        = Q_{i,j−1} + 1     if d(s_i, s_j) ≥ δ and C_{i−1,j} < C_{i,j−1}    (4)

Here, R_{i,j} is the length of the substring containing the subsequence match ending at the i-th and j-th positions of the reference and the query, respectively; Q_{i,j} represents a similar measure in the query. When incremented, R_{i,j} and Q_{i,j} are incremented by 1 as formulated by Lin et al. [14], while the conditions for the increments follow Dutta et al. [8].

Using the rough length of match (C), the actual length of match (C_a), and the width measures (R and Q), we compute a score matrix Φ that incorporates penalties for substitutions, insertions and deletions, and additionally the fraction of the query matched:

Φ_{i,j} = [β · f(C_{i,j}/R_{i,j}) + (1 − β) · f(C_{i,j}/Q_{i,j})] · C_{i,j}/L_k    if C_a(i,j)/L_k ≥ ρ
        = 0    otherwise    (5)

where Φ_{i,j} is the score for the match ending at the i-th and j-th positions of the reference and the query, respectively, and f is a warping function for the rough match length densities C_{i,j}/R_{i,j} in the reference and C_{i,j}/Q_{i,j} in the query. The parameter β controls their weights in the convex combination for score computation. The term C_a(i,j)/L_k is the fraction of the query length matched and is used for thresholding the minimum fraction ρ of the query to be matched.

Starting with all combinations of i and j as the end points of the match in the reference and the query, respectively, we perform a traceback to get the starting points of the match. The RLCS algorithm outputs a match when the score exceeds a score threshold. However, with simple score thresholding we get multiple overlapping matches, from which we select the match with the highest score. If the scores of multiple overlapping matches are equal, we select the one with the lowest width (WAR). This way, we obtain the match with the highest score density. We use these non-overlapping matches and the corresponding syllable boundaries to retrieve the audio patterns.

4.3.1 Variants of RLCS

The generalized RLCS provides a framework for subsequence search. The parameters ρ, β, and δ can be tuned to make the algorithm more sensitive to different kinds of transcription errors. The variants we consider here use different distance measures d(s_i, s_j) in Eqn (1) to handle substitutions and different warping functions f(·) in Eqn (5) to handle insertions and deletions. We explore these variants for the current task and evaluate their performance.

In a default RLCS configuration (RLCS0), we only consider exact syllable matches. We set δ = 1 and use a binary distance metric based on the syllable label, i.e. d(s_i, s_j) = 0 if s_i = s_j, and 1 otherwise. Further, an identity warping function, f(y) = y, is used.

The rough length match densities can be transformed using a non-linear warping function that penalizes low density values more than higher ones, leading to another variant of RLCS (RLCS_γ). In this paper, we only explore warping functions of the form

f(y) = (e^{γy} − 1) / (e^{γ} − 1)    (6)

where γ > 0 is a parameter that controls the warping; larger values of γ lead to more deviation from the identity transformation. RLCS0 is a limiting case of RLCS_γ as γ → 0.

We hypothesize that the substitution errors in transcription are due to confusion between timbrally similar syllables. A timbral similarity (distance) measure between the syllables can thus be used to make an RLCS algorithm robust to specific kinds of substitution errors. In essence, we want to give a greater allowance for substitutions between timbrally similar syllables during RLCS matching. Computing timbral similarity is a wide area of research with many proposed methods [17], but we restrict ourselves to a basic timbral distance measure: the Mahalanobis distance between the cluster centers obtained using K-means clustering of MFCC features (with 3 clusters) from isolated audio examples of each syllable [2]. We call this variant RLCS_δ and experiment with different thresholds δ. For better reproducibility of the work in this paper, an implementation of the described RLCS variants is available.

5. EXPERIMENTS AND RESULTS

We experiment with different sets of features and language models for transcription. With the best performing transcription configuration, we experiment with different RLCS variants and report their performance. We first describe the evaluation measures used in this paper.

5.1 Evaluation measures

We use the ground truth time-aligned syllabic transcriptions to evaluate both the transcription and pattern search algorithms. We evaluate transcription performance using measures often used in speech recognition, Correctness (Corr.) and Accuracy (Accu.). Given the ground truth transcription T*x of length N, the transcribed sequence Tx, and the number of insertions, deletions and substitutions as NI, ND, and NB, respectively, we compute Corr. = (N − ND − NB)/N and Accu. = (N − ND − NB − NI)/N. The Correctness measure penalizes deletions and substitutions, while the Accuracy measure additionally penalizes insertions.
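These measures can be computed from a standard minimum-edit alignment of the transcription against the ground truth; a minimal sketch with our own helper names:

```python
def edit_op_counts(ref, hyp):
    """Insertion/deletion/substitution counts from a minimum edit
    alignment of the transcription hyp against the ground truth ref."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = (total_cost, n_ins, n_del, n_sub) for ref[:i] vs hyp[:j]
    dp = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        c, ni, nd, ns = dp[i - 1][0]
        dp[i][0] = (c + 1, ni, nd + 1, ns)               # all deletions
    for j in range(1, m + 1):
        c, ni, nd, ns = dp[0][j - 1]
        dp[0][j] = (c + 1, ni + 1, nd, ns)               # all insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            c, ni, nd, ns = dp[i - 1][j - 1]
            best = (c + sub, ni, nd, ns + sub)           # match / substitution
            c, ni, nd, ns = dp[i][j - 1]
            best = min(best, (c + 1, ni + 1, nd, ns))    # insertion
            c, ni, nd, ns = dp[i - 1][j]
            best = min(best, (c + 1, ni, nd + 1, ns))    # deletion
            dp[i][j] = best
    _, ni, nd, ns = dp[n][m]
    return ni, nd, ns

def corr_accu(ref, hyp):
    """Corr. = (N - ND - NB)/N and Accu. = (N - ND - NB - NI)/N."""
    ni, nd, ns = edit_op_counts(ref, hyp)
    n = len(ref)
    return (n - nd - ns) / n, (n - nd - ns - ni) / n
```

A transcription with one deleted and one inserted syllable relative to a four-syllable ground truth thus scores Corr. = 0.75 but Accu. = 0.5, since only Accuracy charges the insertion.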

For pattern retrieval, we do not evaluate the accuracy of boundary segmentation. Instead, we call a pattern retrieved by RLCS correctly retrieved if it has at least a 70% overlap with the pattern instance in the ground truth. To evaluate pattern search performance, we use the standard information retrieval measures precision (the ratio between the number of correctly retrieved patterns and all retrieved patterns) and recall (the ratio between the number of correctly retrieved patterns and the patterns in the ground truth). The harmonic mean of precision and recall, called the F-measure, is also reported.

5.2 Results and Discussion

The transcription results shown in Table 3 are the mean values in a leave-one-out cross validation over the dataset. We experimented with two different MFCC features (MFCC_D_A and MFCC_0_D_A) and two language models (a flat model and a bigram learnt from data). Overall, we see a best Accuracy of 53.13%, which justifies the use of a robust approximate string search algorithm for pattern retrieval. The use of a bigram language model learned from data improves the transcription performance. We see that


Proceedings of the 16th ISMIR Conference, Malaga, Spain, October 26-30, 2015 389


Feature        Corr.   Accu.
Flat language model
  MFCC_D_A     64.07   45.01
  MFCC_0_D_A   64.26   49.27
Bigram language model
  MFCC_D_A     65.53   49.97
  MFCC_0_D_A   66.23   53.13

Table 3: Transcription results showing the Correctness (Corr.) and Accuracy (Accu.) measures (in percentage) for different features and language models. In each column, the values in bold are statistically equivalent to the best result (in a paired-sample t-test at 5% significance level).

the Accuracy measure is lower than the Correctness measure, which shows that there are a significant number of insertion errors in transcription. We use the output transcriptions from the best performing combination (MFCC_0_D_A and a bigram language model) to report the performance of the RLCS variants.

To form a baseline for string search performance with the output transcriptions, we used an exact string search algorithm and report its performance in Table 4 (shown as Baseline). We see that the baseline has a precision similar to the transcription performance, but a very poor recall, leading to a poor F-measure.

To establish the optimum parameter settings for RLCS, we performed a grid search over the values of β, ρ and τ with RLCS0. β and τ are varied in the range 0 to 1. To ensure that the minimum length of the pattern matched is at least 2, we varied ρ between 1.1/min(Lk) and 1. β is the convex sum parameter for the contribution of the rough match length density of the reference and the query towards the final score. With increasing β, we give more weight to the reference length ratio, allowing more insertions. We observed a poor true positive rate with larger β, which validates the observation that insertion errors contribute a majority of the transcription errors.
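The parameter sweep described above can be enumerated as follows; the helper name and the number of steps per axis are illustrative assumptions:

```python
def parameter_grid(min_query_len, steps=8):
    """Enumerate (rho, beta, tau) settings for the grid search:
    beta and tau vary in the range 0 to 1, and rho varies from
    1.1/min_query_len (so that more than one syllable of the shortest
    query must match) up to 1."""
    rho_lo = 1.1 / min_query_len
    rhos = [rho_lo + k * (1.0 - rho_lo) / (steps - 1) for k in range(steps)]
    betas = [k / (steps - 1) for k in range(steps)]
    taus = [k / (steps - 1) for k in range(steps)]
    return [(r, b, t) for r in rhos for b in betas for t in taus]

# Each (rho, beta, tau) triple would then be scored by running the RLCS
# search and keeping the setting with the best average F-measure.
```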

The best average F-measure over all the query patterns in an experiment using RLCS0 is reported in Table 4. We see that RLCS0 improves the recall, but with a lower precision and an improved F-measure, showing that the flexibility in approximate matching provided by RLCS comes at the cost of additional false positives. The values of ρ, β and τ that give the best F-measure are then fixed for all subsequent experiments to compare the performance of the proposed RLCS variants.

It is observed that patterns composed of smaller repetitive patterns (and hence having ambiguous boundaries) result in a poor precision (e.g. P2 and P3 in Table 2, with a precision of 0.108 and 0.239, respectively). P1 in Table 2, on the contrary, has non-ambiguous boundaries, leading to a good precision of 0.692. The effect of the length of a pattern on precision is also evident. Short patterns (with L = 4) that have non-ambiguous boundaries (e.g. P8 in Table 2, with a precision of 0.384) have a poorer precision than longer patterns with non-ambiguous boundaries (e.g. P1 in Table 2). The reason is that shorter patterns are more prone to errors, as the search algorithm has to match a smaller number of syllables.

The results with the other variants of RLCS are also reported in Table 4. The results from RLCSδ show that the use of

Variant    Parameter   Precision   Recall   F-measure
Baseline   -           0.479       0.254    0.332
RLCS0      δ = 1       0.384       0.395    0.389
RLCSδ      δ = 0.3     0.139       0.466    0.214
RLCSδ      δ = 0.6     0.0837      0.558    0.145
RLCSγ      γ = 1       0.412       0.350    0.378
RLCSγ      γ = 4       0.473       0.268    0.342
RLCSγ      γ = 7       0.482       0.259    0.336
RLCSγ      γ = 9       0.481       0.258    0.335

Table 4: Performance of different RLCS variants using the best performing parameter settings for RLCS0 (ρ = 0.875, β = 0.76 and τ = 0.6).

a timbral syllable distance measure with a higher threshold δ further improves the recall, but with a much lower precision and F-measure. Although the distance measure lets us find matches containing substitution errors, we also retrieve additional spurious matches, contributing further false positives. On the contrary, using a non-linear warping function f(·) in RLCSγ improves the precision with a higher value of γ. The penalties on matches with a higher number of insertions and deletions are high, and such matches are left out, leading to a good precision at the cost of recall. We observe that both of the above variants improve either precision or recall at the cost of the other measure. They need further exploration, with better timbral similarity measures, to be combined in an effective way to improve search performance.

6. SUMMARY

We addressed the unexplored problem of discovering syllabic percussion patterns in tabla solo recordings. The presented formulation used a parallel corpus of audio recordings and syllabic scores to create a set of query patterns, which were then searched for in audio that was automatically transcribed into syllables. We used a simplistic definition of a pattern and explored an RLCS based subsequence search algorithm, using an HMM based automatic transcription. Compared to a baseline, we showed that the use of approximate string search algorithms improved the recall at the cost of precision. Additionally, the proposed variants improved either the precision or the recall, but do not provide a significant improvement in the F-measure over the basic RLCS.

For future work, we aim to improve the syllable boundaries output by transcription using onset detection. Inclusion of rhythmic information can be an interesting aspect in defining and discovering percussion patterns, and will help in comprehensively evaluating the task of pattern discovery. The next steps would be to incorporate better timbral similarity measures and segment boundaries into the RLCS algorithm, effectively combining the proposed variants.

Acknowledgments

This work is partly supported by the European Research Council under the European Union's Seventh Framework Program, as a part of the CompMusic project (ERC grant agreement 267583). The authors thank Pandit Arvind Mulgaonkar for sharing the DVD of tabla solo recordings.



7. REFERENCES

[1] A. Anantapadmanabhan, A. Bellur, and H. A. Murthy. Modal analysis and transcription of strokes of the mridangam using non-negative matrix factorization. In Proc. of the 38th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 181–185, Vancouver, Canada, May 2013.

[2] J. J. Aucouturier and F. Pachet. Music similarity measures: What's the use? In Proc. of the 3rd International Conference on Music Information Retrieval (ISMIR), pages 157–163, Paris, France, 2002.

[3] J. P. Bello, L. Daudet, S. Abdallah, C. Duxbury, M. Davies, and M. B. Sandler. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing, 13(5):1035–1047, September 2005.

[4] S. Beronja. The Art of the Indian Tabla. Rupa and Co., New Delhi, 2008.

[5] A. Chandola. Music as Speech: An Ethnomusicolinguistic Study of India. Navrang, 1988.

[6] P. Chordia. Segmentation and recognition of tabla strokes. In Proc. of the 6th International Conference on Music Information Retrieval (ISMIR), pages 107–114, London, UK, September 2005.

[7] P. Chordia, A. Sastry, and S. Şentürk. Predictive tabla modelling using variable-length Markov and hidden Markov models. Journal of New Music Research, 40(2):105–118, 2011.

[8] S. Dutta and H. A. Murthy. A modified rough longest common subsequence algorithm for motif spotting in an alapana of Carnatic music. In Proc. of the 20th National Conference on Communications (NCC), pages 1–6, Kanpur, India, February 2014.

[9] O. Gillet and G. Richard. Automatic labelling of tabla signals. In Proc. of the 4th International Conference on Music Information Retrieval (ISMIR), Baltimore, USA, October 2003.

[10] R. S. Gottlieb. Solo Tabla Drumming of North India: Its Repertoire, Styles, and Performance Practices. Motilal Banarsidass Publishers, 1993.

[11] X. Huang and L. Deng. An overview of modern speech recognition. In N. Indurkhya and F. J. Damerau, editors, Handbook of Natural Language Processing, Chapman & Hall/CRC Machine Learning & Pattern Recognition, pages 339–366. Chapman and Hall/CRC, 2nd edition, February 2010.

[12] D. Hughes. No nonsense: the logic and power of acoustic-iconic mnemonic systems. British Journal of Ethnomusicology, 9(2):93–120, 2000.

[13] J. Kuriakose, J. C. Kumar, P. Sarala, H. A. Murthy, and U. K. Sivaraman. Akshara transcription of mrudangam strokes in Carnatic music. In Proc. of the 21st National Conference on Communications (NCC), Mumbai, India, February 2015.

[14] H. Lin, H. Wu, and C. Wang. Music matching based on rough longest common subsequence. Journal of Information Science and Engineering, 27(1):95–110, 2011.

[15] M. Miron. Automatic Detection of Hindustani Talas. Master's thesis, Music Technology Group, Universitat Pompeu Fabra, 2011.

[16] T. Nakano, J. Ogata, M. Goto, and Y. Hiraga. A drum pattern retrieval method by voice percussion. In Proc. of the 5th International Conference on Music Information Retrieval (ISMIR), pages 550–553, October 2004.

[17] F. Pachet and J. J. Aucouturier. Improving timbre similarity: How high is the sky? Journal of Negative Results in Speech and Audio Sciences, 1(1):1–13, 2004.

[18] A. D. Patel and J. R. Iversen. Acoustic and perceptual comparison of speech and drum sounds in the North Indian tabla tradition: An empirical study of sound symbolism. In Proc. of the 15th International Congress of Phonetic Sciences (ICPhS), pages 925–928, Barcelona, Spain, 2003.

[19] R. Smith. An overview of the Tesseract OCR engine. In Proc. of the Ninth International Conference on Document Analysis and Recognition (ICDAR), volume 2, pages 629–633, Washington, DC, USA, 2007.

[20] A. Srinivasamurthy, R. Caro, H. Sundar, and X. Serra. Transcription and recognition of syllable based percussion patterns: The case of Beijing Opera. In Proc. of the 15th International Society for Music Information Retrieval Conference (ISMIR), pages 431–436, Taipei, Taiwan, October 2014.

[21] E. Tsunoo, G. Tzanetakis, N. Ono, and S. Sagayama. Beyond timbral statistics: Improving music classification using percussive patterns and bass lines. IEEE Transactions on Audio, Speech, and Language Processing, 19(4):1003–1014, 2011.

[22] R. Typke, F. Wiering, and R. C. Veltkamp. A survey of music information retrieval systems. In Proc. of the 6th International Conference on Music Information Retrieval (ISMIR), pages 153–160, London, UK, September 2005.

[23] S. J. Young, G. Evermann, M. J. F. Gales, T. Hain, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. C. Woodland. The HTK Book, version 3.4. Cambridge University Engineering Department, Cambridge, UK, 2006.
