chapter 5 bol identification of tabla 5.1...

97

CHAPTER 5

BOL IDENTIFICATION OF TABLA

5.1 Introduction

The importance of tabla in Indian Classical music has been lived through

centuries and so is the desire of Indians to learn and master it. Rhythmic structure is an

essential component in representing and understanding tabla. This rhythmic structure is

represented by tempo of a taal. Generally there are three different tempos in which the

various taals can be played on a tabla. These are Vilambit, Madhya and Drut. Taal is

structured into two or more sections, each having the same or different numbers of bols.

The particular arrangement of bols and silence defines the unique character of each taal.

This research work explores the use of signal processing techniques to provide

accurate identification of tempo and transcription of tabla bols. Transcription refers to the

process of converting the audio recording to a symbolic representation similar to a

musical score. Thus students of tabla could use the transcription system to slow down

and learn a fast complicated rhythm recorded by their guru (master). The system can be

used to verify if the bols played are correct and according to the desired tempo. Further

the system can also be used for evaluation of the confidence level of learners and

providing some suggestions for improvisations. Another use of this system may be,

someone needing a particular taal and retrieving all the music having composed of that

taal. Pre-recorded audio databases of taals could also be used in music compositions.

Such databases typically gather a large number of taals referenced by their tempo and

general style. Searching an appropriate taal only based on tempo and style becomes

98

rather tedious. There is, therefore, a need for content-based methods that would allow

searching in these databases more efficiently that is with more natural or specific queries.

An essential aspect of such a searching tool is the necessity to obtain beforehand an

automatic transcription of tabla bols. This work aims to describe the relevance of a new

form of instruction for tabla, essentially by an electronic tabla tutor.

5.2 Description of Tabla Bols

Tabla is the most well-known of all Indian drums. Amir Khusro created it when

he split the ancient Indian drum, the Pakhawaj into two parts. The tabla shown in Fig 5.1

is a pair of drums traditionally used for accompaniment of Indian classical music [21].

The bayan is the metallic bass drum, played by left hand. It is used to produce a loud

resonant sound or a damped non resonant one. The dayan is the wooden treble drum,

played by right hand. A larger variety of sounds is produced on this drum. The dayan has

a black patch in the centre which is made of a mixture of iron, iron oxides, resin, gum etc.

and is stuck firmly on to the membrane. The thickness of this patch decreases radially

outwards. The bayan has a wider membrane, and has a black patch similar to the one in

the right drum, except that it is asymmetrically placed on one side of the membrane. It is

worth noting here that the bayan is not used to produce harmonics but to provide lower

frequencies in the overall sound while the tabla is being played.

99

Fig 5.1 Indian Tabla

The key difference between Indian and Western drum is the absence of central

loading in the case of Western drums. Indian Classical music is such that a recital usually

consists of a single performer, and a couple of accompaniments. In this situation, the

accompaniment (usually a drum), the tabla for instance, must be harmonic, since

otherwise, the aharmonicity would completely disrupt the recital. Western drums,

however, play a very different role in Western Classical music. They provide a rhythm to

the music produced by the rest of the instruments, and (strictly speaking) do not need to

produce harmonics. The tabla is a percussion instrument widely used in (North) Indian

classical and semi-classical music, and which has gained more and more recognition

among fusion musicians. As for a number of percussive instruments (such as congas,

drum), the characteristics of the audio signal produced is rather impulsive and successive

events are therefore rather easily detected. Such instruments produce only limited sound

mixtures (1 or 2 events at the same time where in polyphonic music over 50 notes can be

played simultaneously). However, the transcription of tabla signals is quite challenging

Bayan Dayan

100

because there are strong time dependencies that are essential to take into account for a

successful transcription. There are two types of time dependencies:

1. Firstly, the instrument can produce long, resonant strokes which will overlap and alter

the timbre of the following strokes.

2. Secondly, the symbol used to represent a stroke can depend on the context in which it

is used.

A mnemonic syllable or bol is associated to each of these strokes. Since the

musical tradition in India is mostly oral, compositions are often transmitted from masters

to students as sung bols. Common bols are : Ghe, Ke (bayan bols), Na, Ta, Kat, Din, TiTa

(dayan bols). Strokes on the bayan and dayan can be combined, to form compound bols:

Dha, Dhin, Tin, and TiRaKiTa. Because of these characteristics, even if two drums are

played simultaneously, the transcription can always be considered as monophonic - a

single symbol is used even if the corresponding stroke is compound. A specification of

this notation system is that two different bols can sound very similar. For example, Ti and

Te nearly sound the same, but are played with different fingers. Some bols exhibit

another characteristic in which different bols are concatenated, e.g. TiRaKiTa. Such

words are used as compositional units and make the recitation and memorization of a

piece easier.

A tabla taal is composed of successive bols of different durations (note, half-note,

quarter note). A taal can have different number of bols (for example 6, 7, 8, 10, 12, 16)

which are grouped in vibhaags. Each vibhaag is separated by a vertical bar, for example

Dha Dhin Dhin Dha | Dha Dhin Dhin Dha|

101

Taal is a framework in time. Taal is structured into two or more sections, each

having the same or different numbers of bols. The particular arrangement of audible

sounds and silence defines the unique character of each Taal. Since drums are used to

maintain the flow of Taal in music, the character of Taal becomes vivid when manifested

on a Tabla. The technical term for this manifestation of Taal on a Tabla is Theka. Ek

Taal, Chau Taal, Deepchandi, Jhap Taal, Rupak Taal, Teen Taal, Dadra Taal, Kherva

Taal, Bhajani Taal, Sul Taal, Dhammar Taal, Jhumra Taal, Khempta Taal, Pashtu

Taal are the different types of Taals in Indian Classical Music.

Here are some Thekas of popular Indian Taals[63].

Dadra: Dadra is in 6 beats divided into 3+3. (Dha Dhin Na, Dha Tin Na).

Teentaal: Teentaal is in 16 beats divided into 4+4+4+4. (Dha Dhin Dhin Dha, Dha Dhin Dhin Dha, Dha Tin Tin Ta, Ta Dhin Dhin Dha).

Ektaal : Ektaal is in 12 beats divided into 2+2+2+2+2+2. (Dhin Dhin, DhaGe TiRaKiTa, Tu Na, Kat Tin, DhaGe TiRaKiTa, Dhi Na).

Tevara : Tevara is in 7 beats divided into 3+2+2. (Dha Din Ta, Tit Kat, Gadi Gana)

Rupak: Rupak is in 7 beats divided into 3+2+2. (Tin Tin Na, Dhi Na, Dhi Na).

Keherva: Keherva is in 4 beats divided into 2+2. (DhaGe NaTi, NaKa DhiNa).

Deepchandi: Deepchandi is in 14 beats divided into 3+4+3+4. (Dha Dhin -, Dha Dha Tin -, Ta Tin -, Dha Dha Dhin -).

5.3Transcription of Tabla Bols

The aim of this system is to transcribe Tabla bols into a higher level of

representation for indexing and retrieval applications. Hence a complete taal can also be

transcribed. The input audio signal is recorded at 16 KHz, 16-bit mono and stored in .wav

102

file format. Initially an autocorrelation technique is used to extract the tempo of recorded

taal. Based on the knowledge of tempo of taal, and using the gradient function an

individual bol is extracted from the recorded data. This individual separate bol is then

used for MFCC feature extraction. The system is initially trained to form a database

which consists of MFCC features of individual bols. Whenever any arbitrary input taal is

to be tested, above process is followed up to extraction of the MFCC features and then

classification techniques such as DTW and VQ are applied to transcribe the bol. The

system architecture is thus based on three major parts:

1. Tempo Extraction

2. Bol Segmentation

3. Features Extraction and Classification.

The overall architecture of the system is depicted in Fig 5.2.

Fig 5.2 System architecture

5.3.1 Tempo Identification

Most automatic beat detection systems provide a running estimate of the main

beat and an estimate of its strength. One of the common automatic beat detector

103

structures consists of filter bank decomposition, followed by an envelope extraction step

and finally a periodicity detection algorithm which is used to detect the lag at which the

signal’s envelope is most similar to itself. Such an algorithm is described in [22].

However extracting beat information and subsequently the tempo information faithfully

using this algorithm requires a recorded signal greater than approximately 30 seconds due

to which the tempo identification is delayed and it hampers the real time processing. So

autocorrelation function (ACF) is used to extract the tempo of recorded taal. This

technique is faster as compared to other tempo finding algorithms.

The tempo finding algorithm is described below. Initially the recorded taal in

.wav format is read. Fig 5.3 shows the recorded signal of Teentaal.

Fig 5.3 Recorded signal of Teentaal

Then average energy based normalization of the signal is done. The average

energy normalized signal is shown in Fig 5. 4.

0 0.5 1 1.5 2 2.5 3 3.5

x 104

-0.8

-0.6

-0.4

-0.2

0

0.2

0.4

0.6

0.8

1

Samples

Am

plitu

de

104

Fig 5.4 Average energy normalized signal

Energy of the signal is calculated frame wise by dividing the signal into frames of

size 32 samples (2 ms). The purpose of keeping the frame size so small is to capture the

bol onsets and offsets faithfully. Then, mean of the energy signal is set as threshold. All

the frames having energy below the threshold are set to zero. This removes the unwanted

oscillations towards the end of a bol and makes the algorithm more effective in finding

the tempo. Fig 5.5 shows the energy of signal after thresholding.

Fig 5.5 Energy of the audio signal after thresholding

0 0.5 1 1.5 2 2.5 3 3.5

x 104

-0.25

-0.2

-0.15

-0.1

-0.05

0

0.05

0.1

0.15

0.2

0.25

Am

plitu

de

Samples

0.5 1 1.5 2 2.5 3

x 104

0

0.5

1

1.5

2

2.5

3

Ene

rgy

Samples

105

After the thresholding, autocorrelation of the energy signal computed. The

autocorrelation plot is shown in Fig 5.6.

Fig 5.6 Autocorrelation of Energy signal

However the autocorrelation obtained is not smooth and hence peak finding

algorithm cannot be applied directly. The autocorrelation is first smoothened and then

peak finding algorithm is applied. After applying the smoothening function to

autocorrelation signal in Fig 5.6 the resultant smoothened autocorrelation signal is shown

in Fig 5.7.

Fig 5.7 Smoothened autocorrelation of Energy signal

0 100 200 300 400 500 600 700-0.4

-0.2

0

0.2

0.4

0.6

0.8

1

Frames

Am

plitu

de

0 100 200 300 400 500 600 700-0.4

-0.2

0

0.2

0.4

0.6

0.8

1

X: 166

Y: 0.4407

Am

plitu

de

Frames

106

Peak finding algorithm is then applied to the smoothened autocorrelation signal

whose first peak acf�� gives the tempo of recorded taal. The tempo (bols per second)

is formulated as follows

�� =16000

�� ∗ 32 (5.3.1.1)

where, 16000 is sampling frequency in Hz and 32 is the number of samples in a frame.

Tempo Finding Algorithm using ACF

The flowchart below in Fig 5.8 describes the algorithm for finding tempo of a taal

using autocorrelation function.

Fig 5.8 Tempo finding using ACF

107

5.3.2 Bol Segmentation

Due to the impulsiveness of the bols, it seems appropriate to segment the taal

signal into individual events. Each segment then corresponds to a bol on the tabla and

can be labeled accordingly. To segment the bols, an onset detection algorithm based on

gradient function is applied. The tempo of the recorded taal is identified using procedure

in section 5.3.1. Gradient which is the first difference of the energy signal is calculated as

shown in Fig 5.9.

Fig 5.9 Gradient of Energy signal

Peak finding algorithm is then applied to the gradient signal; however multiple

peaks are obtained due to the resonant nature of the signal. Hence only those peaks are

considered which are separated by minimum number of samples equal to 0.4* acf��.

This threshold of 0.4 is chosen so as to obtain peaks of bols comprising of 2 half matras,

whose peaks will be around 0.5* acf��. Finally the peaks obtained represent the onsets

of bols. The intermediate onsets of second half matra of any mixed bol are eliminated

and it is considered as one bol. (e.g. Dhage is treated as one bol). The samples between

0 100 200 300 400 500 600 700 800 900 10000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

Gra

dien

t of

Ene

rgy

Frames

108

two successive onsets represent a bol. Thus the bols are segmented from the recorded

taal. These onsets of bols of a taal obtained are shown in Fig 5.10.

Fig 5.10 Onsets of bols in a taal

5.3.3 Features Extraction

In order to extract the information for transcription of Tabla bols, MFCC features

are used. The motivation for using MFCC was due to the fact that the auditory response

of the human ear resolves frequencies non-linearly. MFCC has been widely used in

speech processing applications. MFCC has the benefit that, it has the capability of

capturing the phonetically important characteristics of speech. Also band limiting can be

employed easily as discussed in section 3.4.6.

5.4 Bol Classification Techniques

The performance of two classification techniques popularly used in the area of

speaker identification, speech recognition is evaluated for automatic transcription of tabla

bols. These are Dynamic Time Warping and Vector Quantization.

0 100 200 300 400 500 600 700 800 900 10000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

Ons

ets

Frames

109

5.4.1 Dynamic Time Warping (DTW)

Dynamic time warping is an algorithm for measuring similarity between two

sequences which may vary in time or speed. A well known application has been

automatic speech recognition, to cope with different speaking speeds. In general, DTW is

a method that allows a computer to find an optimal match between two given sequences

(e.g. time series) with certain restrictions. The sequences are "warped" non-linearly in the

time dimension to determine a measure of their similarity independent of certain non-

linear variations in the time dimension. Continuity is less important in DTW than in other

pattern matching algorithms; DTW is an algorithm particularly suited to matching

sequences with missing information, provided there are long enough segments for

matching to occur. Matching of tabla bols is a similar problem and DTW can be well

utilized to classify the bols.

Templates are prepared from the extracted MFCC coefficients and stored in the

bol’s database for each tempo. Once the tempo of the arbitrary input taal is computed

and bols are separated, template matching algorithm using the DTW technique is

applied to classify the bols. As the tempo is known comparison is done with all bol

templates of that tempo only, which reduces the computational time significantly. DTW

technique follows a dynamic programming approach. DTW takes care of the following

things:

1. Different number of frames in input and template. (i.e. slight tempo variations/ onset

variations)

2. Stressing/slowing down on particular bol/matra.

3. Different Tabla players.

4. Deals with surrounding silence regions.

http://en.wikipedia.org/wiki/Speech_recognition

http://en.wikipedia.org/wiki/Time_series

http://en.wikipedia.org/wiki/Pattern_matching

110

Design consideration for DTW

1. The main modes of distortion are stretching and shrinking of bol segments, owing to

different styles of playing tabla. The distortion is not known apriori.

2. Also, since tabla signals are continuous valued instead of discrete, there is really no

notion of insertion.

3. Every input frame must be matched to some template frame.

4. For meaningful comparison of two different path costs, their lengths must be kept the

same.

5. So, every input frame is to be aligned to a template frame.

DTW for Bol Template Matching In Fig 5.11 a typical warp path for a sequence of bols segments is shown. Ideally

the path should start at top left and move diagonally towards bottom right. The deviation

from diagonal indicates dissimilarity. The typical transitions from one frame to other are

shown in Fig 5.12.

Fig 5.11 2-D matrix of template-input frames of bols

111

Fig 5.12 Typical Transitions used in DTW

Fig 5.13 shows how DTW takes care of templates and inputs of various length.

Fig 5.13 Use of Transition Types

Other kinds of transition methods are also possible.

112

1. Skipping more than one template frame (greater shrink rate).

2. Vertical transitions: the same input frame matches more than one template frame.

This is less often used, as it can lead to different path lengths, making their costs

difficult to compare.

DTW Edge/Node Costs

Typically, there are no edge costs; any edge can be taken with no cost.

Local node costs measure the dissimilarity or distance between the respective

input and template frames.

Since the frame content is a multi-dimensional feature-vector, a simple measure is

Euclidean distance; i.e. geometrically how far one point is from the other in the

multi-dimensional vector space.

For two vectors x = (x1, x2, x3…xN), and y = (y1, y2, y3…yN), the Euclidean

distance between them is:

��(�� − ��)�) � = 1,2,… . �. (5.4.1.1)

Thus, if x and y are the same point, the Euclidean distance = 0.

The farther apart x and y are, the greater the distance.

DTW Path Cost Calculation

Let Pi,j = the best path cost from origin to node [i, j] (i.e., where ith template frame

aligns with j th input frame)

Let Ci,j = the local node cost of aligning template frame i to input frame j

(Euclidean distance between the two vectors).

Then, by the Dynamic Programming formulation:

Pi,j= min (Pi,j-1+ Ci,j , Pi-1,j-1+ Ci,j, Pi-2,j-1+ Ci,j)

Pi,j = min (Pi,j-1, Pi-1,j-1, Pi-2,j-1) + Ci,j

Edge costs are 0, otherwise they should be added to the costs.

If the template is m frames long and the input is n frames long, the best alignment

of the two has the cost = Pm,n.

Note: Any path leading to an internal node (i.e. not m,n) is called a partial path.

113

Fig 5.14 shows the DTW applied for comparing a input bol ‘dha’ with a template of bol

‘dha’. The optimal path is almost a straight line and the minimum distance is zero

indicating that input and template are same. Thus the input bol is identified as dha.

(a) Separated bols

(b) Mel Frequency Cepstral Coefficients

(c) DTW matrix and optimal path

Fig 5.14 DTW path of reference Template ‘dha’ and Input Template ‘dha’

114

Fig 5.15 shows the DTW applied for comparing a input bol ‘dhin’ with a template

bol ‘dha’. The optimal path is not a straight line and the minimum distance is not zero but

some finite value indicating that input and template are different.

(a) Separated bols

(b) Mel Frequency Cepstral Coefficients

(c) DTW matrix and optimal path

Fig 5.15 DTW of reference Template ‘dhin’ and Input Template ‘dha’

Likewise the input bol is compared with all the bol templates and the template

with minimum path cost is selected as the best match for the input bol.

115

5.4.2 Vector Quantization

Block Diagram of VQ

Fig 5.16 Vector Quantization Process

Vector quantization [33, 34, 48, and 49] is used to obtain the requisite codebooks that act

as an acoustic model for each bol. Vector quantization (VQ) technique a one of the most

important and widely used method in speech processing, is formulated as follows. It

assumes that x is a k –dimensional vector whose components are real valued random

variables. In VQ, input x is mapped on to another k-dimensional y, thus

� = q(�) (5.4.2.1)

As y takes on one of the finite set of values Y= {yi} (1≤ i ≤ K). The set Y is referred to

as a codebook and {yi} are code vectors for template. The size K of the codebook is

referred to as a number of levels. To design such a code book, the k-dimensional space of

x is partitioned into K regions {Ci} (1≤ i ≤K) with yi being associated with each region

Ci. When x is quantized as y, a quantization distortion measure d(x, y) can be defined

between x and y. The overall average distortion is then represented by the eqn. (5.4.2.2).

D = lim�→∞

1

M� d[x(n),y(n)]

�

��

(5.4.2.2)

Clustering Algorithm (k-means/LBG)

Codebook M=2k vectors

Quantizer

Training Set of Vectors

Input music/speech vectors Codebook

116

Two conditions are necessary for optimality of K-level quantizer. The first is that the

quantizer be realized by using a minimum-distortion or nearest-neighbor selection rule

shown by equation (5.4.2.3).

q(x) = y� iff d(x,y�) ≤ d�x,y��,j≠ i,1 ≤ j≤ K (5.4.2.3)

The second condition is that each code vector yi be chosen to minimize the average

distortion in region Ci. Such a vector is called the centroid of the region, which depends

on the definition of distortion measure. The codebook is prepared by using the MFCC

coefficients as input data. The codebook preparation is done using either by using

Lloyd’s K-means or LBG algorithm.

Certain advantages or the special features for the use of VQ are given as follows:

Reduced storage for spectral analysis information. Potentially VQ is very efficient as

compared to other existing systems.

Computationally less costly as compared to DTW. Lesser computations are required

to find the perfect match. In many of the comparisons VQ gets represented in the

form of lookup tables. This is another important feature of VQ.

Discrete representation of speech signals. By associating Phonetic labels to each of

the clusters each phoneme can be represented individually in the vector space. A

similar technique is used here for transcription of bols.

Features of VQ Training Set

To robustly train the VQ codebook, the training set of vectors is encompassed

with the possibilities listed below.

1. Taals played by various artists are recorded.

117

2. Taals of different Tempos are recorded.

3. Taals are recorded on different tablas.

4. Taals are recorded at different moments of time.

Making these changes at proper intervals will give a very robust set of feature vectors

which form the very core of the VQ algorithm.

Algorithms for Vector Quantization

The main motive of VQ algorithm is to design an optimal minimum distortion

quantizer. For this purpose two important algorithms have been stated. They are

discussed as follows.

1. Lloyd K-means Algorithm.

Step 1: Initialization.

Set m=0, where m is iterative index.

Choose a set of initial code vectors {yi(0)} where 1 ≤ i ≤ K

Step 2: Classification

Classify the set of training vector { x(n) } ( 1≤ n ≤ M)

(M is number of clusters.)

Classify on the basis of Nearest Neighbor (NN)

i.e. x € Ci (m) iff d [x, yi(m)] ≤ d[x,yj(m)] for j ≠ i

Step 3: Code Vector Updating

Update the code vector of every cluster by computing the centroid of

training vectors in each cluster.

Calculate overall distortion and reiterate the process.

118

Step 4: Termination

If D(m)-D(m-1) ≤ Threshold then stop the process.

Fig. 5.17 Partitioned vector space along with the centroids

2. LBG Algorithm

It is known as Linde- Buzo- Gray (LBG) Algorithm.

Start with just one cluster covering the whole region

Find the Centroid of this region.

Distort the Centroid by ±Δ value.

Find the distance of respective clusters from both the centroids + Δ and – Δ.

After iterating again the centroid will split in two and the process will repeat.

Thus there will be 2, 4, 8, 16 etc. clusters.

The process terminates when there will be no effective change in the code

vectors.

119

5.5 Results

5.5.1 Database

A Tabla taal database consists 20 sets of recordings collected from 10 different

tabla players on different tablas. Each set consists of 4 popular taals in three tempos

recorded as clips of 30 seconds each. A bol database was then prepared consisting of all

bols extracted from these recordings. Since these bols were extracted from recordings of

different taals, the database includes almost all the variations which make the training

robust. The results of the classification models using DTW and VQ are shown in the form

of confusion matrix and also plotted in graphs below. The database comprises of in all 11

bols Dha, Dhin, Dhage, Kat, Na/Ta, Tin/Din, Tita, Gadi, Gana, Kata, Tirakita. Conflict

of identification in bols such as Na and Ta, Tin and Din is due to very subtle changes in

methods of playing. These subtle changes are due to slight changes in position of finger,

which is hardly reconciled by the listener. So the ambiguity between Tin/Din and Na/Ta

has been disregarded.

5.5.2 DTW Results:

The training bols were isolated by the above mentioned bol extraction methods

and the bols to be tested are compared with reference bols. It is to be noted that the

classification and identification done by DTW method is tempo-dependent. Hence, the

confusion matrices of the three tempos are given separately.

The expertise of a tabla player here is classified into three levels.

The first level requires the player to play only the bols on tabla (dayan) and dagga

(bayan).

120

The second level evaluates bols of dayan, bayan and dayan+bayan.

The third level takes all the bols in the first two levels and the bols of quarter and

half matra.

While formulating the confusion matrix each value filled in the matrix is termed

as Classification Ratio. Each row in the confusion matrix represents a bol under test and

each column represents a bol from expert database. Multiple samples of the test bol are

tested and classified among the bols from the database.

Classification ratio is found out as follows.

Cii = No. of bols of ith class matched with ith reference

Total no. of bols of ith class tested (5.5.1.1. )

Classification Accuracy is then calculated as

Classi�ication Accuracy= ∑ Cii

Total no. of rows X 100 % (5.5.1.2)

The confusion matrices for the three tempos at three levels are tabulated as follows:

Table 5.1: Tempo 1 Level 1 ta/na tita Din kat ghe Ke

ta/na 1 0 0 0 0 0

Tita 0 1 0 0 0 0

Din .3 0 .7 0 0 0

Kat 0 0 0 1 0 0

Ghe 0 0 0 0 1 0

Ke 0 0 0 0 0 1

Classification Accuracy: 95%

121

Table 5.2: Tempo 1 Level 2 ta/na Tita Din/tin kat Ghe ke dha Dhin

ta/na 1 0 0 0 0 0 0 0

tita 0 1 0 0 0 0 0 0

Din/tin 0 0 1 0 0 0 0 0

Kat 0 0 0 1 0 0 0 0

Ghe 0 0 0 0 1 0 0 0

Ke 0 0 0 0 0 1 0 0

Dha 0.5 0 0 0 0 0 0.5 0

Dhin 0 0 0.1 0 0 0 0 0.9

Classification Accuracy: 92.5%

Table 5.3: Tempo 1 Level 3 dha Dhin Dhage din/tin gadi gana kat kata na/ta Tirakita tita

Dha 0.5 0 0 0 0 0 0 0 0.5 0 0

Dhin 0 0.9 0 0.1 0 0 0 0 0 0 0

Dhage 0 0 1 0 0 0 0 0 0 0 0

din/tin 0 0 0 1 0 0 0 0 0 0 0

Gadi 0 0 0 0 1 0 0 0 0 0 0

Gana 0 0 0 0 0 1 0 0 0 0 0

Kat 0 0 0 0 0 0 1 0 0 0 0

Kata 0 0 0 0 0 0 0 1 0 0 0

na/ta 0 0 0 0 0 0 0 0 1 0 0

Tirakita 0 0 0 0 0 0 0 0 0 1 0

Tita 0 0 0 0 0 0 0 0 0 0 1


122

Table 5.4: Tempo 2 Level 1 ta/na Tita Din kat ghe Ke

ta/na 1 0 0 0 0 0

Tita 0 1 0 0 0 0

Din 0 0 1 0 0 0

Kat 0 0 0 1 0 0

Ghe 0 0 0 0 1 0

Ke 0 0 0 0 0 1


Table 5.5: Tempo 2 Level 2 ta/na Tita Din kat Ghe Ke dha Dhin

ta/na 1 0 0 0 0 0 0 0

Tita 0 1 0 0 0 0 0 0

Din/tin 0 0 1 0 0 0 0 0

kat 0 0 0 1 0 0 0 0

ghe 0 0 0 0 1 0 0 0

Ke 0 0 0 0 0 1 0 0

dha 0 0 0 0 0 0 1 0

dhin 0 0 0 0 0 0 0 1


123

Table 5.6: Tempo 2 Level 3

dha Dhin Dhage din/tin gadi gana kat kata na/ta tirakita tita

Dha 1 0 0 0 0 0 0 0 0 0 0

Dhin 0 1 0 0 0 0 0 0 0 0 0

Dhage 0 0 0.3 0 0.7 0 0 0 0 0 0

din/tin 0 0 0 1 0 0 0 0 0 0 0

Gadi 0 0 0 0 1 0 0 0 0 0 0

Gana 0 0 0 0 0 1 0 0 0 0 0

Kat 0 0 0 0 0 0 1 0 0 0 0

Kata 0 0 0 0 0 0 0 1 0 0 0

na/ta 0 0 0 0 0 0 0 0 1 0 0

tirakita 0 0 0 0 0 0 0 0 0 0.7 0.3

Tita 0 0 0 0 0 0 0 0 0 0 1


Table 5.7: Tempo 3 Level 1

ta/na Tita Din Kat ghe Ke

ta/na 1 0 0 0 0 0

Tita 0 1 0 0 0 0

Din 0.1 0 0.9 0 0 0

Kat 0 0 0 1 0 0

Ghe 0 0 0 0 1 0

Ke 0 0 0 0 0 1


124

Table 5.8: Tempo 3 Level 2 ta/na Tita Din/tin Kat Ghe ke dha Dhin

ta/na 1 0 0 0 0 0 0 0

Tita 0 1 0 0 0 0 0 0

Din/tin 0.15 0 0.85 0 0 0 0 0

Kat 0 0 0 1 0 0 0 0

Ghe 0 0 0 0 1 0 0 0

Ke 0 0 0 0 0 1 0 0

Dha 0.2 0 0 0 0 0 0.7 0

Dhin 0 0 0 0 0 0 0 0.9


Table 5.9: Tempo 3 Level 3 dha Dhin dhage din/tin Gadi gana Kat kata na/ta Tirakita tita

Dha 0.5 0 0.3 0.1 0 0 0 0 0.1 0 0

Dhin 0 0.9 0 0.1 0 0 0 0 0 0 0

Dhage 0 0.1 0.2 0.2 0.4 0 0 0 0.1 0 0

din/tin 0 0 0 0.9 0 0 0 0 0.1 0 0

Gadi 0 0 0 0 1 0 0 0 0 0 0

Gana 0 0 0 0.2 0 0.8 0 0 0 0 0

Kat 0 0 0 0 0 0 1 0 0 0 0

Kata 0 0 0 0 0 0 0 1 0 0 0

na/ta 0 0 0 0 0 0 0 0 1 0 0

Tirakita 0 0 0 0 0 0 0 0 0 1 0

Tita 0 0 0 0 0 0 0 0 0 0 1


125

Table 5.10: Bol Classification Accuracies

A graph shown in Fig 5.18 gives bol classification accuracy for Tempo1 (Slow),

Tempo2 (Medium), and Tempo3 (Fast) using DTW. The classification based on DTW

reconciled bols with slight temporal variations but was not capable of handling large

temporal variations.

Fig 5.18 Bol Classification Accuracy using DTW

Level 1 Level 2 Level 3

Tempo 1 95% 92.5% 94.5%

Tempo 2 100% 100% 90.9%

Tempo 3 98.3% 93.12% 84.5%

126

5.5.3 VQ Results

In case of vector quantization, identification is done in both tempo-dependent and

tempo independent forms .For tempo dependent, codebook formation is done by using

bols of the same tempo. Thus there are different codebooks for a bol in each tempo. The

tempo of the input clip is extracted and comparison is done with templates of that tempo

only. In case of the tempo independent VQ model, all training bols from all three tempos

are used to form codebooks. Thus there is only one codebook for each bol irrespective of

the tempo. Results for level 3 in all 3 tempos are given below in form of confusion

matrix.

Table 5.11: Tempo 1 Dha Dhin dhage din/tin gadi gana kat Kata na/ta Tirakita tita

Dha 1 0 0 0 0 0 0 0 0 0 0

Dhin 0 1 0 0 0 0 0 0 0 0 0

Dhage 0 0 1 0 0 0 0 0 0 0 0

din/tin 0 0 0 1 0 0 0 0 0 0 0

Gadi 0 0 0 0 1 0 0 0 0 0 0

Gana 0 0 0 0 0 1 0 0 0 0 0

Kat 0 0 0 0 0 0 1 0 0 0 0

Kata 0 0 0 0 0 0 0 1 0 0 0

na/ta 0 0 0 0 0 0 0 0 1 0 0

Tirakita 0 0 0 0 0 0 0 0 0 1 0

Tita 0 0 0 0 0 0 0 0 0 0 1

Classification Accuracy 100%

127

Table 5.12: Tempo 2 Dha Dhin dhage din/tin gadi gana kat Kata na/ta Tirakita tita

Dha 1 0 0 0 0 0 0 0 0 0 0

Dhin 0 0.9 0 0 0 0 0.1 0 0 0 0

Dhage 0 0 1 0 0 0 0 0 0 0 0

din/tin 0 0 0 1 0 0 0 0 0 0 0

Gadi 0 0 0 0 1 0 0 0 0 0 0

Gana 0 0 0 0 0 1 0 0 0 0 0

Kat 0 0 0 0 0 0 1 0 0 0 0

Kata 0 0 0 0 0 0 0 1 0 0 0

na/ta 0 0 0 0 0 0 0 0 1 0 0

Tirakita 0 0 0 0 0 0 0 0 0 1 0

Tita 0 0 0 0 0 0 0 0 0 0 1

Classification Accuracy 99%

Table 5.13: Tempo 3

Dha Dhin dhage din/tin gadi gana kat Kata na/ta Tirakita Tita

Dha 0.8 0.1 0 0.1 0 0 0 0 0 0 0

Dhin 0 1 0 0 0 0 0 0 0 0 0

Dhage 0 0 1 0 0 0 0 0 0 0 0

din/tin 0 0 0 .9 0 0 0 0 0.1 0 0

Gadi 0 0 0 0 1 0 0 0 0 0 0

Gana 0 0 0 0 0 1 0 0 0 0 0

Kat 0 0 0 0 0 0 1 0 0 0 0

Kata 0 0 0 0 0 0 0 1 0 0 0

na/ta .1 0 0 0 0 0 0 0 .9 0 0

Tirakita 0 0 0 0 0 0 0 0 0 1 0

Tita 0 0 0 0 0 0 0 0 0 0 1

Classification Accuracy 96.4%

128

Table 5.14: Tempo Independent VQ Dha Dhin Dhage din/tin gadi gana kat kata na/ta tirakita Tita

Dha 0.85 0.04 0 0.04 0 0 0 0 0.07 0 0

Dhin 0 0.89 0 0.04 0 0 0 0 0 0.07 0

Dhage 0 0 1 0 0 0 0 0 0 0 0

din/tin 0 0.025 0 .95 0 0 0 0 0.025 0 0

Gadi 0 0 0 0 1 0 0 0 0 0 0

Gana 0 0 0 0 0 1 0 0 0 0 0

Kat 0 0 0 0 0 0 1 0 0 0 0

Kata 0 0 0 0 0 0 0 1 0 0 0

na/ta 0.07 0 0.04 0 0 0 0 0 0.89 0 0

tirakita 0 0 0 0 0 0 0 0 0 1 0

Tita 0 0 0 0 0 0 0 0 0 0 1

Classification Accuracy 96.2%

VQ provided more robust classification over DTW as evident from the graphs

shown in Fig 5.19 and bol classification accuracies in Table 5.15. Only level 3 bol

classification accuracies are compared in the table 5.15.

129

Fig 5.19 Bol Classification Accuracy using VQ

Table 5.15: Comparison of bol classification accuracies using DTW and VQ Tempo 1 Tempo 2 Tempo 3 Tempo

Independent

DTW 94.5% 90.9% 84.5% NA

VQ 100% 99% 96.4% 96.2%

Moreover, complete temporal independencies were achieved in VQ with accuracy

of about 96.20% which is far better than DTW. The results are state of the art of their

own kind.

As seen from the Table 5.16 below, experimentation is done by changing the

codebook size and the number of MFCC coefficients. As the codebook size increases

from 16 to 32, the bol identification accuracy increases. As the number of MFCC

coefficients reduces, the accuracy decreases. Hence we choose the optimum values for

codebook size as 32 and MFCC coefficients as 10 for accurate classification of bols.

130

Table 5.16: Accuracy Variations according to the number of Clusters and MFCC Number of Clusters Number of MFCC

Coefficients

Bol Identification

Accuracy

16 12

10

8

93.2%

93.2%

82.1%

32 12

10

8

96.4%

96.4%

87.6%

5.6 Summary

A very robust bol extraction algorithm is developed. The proposed algorithm for

finding tempo of tabla taal recordings, based on autocorrelation technique is very fast

and computationally efficient as compared to other techniques such as beat histogram.

The novel gradient method proposed for ‘Bol onset detection’ in a tabla taal recordings is

very robust and capable of giving precise segmentation of bols (Tolerance of 16 samples

at 16 KHz sampling rate).

Dynamic Time Warping (DTW) can reconcile bols with slight temporal

variations. But DTW is not capable of handling large temporal variations. DTW based

method cannot be used for tempo independent bol classification. Vector Quantization

(VQ) provides more robust classification over DTW. Moreover, complete temporal

independencies are achieved in VQ with accuracy of about 96.2%, which is far better

than DTW.

The system of Tabla Bol identification can be deployed successfully as Tabla

Tutor which could be used to gauge the accuracy with which an artist is playing Tabla.

chapter 5 bol identification of tabla 5.1...

Documents