chapter 5 bol identification of tabla 5.1...
TRANSCRIPT
97
CHAPTER 5
BOL IDENTIFICATION OF TABLA
5.1 Introduction
The importance of tabla in Indian Classical music has been lived through
centuries and so is the desire of Indians to learn and master it. Rhythmic structure is an
essential component in representing and understanding tabla. This rhythmic structure is
represented by tempo of a taal. Generally there are three different tempos in which the
various taals can be played on a tabla. These are Vilambit, Madhya and Drut. Taal is
structured into two or more sections, each having the same or different numbers of bols.
The particular arrangement of bols and silence defines the unique character of each taal.
This research work explores the use of signal processing techniques to provide
accurate identification of tempo and transcription of tabla bols. Transcription refers to the
process of converting the audio recording to a symbolic representation similar to a
musical score. Thus students of tabla could use the transcription system to slow down
and learn a fast complicated rhythm recorded by their guru (master). The system can be
used to verify if the bols played are correct and according to the desired tempo. Further
the system can also be used for evaluation of the confidence level of learners and
providing some suggestions for improvisations. Another use of this system may be,
someone needing a particular taal and retrieving all the music having composed of that
taal. Pre-recorded audio databases of taals could also be used in music compositions.
Such databases typically gather a large number of taals referenced by their tempo and
general style. Searching an appropriate taal only based on tempo and style becomes
98
rather tedious. There is, therefore, a need for content-based methods that would allow
searching in these databases more efficiently that is with more natural or specific queries.
An essential aspect of such a searching tool is the necessity to obtain beforehand an
automatic transcription of tabla bols. This work aims to describe the relevance of a new
form of instruction for tabla, essentially by an electronic tabla tutor.
5.2 Description of Tabla Bols
Tabla is the most well-known of all Indian drums. Amir Khusro created it when
he split the ancient Indian drum, the Pakhawaj into two parts. The tabla shown in Fig 5.1
is a pair of drums traditionally used for accompaniment of Indian classical music [21].
The bayan is the metallic bass drum, played by left hand. It is used to produce a loud
resonant sound or a damped non resonant one. The dayan is the wooden treble drum,
played by right hand. A larger variety of sounds is produced on this drum. The dayan has
a black patch in the centre which is made of a mixture of iron, iron oxides, resin, gum etc.
and is stuck firmly on to the membrane. The thickness of this patch decreases radially
outwards. The bayan has a wider membrane, and has a black patch similar to the one in
the right drum, except that it is asymmetrically placed on one side of the membrane. It is
worth noting here that the bayan is not used to produce harmonics but to provide lower
frequencies in the overall sound while the tabla is being played.
99
Fig 5.1 Indian Tabla
The key difference between Indian and Western drum is the absence of central
loading in the case of Western drums. Indian Classical music is such that a recital usually
consists of a single performer, and a couple of accompaniments. In this situation, the
accompaniment (usually a drum), the tabla for instance, must be harmonic, since
otherwise, the aharmonicity would completely disrupt the recital. Western drums,
however, play a very different role in Western Classical music. They provide a rhythm to
the music produced by the rest of the instruments, and (strictly speaking) do not need to
produce harmonics. The tabla is a percussion instrument widely used in (North) Indian
classical and semi-classical music, and which has gained more and more recognition
among fusion musicians. As for a number of percussive instruments (such as congas,
drum), the characteristics of the audio signal produced is rather impulsive and successive
events are therefore rather easily detected. Such instruments produce only limited sound
mixtures (1 or 2 events at the same time where in polyphonic music over 50 notes can be
played simultaneously). However, the transcription of tabla signals is quite challenging
Bayan Dayan
100
because there are strong time dependencies that are essential to take into account for a
successful transcription. There are two types of time dependencies:
1. Firstly, the instrument can produce long, resonant strokes which will overlap and alter
the timbre of the following strokes.
2. Secondly, the symbol used to represent a stroke can depend on the context in which it
is used.
A mnemonic syllable or bol is associated to each of these strokes. Since the
musical tradition in India is mostly oral, compositions are often transmitted from masters
to students as sung bols. Common bols are : Ghe, Ke (bayan bols), Na, Ta, Kat, Din, TiTa
(dayan bols). Strokes on the bayan and dayan can be combined, to form compound bols:
Dha, Dhin, Tin, and TiRaKiTa. Because of these characteristics, even if two drums are
played simultaneously, the transcription can always be considered as monophonic - a
single symbol is used even if the corresponding stroke is compound. A specification of
this notation system is that two different bols can sound very similar. For example, Ti and
Te nearly sound the same, but are played with different fingers. Some bols exhibit
another characteristic in which different bols are concatenated, e.g. TiRaKiTa. Such
words are used as compositional units and make the recitation and memorization of a
piece easier.
A tabla taal is composed of successive bols of different durations (note, half-note,
quarter note). A taal can have different number of bols (for example 6, 7, 8, 10, 12, 16)
which are grouped in vibhaags. Each vibhaag is separated by a vertical bar, for example
Dha Dhin Dhin Dha | Dha Dhin Dhin Dha|
101
Taal is a framework in time. Taal is structured into two or more sections, each
having the same or different numbers of bols. The particular arrangement of audible
sounds and silence defines the unique character of each Taal. Since drums are used to
maintain the flow of Taal in music, the character of Taal becomes vivid when manifested
on a Tabla. The technical term for this manifestation of Taal on a Tabla is Theka. Ek
Taal, Chau Taal, Deepchandi, Jhap Taal, Rupak Taal, Teen Taal, Dadra Taal, Kherva
Taal, Bhajani Taal, Sul Taal, Dhammar Taal, Jhumra Taal, Khempta Taal, Pashtu
Taal are the different types of Taals in Indian Classical Music.
Here are some Thekas of popular Indian Taals[63].
Dadra: Dadra is in 6 beats divided into 3+3. (Dha Dhin Na, Dha Tin Na).
Teentaal: Teentaal is in 16 beats divided into 4+4+4+4. (Dha Dhin Dhin Dha, Dha Dhin Dhin Dha, Dha Tin Tin Ta, Ta Dhin Dhin Dha).
Ektaal : Ektaal is in 12 beats divided into 2+2+2+2+2+2. (Dhin Dhin, DhaGe TiRaKiTa, Tu Na, Kat Tin, DhaGe TiRaKiTa, Dhi Na).
Tevara : Tevara is in 7 beats divided into 3+2+2. (Dha Din Ta, Tit Kat, Gadi Gana)
Rupak: Rupak is in 7 beats divided into 3+2+2. (Tin Tin Na, Dhi Na, Dhi Na).
Keherva: Keherva is in 4 beats divided into 2+2. (DhaGe NaTi, NaKa DhiNa).
Deepchandi: Deepchandi is in 14 beats divided into 3+4+3+4. (Dha Dhin -, Dha Dha Tin -, Ta Tin -, Dha Dha Dhin -).
5.3Transcription of Tabla Bols
The aim of this system is to transcribe Tabla bols into a higher level of
representation for indexing and retrieval applications. Hence a complete taal can also be
transcribed. The input audio signal is recorded at 16 KHz, 16-bit mono and stored in .wav
102
file format. Initially an autocorrelation technique is used to extract the tempo of recorded
taal. Based on the knowledge of tempo of taal, and using the gradient function an
individual bol is extracted from the recorded data. This individual separate bol is then
used for MFCC feature extraction. The system is initially trained to form a database
which consists of MFCC features of individual bols. Whenever any arbitrary input taal is
to be tested, above process is followed up to extraction of the MFCC features and then
classification techniques such as DTW and VQ are applied to transcribe the bol. The
system architecture is thus based on three major parts:
1. Tempo Extraction
2. Bol Segmentation
3. Features Extraction and Classification.
The overall architecture of the system is depicted in Fig 5.2.
Fig 5.2 System architecture
5.3.1 Tempo Identification
Most automatic beat detection systems provide a running estimate of the main
beat and an estimate of its strength. One of the common automatic beat detector
103
structures consists of filter bank decomposition, followed by an envelope extraction step
and finally a periodicity detection algorithm which is used to detect the lag at which the
signal’s envelope is most similar to itself. Such an algorithm is described in [22].
However extracting beat information and subsequently the tempo information faithfully
using this algorithm requires a recorded signal greater than approximately 30 seconds due
to which the tempo identification is delayed and it hampers the real time processing. So
autocorrelation function (ACF) is used to extract the tempo of recorded taal. This
technique is faster as compared to other tempo finding algorithms.
The tempo finding algorithm is described below. Initially the recorded taal in
.wav format is read. Fig 5.3 shows the recorded signal of Teentaal.
Fig 5.3 Recorded signal of Teentaal
Then average energy based normalization of the signal is done. The average
energy normalized signal is shown in Fig 5. 4.
0 0.5 1 1.5 2 2.5 3 3.5
x 104
-0.8
-0.6
-0.4
-0.2
0
0.2
0.4
0.6
0.8
1
Samples
Am
plitu
de
104
Fig 5.4 Average energy normalized signal
Energy of the signal is calculated frame wise by dividing the signal into frames of
size 32 samples (2 ms). The purpose of keeping the frame size so small is to capture the
bol onsets and offsets faithfully. Then, mean of the energy signal is set as threshold. All
the frames having energy below the threshold are set to zero. This removes the unwanted
oscillations towards the end of a bol and makes the algorithm more effective in finding
the tempo. Fig 5.5 shows the energy of signal after thresholding.
Fig 5.5 Energy of the audio signal after thresholding
0 0.5 1 1.5 2 2.5 3 3.5
x 104
-0.25
-0.2
-0.15
-0.1
-0.05
0
0.05
0.1
0.15
0.2
0.25
Am
plitu
de
Samples
0.5 1 1.5 2 2.5 3
x 104
0
0.5
1
1.5
2
2.5
3
Ene
rgy
Samples
105
After the thresholding, autocorrelation of the energy signal computed. The
autocorrelation plot is shown in Fig 5.6.
Fig 5.6 Autocorrelation of Energy signal
However the autocorrelation obtained is not smooth and hence peak finding
algorithm cannot be applied directly. The autocorrelation is first smoothened and then
peak finding algorithm is applied. After applying the smoothening function to
autocorrelation signal in Fig 5.6 the resultant smoothened autocorrelation signal is shown
in Fig 5.7.
Fig 5.7 Smoothened autocorrelation of Energy signal
0 100 200 300 400 500 600 700-0.4
-0.2
0
0.2
0.4
0.6
0.8
1
Frames
Am
plitu
de
0 100 200 300 400 500 600 700-0.4
-0.2
0
0.2
0.4
0.6
0.8
1
X: 166
Y: 0.4407
Am
plitu
de
Frames
106
Peak finding algorithm is then applied to the smoothened autocorrelation signal
whose first peak acf���� gives the tempo of recorded taal. The tempo (bols per second)
is formulated as follows
����� =16000
������� ∗ 32 (5.3.1.1)
where, 16000 is sampling frequency in Hz and 32 is the number of samples in a frame.
Tempo Finding Algorithm using ACF
The flowchart below in Fig 5.8 describes the algorithm for finding tempo of a taal
using autocorrelation function.
Fig 5.8 Tempo finding using ACF
107
5.3.2 Bol Segmentation
Due to the impulsiveness of the bols, it seems appropriate to segment the taal
signal into individual events. Each segment then corresponds to a bol on the tabla and
can be labeled accordingly. To segment the bols, an onset detection algorithm based on
gradient function is applied. The tempo of the recorded taal is identified using procedure
in section 5.3.1. Gradient which is the first difference of the energy signal is calculated as
shown in Fig 5.9.
Fig 5.9 Gradient of Energy signal
Peak finding algorithm is then applied to the gradient signal; however multiple
peaks are obtained due to the resonant nature of the signal. Hence only those peaks are
considered which are separated by minimum number of samples equal to 0.4* acf����.
This threshold of 0.4 is chosen so as to obtain peaks of bols comprising of 2 half matras,
whose peaks will be around 0.5* acf����. Finally the peaks obtained represent the onsets
of bols. The intermediate onsets of second half matra of any mixed bol are eliminated
and it is considered as one bol. (e.g. Dhage is treated as one bol). The samples between
0 100 200 300 400 500 600 700 800 900 10000
0.1
0.2
0.3
0.4
0.5
0.6
0.7
Gra
dien
t of
Ene
rgy
Frames
108
two successive onsets represent a bol. Thus the bols are segmented from the recorded
taal. These onsets of bols of a taal obtained are shown in Fig 5.10.
Fig 5.10 Onsets of bols in a taal
5.3.3 Features Extraction
In order to extract the information for transcription of Tabla bols, MFCC features
are used. The motivation for using MFCC was due to the fact that the auditory response
of the human ear resolves frequencies non-linearly. MFCC has been widely used in
speech processing applications. MFCC has the benefit that, it has the capability of
capturing the phonetically important characteristics of speech. Also band limiting can be
employed easily as discussed in section 3.4.6.
5.4 Bol Classification Techniques
The performance of two classification techniques popularly used in the area of
speaker identification, speech recognition is evaluated for automatic transcription of tabla
bols. These are Dynamic Time Warping and Vector Quantization.
0 100 200 300 400 500 600 700 800 900 10000
0.1
0.2
0.3
0.4
0.5
0.6
0.7
Ons
ets
Frames
109
5.4.1 Dynamic Time Warping (DTW)
Dynamic time warping is an algorithm for measuring similarity between two
sequences which may vary in time or speed. A well known application has been
automatic speech recognition, to cope with different speaking speeds. In general, DTW is
a method that allows a computer to find an optimal match between two given sequences
(e.g. time series) with certain restrictions. The sequences are "warped" non-linearly in the
time dimension to determine a measure of their similarity independent of certain non-
linear variations in the time dimension. Continuity is less important in DTW than in other
pattern matching algorithms; DTW is an algorithm particularly suited to matching
sequences with missing information, provided there are long enough segments for
matching to occur. Matching of tabla bols is a similar problem and DTW can be well
utilized to classify the bols.
Templates are prepared from the extracted MFCC coefficients and stored in the
bol’s database for each tempo. Once the tempo of the arbitrary input taal is computed
and bols are separated, template matching algorithm using the DTW technique is
applied to classify the bols. As the tempo is known comparison is done with all bol
templates of that tempo only, which reduces the computational time significantly. DTW
technique follows a dynamic programming approach. DTW takes care of the following
things:
1. Different number of frames in input and template. (i.e. slight tempo variations/ onset
variations)
2. Stressing/slowing down on particular bol/matra.
3. Different Tabla players.
4. Deals with surrounding silence regions.
110
Design consideration for DTW
1. The main modes of distortion are stretching and shrinking of bol segments, owing to
different styles of playing tabla. The distortion is not known apriori.
2. Also, since tabla signals are continuous valued instead of discrete, there is really no
notion of insertion.
3. Every input frame must be matched to some template frame.
4. For meaningful comparison of two different path costs, their lengths must be kept the
same.
5. So, every input frame is to be aligned to a template frame.
DTW for Bol Template Matching In Fig 5.11 a typical warp path for a sequence of bols segments is shown. Ideally
the path should start at top left and move diagonally towards bottom right. The deviation
from diagonal indicates dissimilarity. The typical transitions from one frame to other are
shown in Fig 5.12.
Fig 5.11 2-D matrix of template-input frames of bols
111
Fig 5.12 Typical Transitions used in DTW
Fig 5.13 shows how DTW takes care of templates and inputs of various length.
Fig 5.13 Use of Transition Types
Other kinds of transition methods are also possible.
112
1. Skipping more than one template frame (greater shrink rate).
2. Vertical transitions: the same input frame matches more than one template frame.
This is less often used, as it can lead to different path lengths, making their costs
difficult to compare.
DTW Edge/Node Costs
Typically, there are no edge costs; any edge can be taken with no cost.
Local node costs measure the dissimilarity or distance between the respective
input and template frames.
Since the frame content is a multi-dimensional feature-vector, a simple measure is
Euclidean distance; i.e. geometrically how far one point is from the other in the
multi-dimensional vector space.
For two vectors x = (x1, x2, x3…xN), and y = (y1, y2, y3…yN), the Euclidean
distance between them is:
��(�� − ��)�) � = 1,2,… . �. (5.4.1.1)
Thus, if x and y are the same point, the Euclidean distance = 0.
The farther apart x and y are, the greater the distance.
DTW Path Cost Calculation
Let Pi,j = the best path cost from origin to node [i, j] (i.e., where ith template frame
aligns with j th input frame)
Let Ci,j = the local node cost of aligning template frame i to input frame j
(Euclidean distance between the two vectors).
Then, by the Dynamic Programming formulation:
Pi,j= min (Pi,j-1+ Ci,j , Pi-1,j-1+ Ci,j, Pi-2,j-1+ Ci,j)
Pi,j = min (Pi,j-1, Pi-1,j-1, Pi-2,j-1) + Ci,j
Edge costs are 0, otherwise they should be added to the costs.
If the template is m frames long and the input is n frames long, the best alignment
of the two has the cost = Pm,n.
Note: Any path leading to an internal node (i.e. not m,n) is called a partial path.
113
Fig 5.14 shows the DTW applied for comparing a input bol ‘dha’ with a template of bol
‘dha’. The optimal path is almost a straight line and the minimum distance is zero
indicating that input and template are same. Thus the input bol is identified as dha.
(a) Separated bols
(b) Mel Frequency Cepstral Coefficients
(c) DTW matrix and optimal path
Fig 5.14 DTW path of reference Template ‘dha’ and Input Template ‘dha’
114
Fig 5.15 shows the DTW applied for comparing a input bol ‘dhin’ with a template
bol ‘dha’. The optimal path is not a straight line and the minimum distance is not zero but
some finite value indicating that input and template are different.
(a) Separated bols
(b) Mel Frequency Cepstral Coefficients
(c) DTW matrix and optimal path
Fig 5.15 DTW of reference Template ‘dhin’ and Input Template ‘dha’
Likewise the input bol is compared with all the bol templates and the template
with minimum path cost is selected as the best match for the input bol.
115
5.4.2 Vector Quantization
Block Diagram of VQ
Fig 5.16 Vector Quantization Process
Vector quantization [33, 34, 48, and 49] is used to obtain the requisite codebooks that act
as an acoustic model for each bol. Vector quantization (VQ) technique a one of the most
important and widely used method in speech processing, is formulated as follows. It
assumes that x is a k –dimensional vector whose components are real valued random
variables. In VQ, input x is mapped on to another k-dimensional y, thus
� = q(�) (5.4.2.1)
As y takes on one of the finite set of values Y= {yi} (1≤ i ≤ K). The set Y is referred to
as a codebook and {yi} are code vectors for template. The size K of the codebook is
referred to as a number of levels. To design such a code book, the k-dimensional space of
x is partitioned into K regions {Ci} (1≤ i ≤K) with yi being associated with each region
Ci. When x is quantized as y, a quantization distortion measure d(x, y) can be defined
between x and y. The overall average distortion is then represented by the eqn. (5.4.2.2).
D = lim�→∞
1
M� d[x(n),y(n)]
�
���
(5.4.2.2)
Clustering Algorithm (k-means/LBG)
Codebook M=2k vectors
Quantizer
Training Set of Vectors
Input music/speech vectors Codebook
116
Two conditions are necessary for optimality of K-level quantizer. The first is that the
quantizer be realized by using a minimum-distortion or nearest-neighbor selection rule
shown by equation (5.4.2.3).
q(x) = y� iff d(x,y�) ≤ d�x,y��,j≠ i,1 ≤ j≤ K (5.4.2.3)
The second condition is that each code vector yi be chosen to minimize the average
distortion in region Ci. Such a vector is called the centroid of the region, which depends
on the definition of distortion measure. The codebook is prepared by using the MFCC
coefficients as input data. The codebook preparation is done using either by using
Lloyd’s K-means or LBG algorithm.
Certain advantages or the special features for the use of VQ are given as follows:
Reduced storage for spectral analysis information. Potentially VQ is very efficient as
compared to other existing systems.
Computationally less costly as compared to DTW. Lesser computations are required
to find the perfect match. In many of the comparisons VQ gets represented in the
form of lookup tables. This is another important feature of VQ.
Discrete representation of speech signals. By associating Phonetic labels to each of
the clusters each phoneme can be represented individually in the vector space. A
similar technique is used here for transcription of bols.
Features of VQ Training Set
To robustly train the VQ codebook, the training set of vectors is encompassed
with the possibilities listed below.
1. Taals played by various artists are recorded.
117
2. Taals of different Tempos are recorded.
3. Taals are recorded on different tablas.
4. Taals are recorded at different moments of time.
Making these changes at proper intervals will give a very robust set of feature vectors
which form the very core of the VQ algorithm.
Algorithms for Vector Quantization
The main motive of VQ algorithm is to design an optimal minimum distortion
quantizer. For this purpose two important algorithms have been stated. They are
discussed as follows.
1. Lloyd K-means Algorithm.
Step 1: Initialization.
Set m=0, where m is iterative index.
Choose a set of initial code vectors {yi(0)} where 1 ≤ i ≤ K
Step 2: Classification
Classify the set of training vector { x(n) } ( 1≤ n ≤ M)
(M is number of clusters.)
Classify on the basis of Nearest Neighbor (NN)
i.e. x € Ci (m) iff d [x, yi(m)] ≤ d[x,yj(m)] for j ≠ i
Step 3: Code Vector Updating
Update the code vector of every cluster by computing the centroid of
training vectors in each cluster.
Calculate overall distortion and reiterate the process.
118
Step 4: Termination
If D(m)-D(m-1) ≤ Threshold then stop the process.
Fig. 5.17 Partitioned vector space along with the centroids
2. LBG Algorithm
It is known as Linde- Buzo- Gray (LBG) Algorithm.
Start with just one cluster covering the whole region
Find the Centroid of this region.
Distort the Centroid by ±Δ value.
Find the distance of respective clusters from both the centroids + Δ and – Δ.
After iterating again the centroid will split in two and the process will repeat.
Thus there will be 2, 4, 8, 16 etc. clusters.
The process terminates when there will be no effective change in the code
vectors.
119
5.5 Results
5.5.1 Database
A Tabla taal database consists 20 sets of recordings collected from 10 different
tabla players on different tablas. Each set consists of 4 popular taals in three tempos
recorded as clips of 30 seconds each. A bol database was then prepared consisting of all
bols extracted from these recordings. Since these bols were extracted from recordings of
different taals, the database includes almost all the variations which make the training
robust. The results of the classification models using DTW and VQ are shown in the form
of confusion matrix and also plotted in graphs below. The database comprises of in all 11
bols Dha, Dhin, Dhage, Kat, Na/Ta, Tin/Din, Tita, Gadi, Gana, Kata, Tirakita. Conflict
of identification in bols such as Na and Ta, Tin and Din is due to very subtle changes in
methods of playing. These subtle changes are due to slight changes in position of finger,
which is hardly reconciled by the listener. So the ambiguity between Tin/Din and Na/Ta
has been disregarded.
5.5.2 DTW Results:
The training bols were isolated by the above mentioned bol extraction methods
and the bols to be tested are compared with reference bols. It is to be noted that the
classification and identification done by DTW method is tempo-dependent. Hence, the
confusion matrices of the three tempos are given separately.
The expertise of a tabla player here is classified into three levels.
The first level requires the player to play only the bols on tabla (dayan) and dagga
(bayan).
120
The second level evaluates bols of dayan, bayan and dayan+bayan.
The third level takes all the bols in the first two levels and the bols of quarter and
half matra.
While formulating the confusion matrix each value filled in the matrix is termed
as Classification Ratio. Each row in the confusion matrix represents a bol under test and
each column represents a bol from expert database. Multiple samples of the test bol are
tested and classified among the bols from the database.
Classification ratio is found out as follows.
Cii = No. of bols of ith class matched with ith reference
Total no. of bols of ith class tested (5.5.1.1. )
Classification Accuracy is then calculated as
Classi�ication Accuracy= ∑ Cii
Total no. of rows X 100 % (5.5.1.2)
The confusion matrices for the three tempos at three levels are tabulated as follows:
Table 5.1: Tempo 1 Level 1 ta/na tita Din kat ghe Ke
ta/na 1 0 0 0 0 0
Tita 0 1 0 0 0 0
Din .3 0 .7 0 0 0
Kat 0 0 0 1 0 0
Ghe 0 0 0 0 1 0
Ke 0 0 0 0 0 1
Classification Accuracy: 95%
121
Table 5.2: Tempo 1 Level 2 ta/na Tita Din/tin kat Ghe ke dha Dhin
ta/na 1 0 0 0 0 0 0 0
tita 0 1 0 0 0 0 0 0
Din/tin 0 0 1 0 0 0 0 0
Kat 0 0 0 1 0 0 0 0
Ghe 0 0 0 0 1 0 0 0
Ke 0 0 0 0 0 1 0 0
Dha 0.5 0 0 0 0 0 0.5 0
Dhin 0 0 0.1 0 0 0 0 0.9
Classification Accuracy: 92.5%
Table 5.3: Tempo 1 Level 3 dha Dhin Dhage din/tin gadi gana kat kata na/ta Tirakita tita
Dha 0.5 0 0 0 0 0 0 0 0.5 0 0
Dhin 0 0.9 0 0.1 0 0 0 0 0 0 0
Dhage 0 0 1 0 0 0 0 0 0 0 0
din/tin 0 0 0 1 0 0 0 0 0 0 0
Gadi 0 0 0 0 1 0 0 0 0 0 0
Gana 0 0 0 0 0 1 0 0 0 0 0
Kat 0 0 0 0 0 0 1 0 0 0 0
Kata 0 0 0 0 0 0 0 1 0 0 0
na/ta 0 0 0 0 0 0 0 0 1 0 0
Tirakita 0 0 0 0 0 0 0 0 0 1 0
Tita 0 0 0 0 0 0 0 0 0 0 1
Classification Accuracy: 94.5%
122
Table 5.4: Tempo 2 Level 1 ta/na Tita Din kat ghe Ke
ta/na 1 0 0 0 0 0
Tita 0 1 0 0 0 0
Din 0 0 1 0 0 0
Kat 0 0 0 1 0 0
Ghe 0 0 0 0 1 0
Ke 0 0 0 0 0 1
Classification Accuracy: 100%
Table 5.5: Tempo 2 Level 2 ta/na Tita Din kat Ghe Ke dha Dhin
ta/na 1 0 0 0 0 0 0 0
Tita 0 1 0 0 0 0 0 0
Din/tin 0 0 1 0 0 0 0 0
kat 0 0 0 1 0 0 0 0
ghe 0 0 0 0 1 0 0 0
Ke 0 0 0 0 0 1 0 0
dha 0 0 0 0 0 0 1 0
dhin 0 0 0 0 0 0 0 1
Classification Accuracy: 100%
123
Table 5.6: Tempo 2 Level 3
dha Dhin Dhage din/tin gadi gana kat kata na/ta tirakita tita
Dha 1 0 0 0 0 0 0 0 0 0 0
Dhin 0 1 0 0 0 0 0 0 0 0 0
Dhage 0 0 0.3 0 0.7 0 0 0 0 0 0
din/tin 0 0 0 1 0 0 0 0 0 0 0
Gadi 0 0 0 0 1 0 0 0 0 0 0
Gana 0 0 0 0 0 1 0 0 0 0 0
Kat 0 0 0 0 0 0 1 0 0 0 0
Kata 0 0 0 0 0 0 0 1 0 0 0
na/ta 0 0 0 0 0 0 0 0 1 0 0
tirakita 0 0 0 0 0 0 0 0 0 0.7 0.3
Tita 0 0 0 0 0 0 0 0 0 0 1
Classification Accuracy: 90.9%
Table 5.7: Tempo 3 Level 1
ta/na Tita Din Kat ghe Ke
ta/na 1 0 0 0 0 0
Tita 0 1 0 0 0 0
Din 0.1 0 0.9 0 0 0
Kat 0 0 0 1 0 0
Ghe 0 0 0 0 1 0
Ke 0 0 0 0 0 1
Classification Accuracy: 98.3%
124
Table 5.8: Tempo 3 Level 2 ta/na Tita Din/tin Kat Ghe ke dha Dhin
ta/na 1 0 0 0 0 0 0 0
Tita 0 1 0 0 0 0 0 0
Din/tin 0.15 0 0.85 0 0 0 0 0
Kat 0 0 0 1 0 0 0 0
Ghe 0 0 0 0 1 0 0 0
Ke 0 0 0 0 0 1 0 0
Dha 0.2 0 0 0 0 0 0.7 0
Dhin 0 0 0 0 0 0 0 0.9
Classification Accuracy: 93.12%
Table 5.9: Tempo 3 Level 3 dha Dhin dhage din/tin Gadi gana Kat kata na/ta Tirakita tita
Dha 0.5 0 0.3 0.1 0 0 0 0 0.1 0 0
Dhin 0 0.9 0 0.1 0 0 0 0 0 0 0
Dhage 0 0.1 0.2 0.2 0.4 0 0 0 0.1 0 0
din/tin 0 0 0 0.9 0 0 0 0 0.1 0 0
Gadi 0 0 0 0 1 0 0 0 0 0 0
Gana 0 0 0 0.2 0 0.8 0 0 0 0 0
Kat 0 0 0 0 0 0 1 0 0 0 0
Kata 0 0 0 0 0 0 0 1 0 0 0
na/ta 0 0 0 0 0 0 0 0 1 0 0
Tirakita 0 0 0 0 0 0 0 0 0 1 0
Tita 0 0 0 0 0 0 0 0 0 0 1
Classification Accuracy: 84.5%
125
Table 5.10: Bol Classification Accuracies
A graph shown in Fig 5.18 gives bol classification accuracy for Tempo1 (Slow),
Tempo2 (Medium), and Tempo3 (Fast) using DTW. The classification based on DTW
reconciled bols with slight temporal variations but was not capable of handling large
temporal variations.
Fig 5.18 Bol Classification Accuracy using DTW
Level 1 Level 2 Level 3
Tempo 1 95% 92.5% 94.5%
Tempo 2 100% 100% 90.9%
Tempo 3 98.3% 93.12% 84.5%
126
5.5.3 VQ Results
In case of vector quantization, identification is done in both tempo-dependent and
tempo independent forms .For tempo dependent, codebook formation is done by using
bols of the same tempo. Thus there are different codebooks for a bol in each tempo. The
tempo of the input clip is extracted and comparison is done with templates of that tempo
only. In case of the tempo independent VQ model, all training bols from all three tempos
are used to form codebooks. Thus there is only one codebook for each bol irrespective of
the tempo. Results for level 3 in all 3 tempos are given below in form of confusion
matrix.
Table 5.11: Tempo 1 Dha Dhin dhage din/tin gadi gana kat Kata na/ta Tirakita tita
Dha 1 0 0 0 0 0 0 0 0 0 0
Dhin 0 1 0 0 0 0 0 0 0 0 0
Dhage 0 0 1 0 0 0 0 0 0 0 0
din/tin 0 0 0 1 0 0 0 0 0 0 0
Gadi 0 0 0 0 1 0 0 0 0 0 0
Gana 0 0 0 0 0 1 0 0 0 0 0
Kat 0 0 0 0 0 0 1 0 0 0 0
Kata 0 0 0 0 0 0 0 1 0 0 0
na/ta 0 0 0 0 0 0 0 0 1 0 0
Tirakita 0 0 0 0 0 0 0 0 0 1 0
Tita 0 0 0 0 0 0 0 0 0 0 1
Classification Accuracy 100%
127
Table 5.12: Tempo 2 Dha Dhin dhage din/tin gadi gana kat Kata na/ta Tirakita tita
Dha 1 0 0 0 0 0 0 0 0 0 0
Dhin 0 0.9 0 0 0 0 0.1 0 0 0 0
Dhage 0 0 1 0 0 0 0 0 0 0 0
din/tin 0 0 0 1 0 0 0 0 0 0 0
Gadi 0 0 0 0 1 0 0 0 0 0 0
Gana 0 0 0 0 0 1 0 0 0 0 0
Kat 0 0 0 0 0 0 1 0 0 0 0
Kata 0 0 0 0 0 0 0 1 0 0 0
na/ta 0 0 0 0 0 0 0 0 1 0 0
Tirakita 0 0 0 0 0 0 0 0 0 1 0
Tita 0 0 0 0 0 0 0 0 0 0 1
Classification Accuracy 99%
Table 5.13: Tempo 3
Dha Dhin dhage din/tin gadi gana kat Kata na/ta Tirakita Tita
Dha 0.8 0.1 0 0.1 0 0 0 0 0 0 0
Dhin 0 1 0 0 0 0 0 0 0 0 0
Dhage 0 0 1 0 0 0 0 0 0 0 0
din/tin 0 0 0 .9 0 0 0 0 0.1 0 0
Gadi 0 0 0 0 1 0 0 0 0 0 0
Gana 0 0 0 0 0 1 0 0 0 0 0
Kat 0 0 0 0 0 0 1 0 0 0 0
Kata 0 0 0 0 0 0 0 1 0 0 0
na/ta .1 0 0 0 0 0 0 0 .9 0 0
Tirakita 0 0 0 0 0 0 0 0 0 1 0
Tita 0 0 0 0 0 0 0 0 0 0 1
Classification Accuracy 96.4%
128
Table 5.14: Tempo Independent VQ Dha Dhin Dhage din/tin gadi gana kat kata na/ta tirakita Tita
Dha 0.85 0.04 0 0.04 0 0 0 0 0.07 0 0
Dhin 0 0.89 0 0.04 0 0 0 0 0 0.07 0
Dhage 0 0 1 0 0 0 0 0 0 0 0
din/tin 0 0.025 0 .95 0 0 0 0 0.025 0 0
Gadi 0 0 0 0 1 0 0 0 0 0 0
Gana 0 0 0 0 0 1 0 0 0 0 0
Kat 0 0 0 0 0 0 1 0 0 0 0
Kata 0 0 0 0 0 0 0 1 0 0 0
na/ta 0.07 0 0.04 0 0 0 0 0 0.89 0 0
tirakita 0 0 0 0 0 0 0 0 0 1 0
Tita 0 0 0 0 0 0 0 0 0 0 1
Classification Accuracy 96.2%
VQ provided more robust classification over DTW as evident from the graphs
shown in Fig 5.19 and bol classification accuracies in Table 5.15. Only level 3 bol
classification accuracies are compared in the table 5.15.
129
Fig 5.19 Bol Classification Accuracy using VQ
Table 5.15: Comparison of bol classification accuracies using DTW and VQ Tempo 1 Tempo 2 Tempo 3 Tempo
Independent
DTW 94.5% 90.9% 84.5% NA
VQ 100% 99% 96.4% 96.2%
Moreover, complete temporal independencies were achieved in VQ with accuracy
of about 96.20% which is far better than DTW. The results are state of the art of their
own kind.
As seen from the Table 5.16 below, experimentation is done by changing the
codebook size and the number of MFCC coefficients. As the codebook size increases
from 16 to 32, the bol identification accuracy increases. As the number of MFCC
coefficients reduces, the accuracy decreases. Hence we choose the optimum values for
codebook size as 32 and MFCC coefficients as 10 for accurate classification of bols.
130
Table 5.16: Accuracy Variations according to the number of Clusters and MFCC Number of Clusters Number of MFCC
Coefficients
Bol Identification
Accuracy
16 12
10
8
93.2%
93.2%
82.1%
32 12
10
8
96.4%
96.4%
87.6%
5.6 Summary
A very robust bol extraction algorithm is developed. The proposed algorithm for
finding tempo of tabla taal recordings, based on autocorrelation technique is very fast
and computationally efficient as compared to other techniques such as beat histogram.
The novel gradient method proposed for ‘Bol onset detection’ in a tabla taal recordings is
very robust and capable of giving precise segmentation of bols (Tolerance of 16 samples
at 16 KHz sampling rate).
Dynamic Time Warping (DTW) can reconcile bols with slight temporal
variations. But DTW is not capable of handling large temporal variations. DTW based
method cannot be used for tempo independent bol classification. Vector Quantization
(VQ) provides more robust classification over DTW. Moreover, complete temporal
independencies are achieved in VQ with accuracy of about 96.2%, which is far better
than DTW.
The system of Tabla Bol identification can be deployed successfully as Tabla
Tutor which could be used to gauge the accuracy with which an artist is playing Tabla.