speech & nlp (fall 2014): basics of phonology & audio processing, zero crossing rate,...
Speech & NLP
Basics of Phonology & Audio Processing,
Zero Crossing Rate,
Dynamic Time Warping
Vladimir Kulyukin
www.vkedco.blogspot.com
Outline
Phonology Basics
Audio Processing
Zero Crossing Rate
Dynamic Time Warping
Phonology Basics
International Phonetic Alphabet
http://en.wikipedia.org/wiki/International_Phonetic_Alphabet
Phones, Allophones, Phonemes
Phone is a unit of speech sound
Allophone is a member of a set of phones used
to pronounce a phoneme
For example, in English, /p/ is a phoneme with
two allophones, [pʰ] and [p]
[pʰ] is an aspirated allophone of /p/ (released
with a puff of breath, as in "pin")
[p] is an unaspirated allophone of /p/ (no puff
of breath, as in "spin")
Intonation, Tone, & Prosody
Text To Speech
TTS Engine Anatomy
Audio Processing
Samples
Samples are successive snapshots of a
specific signal
Audio files are samples of sound waves
Microphones convert acoustic signals into
analog electrical signals; an analog-to-digital
converter then transforms the analog signals
into digital samples
Digital Audio Signal
[Figure: waveform of a digital audio signal; horizontal axis: time, vertical axis: sound pressure]
Amplitude
Amplitude (in audio processing) is a
measure of sound pressure
Amplitude is measured at a specific rate
Amplitude measures result in digital
samples
Some samples have positive values
Some samples have negative values
Digital Approximation Accuracy
Any digitization of analog signals carries some
inaccuracy
Approximation accuracy depends on two
factors: 1) sampling rate and 2) resolution
In audio processing, sampling is the reduction of
a continuous signal to a discrete signal
Sampling rate is the number of samples per unit
of time
Resolution is the size of a sample (e.g., the
number of bits)
Sampling Rate & Resolution
Sampling rate is measured in Hertz (Hz)
For a sampling rate, 1Hz means one sample per second
For example, if the audio is sampled at a
rate of 44,100 samples per second, then its
sampling rate is 44,100Hz
Typical resolutions (sample lengths) are 8
bits, 16 bits, and 32 bits
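Sampling rate and resolution together determine how much storage uncompressed audio needs. The sketch below is only an illustration: the class PcmSize and the method bytesNeeded are hypothetical names, not part of the course code, and the channel count is an extra parameter the slides do not discuss.

```java
// Hypothetical helper for illustration; not part of the course code.
final class PcmSize {
    /** Bytes of uncompressed audio: samples per channel * bytes per sample * channels. */
    static long bytesNeeded(int sampleRateHz, int bitsPerSample, int channels, double seconds) {
        long samplesPerChannel = Math.round(sampleRateHz * seconds);
        return samplesPerChannel * (bitsPerSample / 8) * channels;
    }

    public static void main(String[] args) {
        // One second of mono audio sampled at 44,100Hz with 16-bit resolution.
        System.out.println(bytesNeeded(44_100, 16, 1, 1.0)); // prints 88200
    }
}
```

Doubling either the sampling rate or the resolution doubles the storage, which is the price of a more accurate digital approximation.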
Nyquist-Shannon Sampling Theorem
This theorem states that perfect reconstruction of
a signal is possible if the sampling frequency is
greater than two times the maximum frequency of
the signal being sampled
For example, if a signal has a maximum frequency
of 50Hz, then it can, theoretically, be
reconstructed if sampled at a rate above 100Hz,
which avoids aliasing (different signals becoming
indistinguishable once sampled)
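Aliasing can be seen numerically. The following sketch, with hypothetical names (AliasingDemo, sampleCosine, maxAbsDiff), samples a 70Hz cosine and a 30Hz cosine at 100Hz. Since 70Hz is above half the sampling rate, its samples coincide with those of the 30Hz tone, so the two cannot be told apart after sampling.

```java
// Illustrative sketch; class and method names are assumptions, not course code.
final class AliasingDemo {
    /** Takes n samples of cos(2*pi*f*t) at the given sampling rate. */
    static double[] sampleCosine(double freqHz, double sampleRateHz, int n) {
        double[] out = new double[n];
        for (int k = 0; k < n; k++) {
            out[k] = Math.cos(2.0 * Math.PI * freqHz * k / sampleRateHz);
        }
        return out;
    }

    /** Largest absolute difference between two equally long sample arrays. */
    static double maxAbsDiff(double[] a, double[] b) {
        double max = 0.0;
        for (int k = 0; k < a.length; k++) {
            max = Math.max(max, Math.abs(a[k] - b[k]));
        }
        return max;
    }

    public static void main(String[] args) {
        double[] tone70 = sampleCosine(70.0, 100.0, 50);
        double[] tone30 = sampleCosine(30.0, 100.0, 50);
        // Only floating-point noise separates the two sample sequences.
        System.out.println(maxAbsDiff(tone70, tone30) < 1e-9);
    }
}
```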
Audio File Formats
WAVE (WAV) is often associated with Windows but is
now implemented on other platforms
AIFF is common on Mac OS
AU is common on Unix/Linux
These are similar formats that vary in how they
represent data, pack samples (e.g., little-endian vs. big-
endian), etc.
Some Java examples of how to manipulate WAV files are in
WavFileManip.java: https://github.com/VKEDCO/AudioTrials/blob/master/org.vkedco.nlp.audiotrials/WavFileManip.java
Zero Crossing Rate
What is Zero Crossing Rate (ZCR)?
Zero Crossing Rate (ZCR) is a measure of the
number of times, in a given window of samples,
that the amplitude crosses the horizontal line at 0
ZCR can be used to detect silence vs. non-silence,
voiced vs. unvoiced speech, speaker’s identity,
etc.
ZCR is essentially the count of successive
samples changing algebraic signs
ZCR Source
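public class ZeroCrossingRate {
    // Counts sign changes between successive samples and divides the
    // count by normalizer (for example, the number of samples in the frame).
    public static double computeZCR01(double[] signals, double normalizer)
    {
        long numZC = 0;
        for (int i = 1; i < signals.length; i++) {
            // A sample of exactly 0 is treated as non-negative.
            if ( (signals[i] >= 0 && signals[i-1] < 0) ||
                 (signals[i] < 0 && signals[i-1] >= 0) ) {
                numZC++;
            }
        }
        return numZC/normalizer;
    }
}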
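A small usage sketch may help show what the count looks like on concrete data. The class ZcrDemo below is hypothetical; its zcr method repeats the sign-change test of computeZCR01 so that the example is self-contained, and the choice of normalizer (the number of successive sample pairs) is an assumption.

```java
// Illustrative sketch; ZcrDemo is an assumed name that mirrors computeZCR01.
final class ZcrDemo {
    /** Number of sign changes between successive samples, divided by normalizer. */
    static double zcr(double[] signals, double normalizer) {
        long numZC = 0;
        for (int i = 1; i < signals.length; i++) {
            boolean prevNegative = signals[i - 1] < 0;
            boolean currNegative = signals[i] < 0;
            if (prevNegative != currNegative) {
                numZC++;
            }
        }
        return numZC / normalizer;
    }

    public static void main(String[] args) {
        double[] slow = {1, 2, 3, 2, 1, -1, -2, -3, -2, -1}; // one sign change
        double[] fast = {1, -1, 1, -1, 1, -1, 1, -1, 1, -1}; // nine sign changes
        System.out.println(zcr(slow, slow.length - 1)); // 1 crossing over 9 pairs
        System.out.println(zcr(fast, fast.length - 1)); // prints 1.0
    }
}
```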
ZCR in Voiced vs. Unvoiced Speech
Voiced speech is produced when the vocal cords
vibrate, as when vowels are spoken
Voiced speech is characterized by nearly constant
frequency tones of some duration
Unvoiced speech is produced without vocal cord
vibration, as in consonants such as /s/ and /f/
Unvoiced speech is non-periodic and random-like
because air passes through a narrow constriction of
the vocal tract
ZCR in Voiced vs. Unvoiced Speech
Phonetic theory states that voiced speech
has a smooth air flow through the vocal tract
whereas unvoiced speech has a turbulent air
flow that produces noise
Thus, voiced speech should have a low ZCR
whereas unvoiced speech should have a high
ZCR
Amplitude of Voiced vs. Unvoiced Speech
Amplitude of unvoiced speech tends to be
lower
Amplitude of voiced speech tends to be
higher
Given a frame of digital samples, we can use the
average absolute amplitude as a measure of the
frame’s energy
This can be used to classify frames as
vowels or consonants
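Because samples take both positive and negative values, a plain mean tends to cancel out, so the sketch below assumes that "average amplitude" means the mean of the absolute sample values. AmplitudeDemo and averageAbsAmplitude are hypothetical names used only for illustration.

```java
// Illustrative sketch; names are assumptions, not course code.
final class AmplitudeDemo {
    /** Mean of |sample| over the frame; returns 0 for an empty frame. */
    static double averageAbsAmplitude(double[] frame) {
        if (frame.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (double s : frame) {
            sum += Math.abs(s);
        }
        return sum / frame.length;
    }

    public static void main(String[] args) {
        double[] frame = {0.5, -0.5, 0.5, -0.5};
        // The plain mean of this frame is 0, but its average absolute amplitude is not.
        System.out.println(averageAbsAmplitude(frame)); // prints 0.5
    }
}
```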
ZCR & Amplitude of Voiced & Unvoiced Speech
            ZCR     Amplitude
Voiced      LOW     HIGH
Unvoiced    HIGH    LOW
Detection of Silence & Non-Silence
silence_buffer = [];
non_silence_buffer = [];
buffer = [];
while ( there are still frames left ) {
    Read a specific number of frames into buffer;
    Compute ZCR and average amplitude of buffer;
    if ( ZCR and average amplitude are below specific thresholds ) {
        add the buffer to silence_buffer;
    }
    else {
        add the buffer to non_silence_buffer;
    }
}
Source code is in https://github.com/VKEDCO/AudioTrials/blob/master/org.vkedco.nlp.audiotrials/WavFileManip.java
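The pseudocode above can be sketched in Java roughly as follows. This is not the code in WavFileManip.java: the class SilenceSplitter, its methods, the frame size, and both thresholds are assumptions chosen for illustration, and the samples are taken from an in-memory array rather than read from a file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of the silence/non-silence loop; all names are assumed.
final class SilenceSplitter {
    final List<double[]> silence = new ArrayList<>();
    final List<double[]> nonSilence = new ArrayList<>();

    /** Sign changes per pair of successive samples in the frame. */
    static double zcr(double[] frame) {
        if (frame.length < 2) return 0.0;
        int crossings = 0;
        for (int i = 1; i < frame.length; i++) {
            if ((frame[i] < 0) != (frame[i - 1] < 0)) crossings++;
        }
        return crossings / (double) (frame.length - 1);
    }

    /** Mean of |sample| over the frame. */
    static double averageAbsAmplitude(double[] frame) {
        if (frame.length == 0) return 0.0;
        double sum = 0.0;
        for (double s : frame) sum += Math.abs(s);
        return sum / frame.length;
    }

    /** Cuts samples into frames and files each frame under silence or non-silence. */
    void split(double[] samples, int frameSize, double zcrThreshold, double ampThreshold) {
        if (frameSize <= 0) throw new IllegalArgumentException("frameSize must be positive");
        for (int start = 0; start < samples.length; start += frameSize) {
            int end = Math.min(start + frameSize, samples.length);
            double[] frame = Arrays.copyOfRange(samples, start, end);
            if (zcr(frame) < zcrThreshold && averageAbsAmplitude(frame) < ampThreshold) {
                silence.add(frame);
            } else {
                nonSilence.add(frame);
            }
        }
    }

    public static void main(String[] args) {
        double[] samples = {0, 0, 0, 0, 0.8, -0.8, 0.8, -0.8};
        SilenceSplitter splitter = new SilenceSplitter();
        splitter.split(samples, 4, 0.5, 0.1);
        // prints "1 silent, 1 non-silent"
        System.out.println(splitter.silence.size() + " silent, "
                + splitter.nonSilence.size() + " non-silent");
    }
}
```

In practice the thresholds would have to be tuned to the recording conditions; the values here are arbitrary.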
Dynamic Time Warping
Introduction
Dynamic Time Warping (DTW) is a method to
find an optimal alignment between two time-
dependent sequences (series)
DTW aligns (“warps”) two sequences in a non-
linear way to match each other
DTW has been successfully used in automatic
speech recognition (ASR), bioinformatics
(genetic sequence matching), and video
analysis
Sample Sequences
Sample Alignment
Basic Definitions
There are two sequences:
$X = (x_1, \dots, x_N)$ and $Y = (y_1, \dots, y_M)$
There is a feature space $F$ such that:
$x_i \in F$ and $y_j \in F$, where $1 \le i \le N$, $1 \le j \le M$
There is a local cost measure mapping 2-tuples
of features to non-negative reals:
$c: F \times F \to \mathbb{R}_{\ge 0}$
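Cost Matrix DTW(N, M)
[Figure: N x M cost matrix; X is indexed 1, 2, ..., i, ..., N along the horizontal axis and Y is indexed 1, 2, ..., j, ..., M along the vertical axis]
$dtw(i, j)$ is the cost of warping X[1:i] with Y[1:j]
X and Y are sequences X[1:N] and Y[1:M]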
Warping Path
$P = (p_1, \dots, p_L)$, where $p_j = (n_j, m_j) \in [1, N] \times [1, M]$ and
$j \in [1, L]$, is a warping path if
1) $p_1 = (1, 1)$ and $p_L = (N, M)$
2) $n_1 \le n_2 \le \dots \le n_L$ and $m_1 \le m_2 \le \dots \le m_L$
3) $p_{l+1} - p_l \in \{(1, 0), (0, 1), (1, 1)\}$, $1 \le l \le L - 1$
Valid Warping Path
[Figure: 4 x 5 grid (columns 1 to 4, rows 1 to 5) with the path cells $p_1, \dots, p_6$ marked]
$P = (p_1, p_2, p_3, p_4, p_5, p_6)$, where
$p_1 = (1, 1)$, $p_2 = (1, 2)$, $p_3 = (2, 3)$, $p_4 = (2, 4)$, $p_5 = (3, 5)$, $p_6 = (4, 5)$
Invalid Warping Path
[Figure: 4 x 5 grid (columns 1 to 4, rows 1 to 5) with the path cells $p_1, \dots, p_6$ marked]
$p_1 \ne (1, 1)$, so constraint 1 is not satisfied
Invalid Warping Path
[Figure: 4 x 5 grid (columns 1 to 4, rows 1 to 5) with the path cells $p_1, \dots, p_6$ marked]
$p_3 = (3, 3)$, $p_4 = (2, 4)$, and $3 > 2$, so the 2nd constraint is not satisfied
Invalid Warping Path
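[Figure: 4 x 5 grid (columns 1 to 4, rows 1 to 5) with the path cells $p_1, \dots, p_5$ marked]
$p_2 = (2, 2)$, $p_3 = (3, 4)$, $p_3 - p_2 = (3, 4) - (2, 2) = (1, 2) \notin \{(1, 0), (0, 1), (1, 1)\}$, so the 3rd condition is not satisfied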
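The three warping path conditions can be checked mechanically. The sketch below uses hypothetical names (WarpingPathCheck, isWarpingPath) and represents a path as an array of 1-based index pairs. Condition 2 is not tested separately because every step allowed by condition 3 is non-decreasing in both coordinates.

```java
// Illustrative sketch; class and method names are assumptions.
final class WarpingPathCheck {
    /** path[l] = {n, m}: 1-based indices into X (length N) and Y (length M). */
    static boolean isWarpingPath(int[][] path, int N, int M) {
        int L = path.length;
        if (L == 0) return false;
        // Condition 1: the path starts at (1, 1) and ends at (N, M).
        if (path[0][0] != 1 || path[0][1] != 1) return false;
        if (path[L - 1][0] != N || path[L - 1][1] != M) return false;
        for (int l = 1; l < L; l++) {
            int dn = path[l][0] - path[l - 1][0];
            int dm = path[l][1] - path[l - 1][1];
            // Condition 3: each step is (1, 0), (0, 1), or (1, 1).
            boolean allowed = (dn == 1 && dm == 0) || (dn == 0 && dm == 1) || (dn == 1 && dm == 1);
            if (!allowed) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        int[][] valid = {{1, 1}, {1, 2}, {2, 3}, {2, 4}, {3, 5}, {4, 5}};
        int[][] badStep = {{1, 1}, {2, 2}, {3, 4}, {4, 5}}; // contains the step (1, 2)
        System.out.println(isWarpingPath(valid, 4, 5));   // prints true
        System.out.println(isWarpingPath(badStep, 4, 5)); // prints false
    }
}
```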
Total Cost of a Warping Path
If $P = (p_1, \dots, p_L)$ is a warping path between sequences X
and Y, then its total cost is
$c_p(X, Y) = \sum_{j=1}^{L} c(x_{n_j}, y_{m_j})$
Example
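[Figure: 4 x 5 grid with X indexed 1 to 4 along the horizontal axis and Y indexed 1 to 5 along the vertical axis; the path cells $p_1, \dots, p_6$ are marked]
Assume that $P = (p_1, p_2, p_3, p_4, p_5, p_6)$, where $p_1 = (1, 1)$, $p_2 = (1, 2)$, $p_3 = (2, 3)$, $p_4 = (2, 4)$, $p_5 = (3, 5)$, $p_6 = (4, 5)$,
is a warping path between X[1:4] and Y[1:5].
Then the total cost of P is
$c(x_1, y_1) + c(x_1, y_2) + c(x_2, y_3) + c(x_2, y_4) + c(x_3, y_5) + c(x_4, y_5)$.
The notation $c(x_i, y_j)$ can be simplified
to read $c(i, j)$ or $c(X[i], Y[j])$.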
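The total cost is a plain sum over the cells of the path. In the sketch below, the class PathCost, the sample strings, and the 0/1 local cost (0 for equal features, 1 otherwise) are assumptions made for illustration; any non-negative local cost measure could be substituted.

```java
// Illustrative sketch; names, sequences, and the 0/1 cost are assumptions.
final class PathCost {
    /** A simple local cost: 0 if the two features are equal, 1 otherwise. */
    static int localCost(char x, char y) {
        return x == y ? 0 : 1;
    }

    /** Sums c(x_n, y_m) over all path cells (n, m), given as 1-based index pairs. */
    static int totalCost(String x, String y, int[][] path) {
        int total = 0;
        for (int[] p : path) {
            total += localCost(x.charAt(p[0] - 1), y.charAt(p[1] - 1));
        }
        return total;
    }

    public static void main(String[] args) {
        int[][] path = {{1, 1}, {1, 2}, {2, 3}, {2, 4}, {3, 5}, {4, 5}};
        // Only the cell (3, 5) pairs unequal features ('c' with 'd').
        System.out.println(totalCost("abcd", "aabbd", path)); // prints 1
    }
}
```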
DTW(X, Y) – Cost of an Optimal Warping Path
$DTW(X, Y) = \min \{ c_p(X, Y) \mid p \text{ is a warping path} \}$
Remarks on DTW(X, Y)
There may be several warping paths of the
same warping cost, i.e., DTW(X, Y)
DTW(X, Y) is symmetric whenever the local
cost measure is symmetric, i.e., DTW(X, Y) =
DTW(Y, X)
DTW(X, Y) does not necessarily satisfy the
triangle inequality (the sum of the lengths of
two sides is greater than the length of the
remaining side)
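DTW Equations: Base Cases
[Figure: N x M cost matrix; X is indexed 1, 2, ..., i, ..., N along the horizontal axis and Y is indexed 1, 2, ..., j, ..., M along the vertical axis]
Initial condition: $dtw(1, 1) = c(1, 1)$
1st Row: $dtw(i, 1) = dtw(i - 1, 1) + c(i, 1)$
1st Column: $dtw(1, j) = dtw(1, j - 1) + c(1, j)$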
DTW Equations: Recursion
[Figure: N x M cost matrix; X is indexed 1, 2, ..., i, ..., N along the horizontal axis and Y is indexed 1, 2, ..., j, ..., M along the vertical axis]
Inner Cell: $dtw(i, j) = \min \{ dtw(i - 1, j), dtw(i - 1, j - 1), dtw(i, j - 1) \} + c(i, j)$
Interpretation: Cost of warping X[1:i] with Y[1:j] is
the cost of warping X[i] with Y[j] plus the minimum of
the following three costs: 1) the cost of warping X[1:i-1]
with Y[1:j]; 2) the cost of warping X[1:i-1] with Y[1:j-1];
3) the cost of warping X[1:i] with Y[1:j-1]
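The base cases and the recursion translate directly into a table-filling loop. The following is a minimal sketch under stated assumptions: sequences are strings of single-character features, the local cost is 0 for equal characters and 1 otherwise, and the names Dtw, costMatrix, and distance are hypothetical.

```java
// Illustrative sketch of the DTW equations; names and the 0/1 cost are assumptions.
final class Dtw {
    static int cost(char x, char y) {
        return x == y ? 0 : 1;
    }

    /** Fills the cost matrix; entry [i][j] holds dtw(i + 1, j + 1) in the 1-based notation. */
    static int[][] costMatrix(String x, String y) {
        if (x.isEmpty() || y.isEmpty()) {
            throw new IllegalArgumentException("sequences must be non-empty");
        }
        int n = x.length(), m = y.length();
        int[][] dtw = new int[n][m];
        dtw[0][0] = cost(x.charAt(0), y.charAt(0));           // initial condition
        for (int i = 1; i < n; i++) {                         // 1st row: j = 1
            dtw[i][0] = dtw[i - 1][0] + cost(x.charAt(i), y.charAt(0));
        }
        for (int j = 1; j < m; j++) {                         // 1st column: i = 1
            dtw[0][j] = dtw[0][j - 1] + cost(x.charAt(0), y.charAt(j));
        }
        for (int i = 1; i < n; i++) {                         // inner cells
            for (int j = 1; j < m; j++) {
                int best = Math.min(dtw[i - 1][j], Math.min(dtw[i - 1][j - 1], dtw[i][j - 1]));
                dtw[i][j] = best + cost(x.charAt(i), y.charAt(j));
            }
        }
        return dtw;
    }

    /** DTW(X, Y) is the value in the last cell, dtw(N, M). */
    static int distance(String x, String y) {
        int[][] dtw = costMatrix(x, y);
        return dtw[x.length() - 1][y.length() - 1];
    }

    public static void main(String[] args) {
        System.out.println(distance("abg", "abbg")); // prints 0
    }
}
```

Storing the whole matrix keeps the code close to the equations and leaves the table available for recovering a path afterwards.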
DTW(X, Y) Examples
Sample Feature Space & Sequences
Let the sequences be:
$X = (a, b, g)$, $Y = (a, b, b, g)$, $Z = (a, g, g)$
Let the feature space be $F = \{a, b, g\}$.
Let the local cost measure be
defined as follows:
$c(x, y) = 0$ if $x = y$; $c(x, y) = 1$ if $x \ne y$
Let us compute DTW(X, Y), DTW(Y, Z), and DTW(X, Z).
Work it out on paper.
DTW(X, Y) = DTW((a, b, g), (a, b, b, g))
Example: DTW(1,1)
[Figure: cost matrix, X = (a, b, g) horizontal, Y = (a, b, b, g) vertical, with cell (1, 1) filled in]
$dtw(1, 1) = c(a, a) = 0$
Example: DTW(2,1)
[Figure: cost matrix, X = (a, b, g) horizontal, Y = (a, b, b, g) vertical, with cell (2, 1) filled in]
$dtw(2, 1) = c(2, 1) + dtw(1, 1) = c(b, a) + dtw(1, 1) = 1 + 0 = 1$
Example: DTW(3,1)
[Figure: cost matrix, X = (a, b, g) horizontal, Y = (a, b, b, g) vertical, with cell (3, 1) filled in]
$dtw(3, 1) = c(3, 1) + dtw(2, 1) = c(g, a) + dtw(2, 1) = 1 + 1 = 2$
Example: DTW(1,2)
[Figure: cost matrix, X = (a, b, g) horizontal, Y = (a, b, b, g) vertical, with cell (1, 2) filled in]
$dtw(1, 2) = c(1, 2) + dtw(1, 1) = c(a, b) + dtw(1, 1) = 1 + 0 = 1$
Example: DTW(1,3)
[Figure: cost matrix, X = (a, b, g) horizontal, Y = (a, b, b, g) vertical, with cell (1, 3) filled in]
$dtw(1, 3) = c(1, 3) + dtw(1, 2) = c(a, b) + dtw(1, 2) = 1 + 1 = 2$
Example: DTW(1,4)
[Figure: cost matrix, X = (a, b, g) horizontal, Y = (a, b, b, g) vertical, with cell (1, 4) filled in]
$dtw(1, 4) = c(1, 4) + dtw(1, 3) = c(a, g) + dtw(1, 3) = 1 + 2 = 3$
Example: DTW(2,2)
[Figure: cost matrix, X = (a, b, g) horizontal, Y = (a, b, b, g) vertical, with cell (2, 2) filled in]
$dtw(2, 2) = c(2, 2) + \min \{ dtw(1, 2), dtw(1, 1), dtw(2, 1) \} = c(b, b) + \min \{1, 0, 1\} = 0 + 0 = 0$
Example: DTW(3,2)
[Figure: cost matrix, X = (a, b, g) horizontal, Y = (a, b, b, g) vertical, with cell (3, 2) filled in]
$dtw(3, 2) = c(3, 2) + \min \{ dtw(2, 2), dtw(2, 1), dtw(3, 1) \} = c(g, b) + \min \{0, 1, 2\} = 1 + 0 = 1$
Example: DTW(2,3)
[Figure: cost matrix, X = (a, b, g) horizontal, Y = (a, b, b, g) vertical, with cell (2, 3) filled in]
$dtw(2, 3) = c(2, 3) + \min \{ dtw(1, 3), dtw(1, 2), dtw(2, 2) \} = c(b, b) + \min \{2, 1, 0\} = 0 + 0 = 0$
Example: DTW(3,3)
[Figure: cost matrix, X = (a, b, g) horizontal, Y = (a, b, b, g) vertical, with cell (3, 3) filled in]
$dtw(3, 3) = c(3, 3) + \min \{ dtw(2, 3), dtw(2, 2), dtw(3, 2) \} = c(g, b) + \min \{0, 0, 1\} = 1 + 0 = 1$
Example: DTW(2,4)
[Figure: cost matrix, X = (a, b, g) horizontal, Y = (a, b, b, g) vertical, with cell (2, 4) filled in]
$dtw(2, 4) = c(2, 4) + \min \{ dtw(1, 4), dtw(1, 3), dtw(2, 3) \} = c(b, g) + \min \{3, 2, 0\} = 1 + 0 = 1$
Example: DTW(3,4)
[Figure: cost matrix, X = (a, b, g) horizontal, Y = (a, b, b, g) vertical, with cell (3, 4) filled in]
$dtw(3, 4) = c(3, 4) + \min \{ dtw(2, 4), dtw(2, 3), dtw(3, 3) \} = c(g, g) + \min \{1, 0, 1\} = 0 + 0 = 0$
So DTW(X, Y) = 0
Example: DTW(3,4)
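[Figure: completed cost matrix, X = (a, b, g) horizontal, Y = (a, b, b, g) vertical, with red arrows marking the optimal warping path]
             i=1 (a)   i=2 (b)   i=3 (g)
j=4 (g)         3         1         0
j=3 (b)         2         0         1
j=2 (b)         1         0         1
j=1 (a)         0         1         2
DTW(X, Y) = 0.
Optimal Warping Path (OWP) P can be found by
chasing pointers (red arrows):
P = ((1, 1), (2, 2), (2, 3), (3, 4)).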
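"Chasing pointers" can be sketched as a backward walk from cell (N, M) that always moves to the cheapest of the three predecessor cells. The code below is an illustration under assumptions: the names DtwPath, optimalPath, and format are hypothetical, the local cost is 0 for equal characters and 1 otherwise, and ties are broken in favor of the diagonal step, which is only one of several possible choices.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch; names and the tie-breaking rule are assumptions.
final class DtwPath {
    static int cost(char x, char y) {
        return x == y ? 0 : 1;
    }

    /** Returns one optimal warping path as 1-based (i, j) pairs; inputs must be non-empty. */
    static List<int[]> optimalPath(String x, String y) {
        int n = x.length(), m = y.length();
        int[][] d = new int[n][m];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < m; j++) {
                int c = cost(x.charAt(i), y.charAt(j));
                if (i == 0 && j == 0) d[i][j] = c;
                else if (j == 0) d[i][j] = d[i - 1][j] + c;
                else if (i == 0) d[i][j] = d[i][j - 1] + c;
                else d[i][j] = Math.min(d[i - 1][j], Math.min(d[i - 1][j - 1], d[i][j - 1])) + c;
            }
        }
        List<int[]> path = new ArrayList<>();
        int i = n - 1, j = m - 1;
        path.add(new int[]{i + 1, j + 1});
        while (i > 0 || j > 0) {
            if (i == 0) j--;                 // only a step along Y is possible
            else if (j == 0) i--;            // only a step along X is possible
            else {
                int diag = d[i - 1][j - 1], stepX = d[i - 1][j], stepY = d[i][j - 1];
                if (diag <= stepX && diag <= stepY) { i--; j--; }
                else if (stepX <= stepY) i--;
                else j--;
            }
            path.add(new int[]{i + 1, j + 1});
        }
        Collections.reverse(path);           // the walk ran from (N, M) back to (1, 1)
        return path;
    }

    static String format(List<int[]> path) {
        StringBuilder sb = new StringBuilder();
        for (int[] p : path) sb.append('(').append(p[0]).append(", ").append(p[1]).append(") ");
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(format(optimalPath("abg", "abbg"))); // prints (1, 1) (2, 2) (2, 3) (3, 4)
    }
}
```

When several optimal paths exist, a different tie-breaking rule may return a different path with the same total cost.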
DTW(Y, Z) = DTW((a, b, b, g), (a, g, g))
DTW(1, 1)
[Figure: cost matrix, Y = (a, b, b, g) horizontal, Z = (a, g, g) vertical, with cell (1, 1) filled in]
$dtw(1, 1) = c(a, a) = 0$
DTW(2, 1)
[Figure: cost matrix, Y = (a, b, b, g) horizontal, Z = (a, g, g) vertical, with cell (2, 1) filled in]
$dtw(2, 1) = c(b, a) + dtw(1, 1) = 1 + 0 = 1$
DTW(3, 1)
[Figure: cost matrix, Y = (a, b, b, g) horizontal, Z = (a, g, g) vertical, with cell (3, 1) filled in]
$dtw(3, 1) = c(b, a) + dtw(2, 1) = 1 + 1 = 2$
DTW(4, 1)
[Figure: cost matrix, Y = (a, b, b, g) horizontal, Z = (a, g, g) vertical, with cell (4, 1) filled in]
$dtw(4, 1) = c(g, a) + dtw(3, 1) = 1 + 2 = 3$
DTW(1, 2)
[Figure: cost matrix, Y = (a, b, b, g) horizontal, Z = (a, g, g) vertical, with cell (1, 2) filled in]
$dtw(1, 2) = c(a, g) + dtw(1, 1) = 1 + 0 = 1$
DTW(1, 3)
[Figure: cost matrix, Y = (a, b, b, g) horizontal, Z = (a, g, g) vertical, with cell (1, 3) filled in]
$dtw(1, 3) = c(a, g) + dtw(1, 2) = 1 + 1 = 2$
DTW(2, 2)
[Figure: cost matrix, Y = (a, b, b, g) horizontal, Z = (a, g, g) vertical, with cell (2, 2) filled in]
$dtw(2, 2) = c(b, g) + \min \{ dtw(1, 2), dtw(1, 1), dtw(2, 1) \} = 1 + \min \{1, 0, 1\} = 1 + 0 = 1$
DTW(3, 2)
[Figure: cost matrix, Y = (a, b, b, g) horizontal, Z = (a, g, g) vertical, with cell (3, 2) filled in]
$dtw(3, 2) = c(b, g) + \min \{ dtw(2, 2), dtw(2, 1), dtw(3, 1) \} = 1 + \min \{1, 1, 2\} = 1 + 1 = 2$
DTW(4, 2)
[Figure: cost matrix, Y = (a, b, b, g) horizontal, Z = (a, g, g) vertical, with cell (4, 2) filled in]
$dtw(4, 2) = c(g, g) + \min \{ dtw(3, 2), dtw(3, 1), dtw(4, 1) \} = 0 + \min \{2, 2, 3\} = 0 + 2 = 2$
DTW(2, 3)
[Figure: cost matrix, Y = (a, b, b, g) horizontal, Z = (a, g, g) vertical, with cell (2, 3) filled in]
$dtw(2, 3) = c(b, g) + \min \{ dtw(1, 3), dtw(1, 2), dtw(2, 2) \} = 1 + \min \{2, 1, 1\} = 1 + 1 = 2$
DTW(3, 3)
[Figure: cost matrix, Y = (a, b, b, g) horizontal, Z = (a, g, g) vertical, with cell (3, 3) filled in]
$dtw(3, 3) = c(b, g) + \min \{ dtw(2, 3), dtw(2, 2), dtw(3, 2) \} = 1 + \min \{2, 1, 2\} = 1 + 1 = 2$
DTW(4, 3)
[Figure: cost matrix, Y = (a, b, b, g) horizontal, Z = (a, g, g) vertical, with cell (4, 3) filled in]
$dtw(4, 3) = c(g, g) + \min \{ dtw(3, 3), dtw(3, 2), dtw(4, 2) \} = 0 + \min \{2, 2, 2\} = 0 + 2 = 2$
DTW(Y, Z)
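[Figure: completed cost matrix, Y = (a, b, b, g) horizontal, Z = (a, g, g) vertical, with red arrows marking the optimal warping path]
             i=1 (a)   i=2 (b)   i=3 (b)   i=4 (g)
j=3 (g)         2         2         2         2
j=2 (g)         1         1         2         2
j=1 (a)         0         1         2         3
DTW(Y, Z) = 2.
Optimal Warping Path (OWP) P
can be found by chasing pointers
(red arrows): P = ((1, 1), (2, 2), (3, 2), (4, 3)).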
DTW(X, Z) = DTW((a, b, g), (a, g, g))
DTW(1, 1)
[Figure: cost matrix, X = (a, b, g) horizontal, Z = (a, g, g) vertical, with cell (1, 1) filled in]
$dtw(1, 1) = c(a, a) = 0$
DTW(2, 1)
[Figure: cost matrix, X = (a, b, g) horizontal, Z = (a, g, g) vertical, with cell (2, 1) filled in]
$dtw(2, 1) = c(b, a) + dtw(1, 1) = 1 + 0 = 1$
DTW(3, 1)
[Figure: cost matrix, X = (a, b, g) horizontal, Z = (a, g, g) vertical, with cell (3, 1) filled in]
$dtw(3, 1) = c(g, a) + dtw(2, 1) = 1 + 1 = 2$
DTW(1, 2)
[Figure: cost matrix, X = (a, b, g) horizontal, Z = (a, g, g) vertical, with cell (1, 2) filled in]
$dtw(1, 2) = c(a, g) + dtw(1, 1) = 1 + 0 = 1$
DTW(1, 3)
[Figure: cost matrix, X = (a, b, g) horizontal, Z = (a, g, g) vertical, with cell (1, 3) filled in]
$dtw(1, 3) = c(a, g) + dtw(1, 2) = 1 + 1 = 2$
DTW(2, 2)
[Figure: cost matrix, X = (a, b, g) horizontal, Z = (a, g, g) vertical, with cell (2, 2) filled in]
$dtw(2, 2) = c(b, g) + \min \{ dtw(1, 2), dtw(1, 1), dtw(2, 1) \} = 1 + \min \{1, 0, 1\} = 1 + 0 = 1$
DTW(3, 2)
[Figure: cost matrix, X = (a, b, g) horizontal, Z = (a, g, g) vertical, with cell (3, 2) filled in]
$dtw(3, 2) = c(g, g) + \min \{ dtw(2, 2), dtw(2, 1), dtw(3, 1) \} = 0 + \min \{1, 1, 2\} = 0 + 1 = 1$
DTW(2, 3)
[Figure: cost matrix, X = (a, b, g) horizontal, Z = (a, g, g) vertical, with cell (2, 3) filled in]
$dtw(2, 3) = c(b, g) + \min \{ dtw(1, 3), dtw(1, 2), dtw(2, 2) \} = 1 + \min \{2, 1, 1\} = 1 + 1 = 2$
DTW(3, 3)
[Figure: cost matrix, X = (a, b, g) horizontal, Z = (a, g, g) vertical, with cell (3, 3) filled in]
$dtw(3, 3) = c(g, g) + \min \{ dtw(2, 3), dtw(2, 2), dtw(3, 2) \} = 0 + \min \{2, 1, 1\} = 0 + 1 = 1$
DTW(X, Z)
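[Figure: completed cost matrix, X = (a, b, g) horizontal, Z = (a, g, g) vertical, with red arrows marking the optimal warping path]
             i=1 (a)   i=2 (b)   i=3 (g)
j=3 (g)         2         2         1
j=2 (g)         1         1         1
j=1 (a)         0         1         2
DTW(X, Z) = 1.
Optimal Warping Path (OWP)
P can be found by chasing
pointers (red arrows): P =
((1, 1), (2, 2), (3, 3)).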
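The three worked examples also illustrate the earlier remark that DTW does not necessarily satisfy the triangle inequality. Because the 0/1 local cost is symmetric, DTW(Y, X) = DTW(X, Y) = 0, and going from Y to Z through X would suggest a bound of 1, yet the direct cost is 2:

```latex
\[
DTW(Y, Z) = 2 \;>\; DTW(Y, X) + DTW(X, Z) = 0 + 1 = 1
\]
```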
DTW Optimizations
Window Optimization
The computation of DTW can be optimized so that only the
cells within a specific window are considered
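One common way to define such a window is a band around the diagonal: only cells with |i - j| <= w are computed and all other cells are treated as unreachable. The sketch below assumes this band shape, a 0/1 local cost on characters, and non-empty inputs; the names WindowedDtw and distance are hypothetical. The band is widened to at least |N - M| so that cell (N, M) stays reachable.

```java
import java.util.Arrays;

// Illustrative sketch of a diagonal-band window; names and band shape are assumptions.
final class WindowedDtw {
    static final int INF = Integer.MAX_VALUE / 2; // marks cells outside the window

    static int cost(char x, char y) {
        return x == y ? 0 : 1;
    }

    /** DTW cost restricted to cells with |i - j| <= w (1-based i, j); inputs must be non-empty. */
    static int distance(String x, String y, int w) {
        int n = x.length(), m = y.length();
        w = Math.max(w, Math.abs(n - m)); // keep (N, M) inside the band
        // Padded matrix: row 0 and column 0 are borders, d[0][0] = 0 seeds dtw(1, 1).
        int[][] d = new int[n + 1][m + 1];
        for (int[] row : d) {
            Arrays.fill(row, INF);
        }
        d[0][0] = 0;
        for (int i = 1; i <= n; i++) {
            int lo = Math.max(1, i - w);
            int hi = Math.min(m, i + w);
            for (int j = lo; j <= hi; j++) {
                int best = Math.min(d[i - 1][j], Math.min(d[i - 1][j - 1], d[i][j - 1]));
                d[i][j] = best + cost(x.charAt(i - 1), y.charAt(j - 1));
            }
        }
        return d[n][m];
    }

    public static void main(String[] args) {
        System.out.println(distance("abg", "abbg", 1)); // prints 0
    }
}
```

A narrow window reduces the number of computed cells but can exclude the unconstrained optimal path, in which case the returned cost is an upper bound on DTW(X, Y).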
Smaller Matrix Optimization
You may have realized by now that if we care
only about the total cost of warping sequence X
with sequence Y, we do not need to compute
the entire N x M cost matrix – we need only two
columns
The storage savings are huge, but the running
time remains the same – O(N x M)
We can also normalize the DTW cost by N x M
to keep it low
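The two-column idea can be sketched as follows: each column of the cost matrix depends only on the column before it, so two arrays of length M suffice. The names TwoColumnDtw, distance, and normalized are hypothetical, the local cost is 0 for equal characters and 1 otherwise, and inputs are assumed to be non-empty.

```java
// Illustrative sketch of the two-column optimization; names are assumptions.
final class TwoColumnDtw {
    static int cost(char x, char y) {
        return x == y ? 0 : 1;
    }

    /** DTW cost using O(M) storage; the running time is still O(N x M). */
    static int distance(String x, String y) {
        int n = x.length(), m = y.length();
        int[] prev = new int[m]; // column i - 1 of the cost matrix
        int[] curr = new int[m]; // column i of the cost matrix
        // First column (i = 1): initial condition plus the 1st column equation.
        prev[0] = cost(x.charAt(0), y.charAt(0));
        for (int j = 1; j < m; j++) {
            prev[j] = prev[j - 1] + cost(x.charAt(0), y.charAt(j));
        }
        for (int i = 1; i < n; i++) {
            curr[0] = prev[0] + cost(x.charAt(i), y.charAt(0)); // 1st row equation
            for (int j = 1; j < m; j++) {
                int best = Math.min(prev[j], Math.min(prev[j - 1], curr[j - 1]));
                curr[j] = best + cost(x.charAt(i), y.charAt(j));
            }
            int[] tmp = prev; // swap the columns for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[m - 1];
    }

    /** DTW cost divided by N x M. */
    static double normalized(String x, String y) {
        return distance(x, y) / (double) (x.length() * y.length());
    }

    public static void main(String[] args) {
        System.out.println(distance("abbg", "agg")); // prints 2
    }
}
```

The trade-off is that the optimal warping path can no longer be recovered, since the earlier columns are overwritten.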
References
M. Müller. “Information Retrieval for Music and
Motion,” Ch. 4. Springer, ISBN 978-3-540-74047-6
R. G. Bachu, et al. “Separation of Voiced and
Unvoiced using Zero Crossing Rate and Energy of
the Speech Signal.” American Society for Engineering
Education (ASEE) Zone Conference Proceedings. 2008.