Computer Science Department
A Speech / Music Discriminator using RMS and Zero-crossings
Costas Panagiotakis and George Tziritas
Department of Computer Science
University of Crete
Heraklion Greece
Presentation Organization
I. Introduction
II. Segmentation
III. Classification
IV. Results
V. Conclusion
EUSIPCO 2002, Toulouse, France
Introduction (1/3)
Input
Figure 1: Original sound signal (sample rate 44100 or 22050 Hz)
Output
Figure 2: Real-time segmentation and classification (Speech, Music, Silence)
Introduction (2/3)
Approaches
•Feature extraction (energy, frequency)
•Feature-based segmentation and classification
Basic purpose
•Real-time segmentation and classification
•Algorithmic and computational constraints
•Low number of features
•Low change-detection error (20 msec)
•Low minimum distance between two changes (1 sec)
•High accuracy (95 %)
Introduction (3/3)
Basic Features
•Computed every 20 msec
•Independent characteristics
Root Mean Square (RMS): signal energy
$A = \sqrt{\frac{1}{N} \sum_{n} x(n)^2}$
Figure 3: RMS in music
Figure 4: RMS in speech
Zero Crossings (ZC): mean frequency
Figure 5: ZC in music
Figure 6: ZC in speech
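The two basic features can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `frame_features`, the non-overlapping 20 msec framing, and the sign-change rule used to count zero crossings are assumptions made for the example.

```python
import math


def frame_features(samples, sample_rate, frame_sec=0.020):
    """Return (rms_values, zc_values), one entry per non-overlapping frame."""
    n = int(round(sample_rate * frame_sec))  # samples per 20 msec frame
    rms_values, zc_values = [], []
    for start in range(0, len(samples) - n + 1, n):
        frame = samples[start:start + n]
        # RMS: A = sqrt((1/N) * sum_n x(n)^2)
        rms_values.append(math.sqrt(sum(x * x for x in frame) / n))
        # ZC: number of sign changes between consecutive samples (assumed rule)
        zc = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
        zc_values.append(zc)
    return rms_values, zc_values


if __name__ == "__main__":
    sr = 22050
    # One second of a 441 Hz sine tone as a synthetic test signal.
    tone = [math.sin(2 * math.pi * 441 * t / sr) for t in range(sr)]
    rms, zc = frame_features(tone, sr)
    print(len(rms), rms[0], zc[0])
```

For a pure tone, dividing a frame's zero-crossing count by twice the frame duration gives a rough frequency estimate in Hz, which is why ZC can stand in for the mean frequency.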
Segmentation (1/3)
Basic characteristics
•RMS based
•The χ2 distribution fits the RMS histograms well
$p(x) = \frac{x^{a} e^{-x/b}}{b^{a+1} \Gamma(a+1)}, \quad x \ge 0$
$a = \frac{m^2}{s^2} - 1, \quad b = \frac{s^2}{m}$
m : mean, $s^2$ : variance of the RMS values
Figure 7: Histogram of RMS in speech, approximation by χ2 distribution
Figure 8: Histogram of RMS in speech, approximation by χ2 distribution
Two-stage algorithm
•Stage 1: 1 sec accuracy (low computation cost)
•Stage 2: 20 msec accuracy (high computation cost)
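The moment-based fit can be sketched as follows. The sketch assumes the density is read as p(x) = x^a e^(-x/b) / (b^(a+1) Γ(a+1)) with b acting as a scale parameter, which is the reading consistent with a = m²/s² - 1 and b = s²/m. The names `fit_chi2` and `chi2_pdf` are hypothetical.

```python
import math


def fit_chi2(rms_values):
    """Moment fit: a = m^2/s^2 - 1, b = s^2/m (m: mean, s^2: variance)."""
    n = len(rms_values)
    m = sum(rms_values) / n
    s2 = sum((v - m) ** 2 for v in rms_values) / n
    if m <= 0 or s2 <= 0:
        raise ValueError("need a positive mean and a positive variance")
    return m * m / s2 - 1.0, s2 / m


def chi2_pdf(x, a, b):
    """Assumed density x^a e^(-x/b) / (b^(a+1) Gamma(a+1)), computed in log space."""
    if x <= 0:
        return 0.0  # x = 0 is treated as outside the support to avoid log(0)
    log_p = a * math.log(x) - x / b - (a + 1.0) * math.log(b) - math.lgamma(a + 1.0)
    return math.exp(log_p)


if __name__ == "__main__":
    a, b = fit_chi2([0.10, 0.20, 0.30, 0.40, 0.50])
    # Under this density the mean is (a+1)*b and the variance is (a+1)*b^2,
    # so the fit reproduces the sample moments.
    print(a, b, (a + 1) * b, (a + 1) * b * b)
```

Working in log space keeps the evaluation stable when the variance is small and a becomes large, as it tends to for steady signals.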
Segmentation (2/3)
Stage 1
•Partitioning into 1 sec frames (50 RMS values)
•A change in frame i means that frame i-1 and frame i+1 have to differ
•Computation of the frame distance D (Matusita distance) using the frame similarity ρ(p1, p2)
$\rho(p_1, p_2) = \int \sqrt{p_1(x)\, p_2(x)}\, dx$
$D(i) = 1 - \rho(p_{i-1}, p_{i+1})$
•Frame i is a candidate for Stage 2 (there is a change) if D(i) > threshold and D(i) is a local maximum
Figure (diagram): RMS over time, split into 1 sec frames i-1, i, i+1, i+2. The distance is HIGH when there is a change in frame i and LOW otherwise.
Segmentation (3/3)
Stage 2
•20 msec accuracy
•For each candidate frame i from Stage 1:
1. Slide two successive 1 sec frames across the neighborhood of frame i, from before it to after it
2. Find the time instant where the two successive frames have the maximum Matusita distance between their RMS distributions
•Possible oversegmentation
Figure 10: The RMS data and the distance D
Figure 11: The segmentation result and the RMS data
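The two segmentation stages can be sketched together as follows. This is an illustration under stated assumptions rather than the authors' code: each window of RMS values is fitted by moments to the density p(x) = x^a e^(-x/b) / (b^(a+1) Γ(a+1)), the similarity integral is evaluated with the closed form that follows from that density, D = 1 - ρ is used as the distance, Stage 2 is read as sliding the boundary between two adjacent 1 sec windows across the candidate frame, and the threshold value and all function names are hypothetical.

```python
import math

FRAME = 50  # RMS values per 1 sec frame (one value every 20 msec)


def fit(values):
    """Moment fit of the assumed density: a = m^2/s^2 - 1, b = s^2/m."""
    n = len(values)
    m = max(sum(values) / n, 1e-12)  # clamps avoid division by zero
    s2 = max(sum((v - m) ** 2 for v in values) / n, 1e-12)
    return m * m / s2 - 1.0, s2 / m


def similarity(p1, p2):
    """rho(p1, p2) = integral of sqrt(p1(x) p2(x)) dx, closed form in log space."""
    (a1, b1), (a2, b2) = p1, p2
    a = 0.5 * (a1 + a2)
    log_rho = (math.lgamma(a + 1.0)
               + (a + 1.0) * math.log(2.0 * b1 * b2 / (b1 + b2))
               - 0.5 * ((a1 + 1.0) * math.log(b1) + (a2 + 1.0) * math.log(b2)
                        + math.lgamma(a1 + 1.0) + math.lgamma(a2 + 1.0)))
    return min(1.0, math.exp(log_rho))  # guard against rounding above 1


def distance(left, right):
    """D = 1 - rho between the fitted RMS distributions of two windows."""
    return 1.0 - similarity(fit(left), fit(right))


def stage1(rms, threshold):
    """Frames i where D(i) = D(frame i-1, frame i+1) is large and a local maximum."""
    frames = [rms[k:k + FRAME] for k in range(0, len(rms) - FRAME + 1, FRAME)]
    d = {i: distance(frames[i - 1], frames[i + 1])
         for i in range(1, len(frames) - 1)}
    ninf = float("-inf")
    return [i for i in d
            if d[i] > threshold
            and d[i] >= d.get(i - 1, ninf)
            and d[i] >= d.get(i + 1, ninf)]


def stage2(rms, i):
    """Refine candidate frame i: RMS index t where adjacent windows differ most."""
    best_t, best_d = None, float("-inf")
    for t in range(i * FRAME, (i + 1) * FRAME + 1):
        if t - FRAME < 0 or t + FRAME > len(rms):
            continue  # both 1 sec windows must fit inside the signal
        d_t = distance(rms[t - FRAME:t], rms[t:t + FRAME])
        if d_t > best_d:
            best_t, best_d = t, d_t
    return best_t
```

A call such as `[stage2(rms, i) for i in stage1(rms, threshold=0.3)]` returns change positions in units of 20 msec RMS steps. Stage 1 costs one distance per second of audio, while Stage 2 evaluates about 50 distances per candidate, which matches the low and high computation costs noted for the two stages.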
Classification (1/4)
Basic purpose
Classification of each segment into one of the following classes
•Music
•Speech
•Silence
Main Algorithm
•Hypothesis: segmentation gives homogeneous segments
•Input: basic characteristics RMS, ZC
•Computation of the actual features of the segment
•Classification based on the actual feature values
Classification (2/4)
Actual Features specification
•Normalized RMS variance, $\sigma_A^2$
$\sigma_A^2 = \frac{\mathrm{var}(RMS)}{(\mathrm{mean}(RMS))^2}$
Usually (86 %) $\sigma_A^2$(music) < $\sigma_A^2$(speech)
•The probability of null ZC, ZC0
Always ZC0(music) = 0
Usually (40%) ZC0(speech) > 0
•Maximal mean frequency, max(ZC)
Almost always in speech max(ZC) < 2.4 kHz
In 2% of the cases in music max(ZC) > 2.4 kHz
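The three features on this slide can be computed from the per-interval RMS and ZC values of a segment. The sketch below is illustrative only: the function name `actual_features` is hypothetical, ZC values are assumed to be zero-crossing counts per 20 msec interval, and the conversion of the largest count to a frequency (count divided by twice the interval duration) is an assumption of the example.

```python
def actual_features(rms_values, zc_values, frame_sec=0.020):
    """Return (normalized RMS variance, ZC0, max ZC in Hz) for one segment."""
    n = len(rms_values)
    mean_rms = sum(rms_values) / n
    if mean_rms == 0:
        raise ValueError("all-zero segment: run the silence check first")
    var_rms = sum((v - mean_rms) ** 2 for v in rms_values) / n
    # Normalized RMS variance: var(RMS) / mean(RMS)^2
    sigma2_a = var_rms / (mean_rms ** 2)
    # ZC0: fraction of intervals with no zero crossing at all
    zc0 = sum(1 for z in zc_values if z == 0) / len(zc_values)
    # Assumed conversion: two zero crossings per period of a pure tone
    max_zc_hz = max(zc_values) / (2.0 * frame_sec)
    return sigma2_a, zc0, max_zc_hz


if __name__ == "__main__":
    print(actual_features([1.0, 2.0, 3.0], [0, 10, 20]))
```

Normalizing the variance by the squared mean makes the feature independent of the recording level, so a loud and a quiet copy of the same segment get the same value.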
Classification (3/4)
Actual Features specification
•Joint RMS/ZC measure, Cz
Speech: high correlation between RMS and ZC, many void intervals (low RMS and ZC)
Music: RMS and ZC essentially independent
•Void intervals frequency, Fu
Void intervals detection (20 msec):
(RMS < T1) && (RMS < 0.1•max(RMS(i))) && (RMS < T2) || (ZC = 0), with the maximum taken over i ∈ A
Group neighboring silent intervals
Fu: frequency of grouped silent intervals
Always in speech Fu > 0.6
In at least 65% of music Fu < 0.6
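Classification (4/4)
Silence segment recognition
Segment is silence if E < Threshold
$E = 0.7 \, \mathrm{median}_{i \in A}\, RMS(i) + \frac{0.3}{|A|} \sum_{i \in A} RMS(i)$
Decision making algorithm
1. Silence segment check: a positive result labels the segment Silence
2. Otherwise, actual features check: the segment is labeled Speech or Music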
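The decision flow can be sketched as a silence check followed by a feature-based check. The sketch assumes that E combines the median and the mean of the segment's RMS values with weights 0.7 and 0.3. The threshold value is hypothetical, and because the slide does not spell out the speech/music rule, that rule is passed in as a caller-supplied function.

```python
import statistics


def segment_energy(rms_values):
    """E = 0.7 * median(RMS) + 0.3 * mean(RMS) over the segment (assumed reading)."""
    return 0.7 * statistics.median(rms_values) + 0.3 * statistics.mean(rms_values)


def is_silence(rms_values, threshold):
    """A segment is silence when its energy measure E is below the threshold."""
    return segment_energy(rms_values) < threshold


def classify_segment(rms_values, threshold, speech_or_music):
    """Silence check first; otherwise defer to the caller's speech/music rule."""
    if is_silence(rms_values, threshold):
        return "silence"
    return speech_or_music(rms_values)


if __name__ == "__main__":
    # Placeholder rule for the demo only; it is not the rule used on the slides.
    print(classify_segment([0.0, 0.0, 0.0, 1.0], 0.1, lambda rms: "music"))
```

Weighting the median more heavily than the mean keeps a few loud intervals, such as a click inside an otherwise quiet segment, from pushing E above the threshold.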
Results
Data
•11,328 sec of speech
•3,131 sec of music
Data source
•70% audio CDs
•15% WWW
•15% recordings
Segmentation performance
•97% detection probability
•Change accuracy ~ 0.2 sec
Actual features performance
Figures (charts): Accuracy per feature and per feature combination. Axis labels: Features (σ2Α, Cz, ZC0, Fu, combinations of them, All) and Accuracy.
Conclusion
Complexity
•Minimum complexity O(N)
•Low computation cost
Summary
•Real-time segmentation and classification in three classes
•Energy distribution (RMS) suffices for segmentation
•RMS and ZC suffice for classification
•Purpose: minimum cost and high performance
Future extension
•Content-based indexing and retrieval of audio signals
•Pre-processing stage for speech recognition