Real-Time Prediction of Removable Breaks in Human Speech
Mike Hoffmann
ECE 557: Engineering Data Analysis & Modeling
Portland State University
Outline
• Introduction - Why Predict Breaks in Speech?
• Modeling Speech Parameters
• Application of Models
• Final Algorithm
• Testing and Results
• Conclusions
Speech Classification
• Talking
– Majority of communicated information
• Not Talking
– Does contain information
• Alertness, intent, etc.
– Information depends only on the length of the silence
Why Predict Breaks in Speech?
• Silence makes up between 60 and 80% of every conversation
– Listening, pauses for breath, pauses for thought
• Breaks in speech contain no sound information
– If they could be predicted, they could be removed
• Transmitting useless data wastes bandwidth
Sample Conversation
[Figure: two panels, "Voice Data of Conversation Speaker 1" and "Voice Data of Conversation Speaker 2"; amplitude (-1 to 1) vs. time (0 to 55 seconds)]
The Goal
• Remove a large percentage of silence in speech by predicting when breaks will occur
• Develop a real-time algorithm to control when to stop and resume transmission based on predictions
• Retain “good” sound quality
• Zero loss of comprehension by the listener
Constraining Factors
• Removing all silence in speech would noticeably decrease the sound quality
– Breaks can be as short as 20 ms
– Excessive starting and stopping of transmission is very noticeable
• Turning off transmission while the person is talking would result in terrible sound quality
Overcoming the Obstacles
• Define “removable breaks” as those long enough that their removal would not degrade the overall sound quality
• Define a new goal of trying to predict removable breaks
– Removing only the long breaks will still result in a large decrease in total transmitted information
Defining a Removable Break
[Figure: "Break Time Data for Random Speaker"; length of break (milliseconds, 0 to 1000) vs. index of break (location in statement)]
• Noticeable threshold at 200 ms break length
– All breaks longer than 200 ms tend to be a great deal longer than 200 ms
Choosing Parameters
• What can be used to predict removable breaks in speech?
– The obvious two:
• How long the person has been talking
• How long the person has been silent
• Think of speech as a progression
– Speaking, silence, speaking, silence, …
– Each occurrence with corresponding length in time
Modeling the Distributions
• Ten one-minute statements from different speakers were sampled and analyzed
• The goal is to model the likelihood of a removable break’s occurrence as a function of talk time and silent time
• First, model the likelihood as a function of the individual parameters
Modeling the Distributions (Talk Time)
[Figure: "Empirical Distribution Function & MMSE Model Distribution Function"; Pr(Real Break) (0 to 1) vs. Talk Time So Far (seconds, 0 to 15); curves: EDF, MMSE Model]
• Exponentially distributed
– λ parameter chosen to minimize the mean-squared error
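One way to read the MMSE fit is as a search for the rate λ whose exponential CDF, 1 - exp(-λx), lies closest to the empirical distribution function in the mean-squared sense. The sketch below assumes a simple grid search; the function names, the grid, and the sample talk times are illustrative and are not taken from the talk.

```python
import math

def empirical_cdf(samples):
    """Return the sorted samples and the EDF height (i/n) at each one."""
    xs = sorted(samples)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]

def fit_exponential_mmse(samples, lam_grid):
    """Pick the rate in lam_grid whose CDF 1 - exp(-lam * x) has the
    smallest mean-squared error against the EDF of the samples."""
    xs, edf = empirical_cdf(samples)
    best_lam, best_mse = None, float("inf")
    for lam in lam_grid:
        mse = sum((1.0 - math.exp(-lam * x) - f) ** 2
                  for x, f in zip(xs, edf)) / len(xs)
        if mse < best_mse:
            best_lam, best_mse = lam, mse
    return best_lam, best_mse

# Hypothetical talk-time lengths in seconds (not the data from the talk).
talk_times = [0.4, 0.9, 1.3, 2.1, 2.8, 3.5, 4.9, 6.2, 8.0, 11.5]
grid = [k / 100 for k in range(1, 301)]  # candidate rates 0.01 .. 3.00
lam, mse = fit_exponential_mmse(talk_times, grid)
print(f"lambda = {lam:.2f} per second, MSE = {mse:.5f}")
```

A finer grid or a numerical optimizer would give a more precise λ; the grid is used here only to keep the sketch dependency-free.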
Modeling the Distributions (Break Time)
• Uniformly distributed
– Linear model used to create a model CDF that minimizes the mean-squared error
[Figure: "Empirical Distribution Function and Linear Model Uniform Distribution"; Pr(Real Break) (0 to 1) vs. Break Time So Far (milliseconds, 0 to 200); curves: EDF, MMSE Model]
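A uniform distribution has a straight-line CDF, so the linear model on this slide can be sketched as an ordinary least-squares line through the EDF points. The function names and the break-time values below are assumptions made for illustration.

```python
def fit_linear_cdf(samples):
    """Least-squares line through the EDF points (x_i, i/n).
    Returns (slope, intercept) of the model CDF."""
    xs = sorted(samples)
    n = len(xs)
    edf = [(i + 1) / n for i in range(n)]
    mean_x = sum(xs) / n
    mean_f = sum(edf) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxf = sum((x - mean_x) * (f - mean_f) for x, f in zip(xs, edf))
    slope = sxf / sxx
    intercept = mean_f - slope * mean_x
    return slope, intercept

def linear_cdf(x, slope, intercept):
    """Evaluate the fitted line, clipped to the valid CDF range [0, 1]."""
    return min(1.0, max(0.0, slope * x + intercept))

# Hypothetical break lengths in milliseconds (not the data from the talk).
break_times_ms = [12, 30, 47, 66, 81, 103, 118, 140, 162, 185]
slope, intercept = fit_linear_cdf(break_times_ms)
print(f"model CDF at 100 ms: {linear_cdf(100, slope, intercept):.3f}")
```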
Verifying the Results
• Kolmogorov test, α = 0.05
– Talk time
• H0: The data is drawn from the model distribution
• H1: The data is not drawn from the model distribution
• Result
– Do not reject the null hypothesis, p-value = 0.34083
– Break time
• H0: The data is drawn from the model distribution
• H1: The data is not drawn from the model distribution
• Result
– Do not reject the null hypothesis, p-value = 0.07
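A goodness-of-fit check of this kind can be sketched with the one-sample Kolmogorov statistic D, the largest gap between the EDF and the model CDF. The p-value below uses the asymptotic Kolmogorov series, which is only an approximation for small samples; the talk does not say how its p-values were computed. The data and the model rate are hypothetical.

```python
import math

def ks_statistic(samples, cdf):
    """Largest vertical distance between the EDF and the model CDF."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The EDF jumps from i/n to (i+1)/n at x, so check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def ks_pvalue(d, n, terms=100):
    """Asymptotic p-value: 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 n d^2)."""
    lam = math.sqrt(n) * d
    if lam < 0.1:
        return 1.0  # series is numerically 1 here
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))

# Hypothetical talk times (seconds) and a hypothetical fitted rate.
talk_times = [0.4, 0.9, 1.3, 2.1, 2.8, 3.5, 4.9, 6.2, 8.0, 11.5]
rate = 0.25
d = ks_statistic(talk_times, lambda x: 1.0 - math.exp(-rate * x))
p = ks_pvalue(d, len(talk_times))
print("reject H0" if p < 0.05 else "do not reject H0", f"(D = {d:.3f})")
```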
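Modeling the Distributions
• Pr{RB | Tsf, Bsf}: probability of a removable break (RB) given talk time so far (Tsf) and break time so far (Bsf)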
Application of Models
• We now have a good idea of the likelihood of a removable break with respect to the two parameters
• How do we obtain the parameters in real-time?
– Use counters
• If A (normalized amplitude) is above 0.06
– Add 1 to talk time counter
– If counter exceeds 50, reset break time counter
• If A is below 0.06
– Add 1 to break time counter
– If counter exceeds fs*(0.2 seconds), reset talk time counter
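The counter rules can be sketched as a per-sample update. Two readings are assumed here and are not stated on the slide: the amplitude is compared by magnitude, and "counter" in each reset rule refers to the counter that was just incremented. The function name and the 8 kHz sample rate in the usage lines are hypothetical.

```python
def update_counters(sample, talk, brk, fs,
                    amp_threshold=0.06, talk_reset=50, break_reset_s=0.2):
    """Update the talk-time and break-time counters for one sample.

    sample: normalized amplitude in [-1, 1]
    talk, brk: current counter values, in samples
    fs: sample rate in Hz
    """
    if abs(sample) > amp_threshold:
        talk += 1
        if talk > talk_reset:          # enough speech seen: the break is over
            brk = 0
    else:
        brk += 1
        if brk > fs * break_reset_s:   # silent for more than 200 ms
            talk = 0
    return talk, brk

# Usage: 100 loud samples followed by 0.25 s of silence at 8 kHz.
fs = 8000
talk, brk = 0, 0
for s in [0.3] * 100 + [0.0] * 2000:
    talk, brk = update_counters(s, talk, brk, fs)
print(talk, brk)
```

Dividing a counter by fs converts it to seconds, which is the unit used for the talk-time and break-time models.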
Application of Models
• When do we stop and resume transmission?
– Break up talk time into intervals due to exponential distribution
• Example: 0-1 seconds, 1-3 seconds, etc.
– For each interval, pick a threshold break time based upon desired probability of error
• Example: If the talk time up to that point is between 1 and 3 seconds and the person goes silent for 100 ms, turn off
– Once off, pick a threshold number of samples such that if the signal goes high for a set amount of time, turn on
• Goal is to only turn on for resumed talking, not spikes of noise
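The stop/resume rule can be sketched as a threshold table indexed by talk-time interval plus a two-state gate. Only the "1 to 3 seconds, 100 ms" pair comes from the slide; every other number, the names, and the handling of interval boundaries are hypothetical.

```python
import bisect

# Upper edges of the talk-time intervals (seconds): 0-1 s, 1-3 s, above 3 s.
INTERVAL_EDGES_S = [1.0, 3.0]
# Break-time threshold (ms) per interval. Only the 100 ms entry is from the
# slide example; 150 and 60 are placeholders.
OFF_THRESHOLDS_MS = [150.0, 100.0, 60.0]

def off_threshold_ms(talk_s):
    """Look up the break-time threshold for the current talk time."""
    return OFF_THRESHOLDS_MS[bisect.bisect_right(INTERVAL_EDGES_S, talk_s)]

def gate(transmitting, talk_s, break_ms, loud_run, on_run_samples=40):
    """Return the next transmit state (True = on).

    loud_run: consecutive above-threshold samples seen while off.
    on_run_samples: hypothetical run length needed to resume, chosen so a
    short noise spike does not turn transmission back on.
    """
    if transmitting:
        return break_ms < off_threshold_ms(talk_s)
    return loud_run >= on_run_samples

# Usage: 2 s of talk followed by 100 ms of silence switches transmission off.
print(gate(True, talk_s=2.0, break_ms=100.0, loud_run=0))
```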
Final Algorithm
• The designer chooses an acceptable probability of error; the thresholds are set accordingly
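One possible way to set a threshold from a chosen probability of error is to scan break times and take the first one at which the modeled chance of switching off wrongly, 1 - Pr{RB | Tsf, Bsf}, drops to the target. This is an interpretation, not the procedure stated in the talk. The model function passed in, the 200 ms scan limit, and the toy model in the usage lines are all placeholders.

```python
def pick_threshold_ms(pr_removable, talk_s, p_err, max_ms=200, step_ms=1):
    """Smallest break time (ms) at which switching off is wrong with
    probability at most p_err, or None if none qualifies up to max_ms.

    pr_removable(talk_s, break_ms) is a caller-supplied model of the
    probability that the current silence is a removable break.
    """
    for break_ms in range(0, max_ms + 1, step_ms):
        if 1.0 - pr_removable(talk_s, break_ms) <= p_err:
            return break_ms
    return None

def toy_model(talk_s, break_ms):
    # Placeholder only: a straight-line CDF over 0-200 ms that ignores
    # talk time. A real model would depend on both parameters.
    return min(1.0, break_ms / 200.0)

# Usage: thresholds for a few candidate error probabilities.
for p_err in (0.25, 0.5, 0.75):
    print(p_err, pick_threshold_ms(toy_model, talk_s=2.0, p_err=p_err))
```

A larger acceptable error gives a shorter threshold, so transmission stops sooner and more silence is removed.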
Picking a Probability of Error
[Figure: "Savings vs Probability of Error with Conceptual Sound Quality"; percent of information removed (20 to 65) vs. probability of error, Pr(off when no real break occurs) (0 to 1); curves: Savings, Sound Quality]
• 10 individual speakers, no listening time
• Sound quality plot is purely conceptual
Testing and Results
[Figure: "Voice Data of Random Speaker"; amplitude (-1 to 1) vs. time (0 to 45 seconds)]
[Figure: "Voice Data of Random Speaker With Algorithm Applied"; amplitude (-1 to 1) vs. time (0 to 45 seconds)]
• Single Speaker - Over 40% of data was removed, without noticeable change in sound quality
Testing and Results
• Conversation data
[Figure: two panels, "Voice Data of Conversation Speaker 1" and "Voice Data of Conversation Speaker 2"; amplitude (-1 to 1) vs. time (0 to 55 seconds)]
Testing and Results
• Most widely used current method
– If one person is talking and the other is silent, turn off transmission of silent user
– When silent user goes active, resume transmission
– Maximum savings in transmitted information of 50%
• One user talks the whole time, the other user never talks
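The 50% ceiling follows from the rule itself: at most one of the two channels is ever switched off at a time. A minimal sketch, assuming each speaker is represented by a per-sample list of active/silent flags (a representation chosen here for illustration):

```python
def current_method_savings(active_1, active_2):
    """Fraction of all samples (both directions) that are not sent when a
    silent speaker's channel is switched off while the other one talks."""
    if len(active_1) != len(active_2) or not active_1:
        raise ValueError("need two non-empty activity lists of equal length")
    # A sample is dropped only when exactly one speaker is active.
    dropped = sum(1 for a, b in zip(active_1, active_2) if a != b)
    return dropped / (2 * len(active_1))

# Best case from the slide: one user talks the whole time, the other never.
print(current_method_savings([True] * 10, [False] * 10))
```

When both speakers are silent, this rule removes nothing, which is the silence the new algorithm targets.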
Testing and Results
• Conversation data with current method applied
[Figure: two panels, "Voice Data of Conversation Speaker 1 with Phone Company Algorithm Applied" and "Voice Data of Conversation Speaker 2 with Phone Company Algorithm Applied"; amplitude (-1 to 1) vs. time (0 to 55 seconds)]
Testing and Results
• Conversation data with new algorithm applied
[Figure: two panels, "Voice Data of Conversation Speaker 1 with Algorithm Applied" and "Voice Data of Conversation Speaker 2 with Algorithm Applied"; amplitude (-1 to 1) vs. time (0 to 55 seconds)]
Significance of Results
• Sample of ten one-minute conversations
• Current method
– Average removed information = 47.901%
• New method
– Average removed information = 73.748%
• It is assumed that the average removed information for both cases is drawn from a normal distribution
Significance of Results
• Comparison of two population means, α = 0.05
– H0: μsavings-new ≤ μsavings-current
– H1: μsavings-new > μsavings-current
– Result: Reject the null hypothesis, p-value = 7.0844 × 10^-16
– Strongly supports the alternative hypothesis
• Supports the claim that the new algorithm performs better, on average, than the most widely used current method
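The talk does not say which two-sample test was used. As one possibility, the sketch below computes Welch's unequal-variance t statistic and its degrees of freedom for the one-sided comparison; a large positive t supports H1. The savings values are hypothetical, and the p-value itself would come from a t table or a statistics library.

```python
import math
import statistics

def welch_t(new, current):
    """Welch's t statistic and approximate degrees of freedom for
    H1: mean(new) > mean(current)."""
    se2_new = statistics.variance(new) / len(new)
    se2_cur = statistics.variance(current) / len(current)
    t = (statistics.mean(new) - statistics.mean(current)) / math.sqrt(se2_new + se2_cur)
    df = (se2_new + se2_cur) ** 2 / (
        se2_new ** 2 / (len(new) - 1) + se2_cur ** 2 / (len(current) - 1)
    )
    return t, df

# Hypothetical percent-removed values for ten conversations per method
# (not the measurements from the talk).
new_method = [72.1, 75.3, 70.8, 74.9, 73.0, 76.2, 71.5, 74.4, 73.8, 75.6]
current_method = [46.9, 48.8, 47.2, 49.0, 46.5, 48.1, 47.7, 48.6, 47.0, 49.2]
t, df = welch_t(new_method, current_method)
print(f"t = {t:.2f}, df = {df:.1f}")
```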
Discussion of Results
• Surprises
– High probability of error still resulted in good sound quality
• Predicting removable breaks
• Algorithm never turns off during speaking
– New algorithm is only dependent on single speaker
• Current method is dependent on both speakers
• Could possibly be programmed directly into a device such as a wireless phone
Conclusions
• The new algorithm performs significantly better than the most widely used current method
• There is no significant reduction in sound quality
• There is no loss of comprehension by the listener
• Application of the algorithm would result in a large decrease in the number of active users on a communications channel at any given time
Questions?