
Real-Time Prediction of Removable Breaks in Human Speech

Mike Hoffmann

ECE 557: Engineering Data Analysis & Modeling

Portland State University

Outline

• Introduction - Why Predict Breaks in Speech?

• Modeling Speech Parameters

• Application of Models

• Final Algorithm

• Testing and Results

• Conclusions

Speech Classification

• Talking
  – Carries the majority of communicated information

• Not talking
  – Does contain information
    • Alertness, intent, etc.
  – That information depends only on the length of the silence

Why Predict Breaks in Speech?

• Silence makes up between 60 and 80% of every conversation
  – Listening, pauses for breath, pauses for thought

• Breaks in speech contain no sound information
  – If they could be predicted, they could be removed

• Transmitting useless data is bad for bandwidth

Sample Conversation

[Figure: voice data of a sample conversation, two panels. First panel title: Voice Data of Conversation Speaker 1. Axes: Time (seconds), 0 to 55, vs. Amplitude, -1 to 1.]

The Goal

• Remove a large percentage of silence in speech by predicting when breaks will occur

• Develop a real-time algorithm to control when to stop and resume transmission based on predictions

• Retain “good” sound quality

• Zero loss of comprehension by the listener

Constraining Factors

• Removing all silence in speech would noticeably decrease the sound quality
  – Breaks can be as short as 20 ms
  – Excessive starting and stopping of transmission is very noticeable

• Turning off transmission while the person is talking would result in terrible sound quality

Overcoming the Obstacles

• Define “removable breaks” as those of sufficient length that their removal would not degrade the overall sound quality

• Define a new goal of trying to predict removable breaks
  – Removing only the long breaks will still result in a large decrease in total transmitted information

Defining a Removable Break

[Figure: Break Time Data for Random Speaker. Axes: Index of Break (location in statement), 0 to 200, vs. Length of Break (milliseconds), 0 to 1000.]

• Noticeable threshold at a break length of 200 ms
  – Breaks longer than 200 ms tend to be a great deal longer than 200 ms

Choosing Parameters

• What can be used to predict removable breaks in speech?
  – The obvious two:
    • How long the person has been talking
    • How long the person has been silent

• Think of speech as a progression
  – Speaking, silence, speaking, silence, …
  – Each occurrence has a corresponding length in time


Modeling the Distributions

• Ten one-minute statements from different speakers were sampled and analyzed

• The goal is to model the likelihood of a removable break’s occurrence as a function of talk time and silent time

• First, model the likelihood as a function of the individual parameters

Modeling the Distributions (Talk Time)

[Figure: Empirical Distribution Function & MMSE Model Distribution Function. Axes: Talk Time So Far (seconds), 0 to 15, vs. Pr(Real Break), 0 to 1. Legend: EDF, MMSE Model.]

• Exponentially distributed
  – The λ parameter is chosen to minimize the mean-squared error
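The fit described on this slide can be sketched as follows: build the empirical distribution function (EDF) from observed talk times and search for the λ whose exponential CDF, 1 - exp(-λt), has the smallest mean-squared error against it. This is a minimal sketch, not the author's code. The function names, the search bracket, and the golden-section search are all assumptions, and the search presumes a single minimum inside the bracket.

```python
import math

def empirical_cdf(samples):
    """Return the sorted samples and the EDF value i/n at each one."""
    xs = sorted(samples)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]

def exp_cdf(t, lam):
    """CDF of an exponential distribution with rate lam."""
    return 1.0 - math.exp(-lam * t)

def mse(lam, xs, edf):
    """Mean-squared error between the model CDF and the EDF."""
    return sum((exp_cdf(x, lam) - f) ** 2 for x, f in zip(xs, edf)) / len(xs)

def fit_lambda(samples, lo=1e-3, hi=10.0, iters=100):
    """Golden-section search for the lambda minimizing the MSE.

    The bracket [lo, hi] is an assumed range, not taken from the slides.
    """
    xs, edf = empirical_cdf(samples)
    g = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    for _ in range(iters):
        if mse(c, xs, edf) < mse(d, xs, edf):
            b = d
        else:
            a = c
        c, d = b - g * (b - a), a + g * (b - a)
    return (a + b) / 2
```

Here `samples` would hold talk times in seconds, so the fitted λ has units of 1/seconds.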

Modeling the Distributions (Break Time)

• Uniformly distributed
  – A linear model was used to create a model CDF that minimizes the mean-squared error

[Figure: Empirical Distribution Function and Linear Model Uniform Distribution. Axes: Break Time So Far (milliseconds), 0 to 200, vs. Pr(Real Break), 0 to 1. Legend: EDF, MMSE Model.]
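The linear break-time model on this slide can be sketched as an ordinary least-squares line through the EDF points, which is the straight line with minimum mean-squared error. This is only an illustration: the function names are hypothetical, and clipping the line to [0, 1] so that it behaves as a CDF is an assumption.

```python
def fit_linear_cdf(break_times_ms):
    """Least-squares line through the EDF points; returns (slope, intercept)."""
    xs = sorted(break_times_ms)
    n = len(xs)
    edf = [(i + 1) / n for i in range(n)]
    mean_x = sum(xs) / n
    mean_f = sum(edf) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxf = sum((x - mean_x) * (f - mean_f) for x, f in zip(xs, edf))
    slope = sxf / sxx
    intercept = mean_f - slope * mean_x
    return slope, intercept

def linear_cdf(t_ms, slope, intercept):
    """Model CDF: the fitted line, clipped to the valid range [0, 1]."""
    return min(1.0, max(0.0, slope * t_ms + intercept))
```

For break times spread evenly over 0 to 200 ms, the fitted slope comes out close to 1/200 per millisecond, which is the CDF of a uniform distribution on that range.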

Verifying the Results

• Kolmogorov test, α = 0.05
  – Talk time
    • H0: The data is drawn from the model distribution
    • H1: The data is not drawn from the model distribution
    • Result: do not reject the null hypothesis, p-value = 0.34083
  – Break time
    • H0: The data is drawn from the model distribution
    • H1: The data is not drawn from the model distribution
    • Result: do not reject the null hypothesis, p-value = 0.07
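A one-sample Kolmogorov (Kolmogorov-Smirnov) goodness-of-fit test of the kind reported above can be sketched as follows. The statistic D is the largest gap between the EDF and the model CDF. The p-value here uses the asymptotic Kolmogorov series with a common small-sample correction, which is an assumption, since the slides do not say how the p-values were computed. The function names are hypothetical.

```python
import math

def ks_statistic(samples, cdf):
    """Largest vertical distance between the EDF and the model CDF."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The EDF jumps from i/n to (i+1)/n at x, so check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def ks_pvalue(d, n, terms=100):
    """Approximate p-value from the asymptotic Kolmogorov distribution."""
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:
        return 1.0  # the series converges slowly here; the tail is ~1
    total = 0.0
    for k in range(1, terms + 1):
        total += (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
    return min(1.0, max(0.0, 2.0 * total))

def reject_null(samples, cdf, alpha=0.05):
    """True if the data is judged not to come from the model distribution."""
    d = ks_statistic(samples, cdf)
    return ks_pvalue(d, len(samples)) < alpha
```

With α = 0.05, both reported p-values (0.34083 and 0.07) are above the threshold, so `reject_null` would return False in each case.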

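Modeling the Distributions

• Pr{RB | Tsf, Bsf}: the probability of a removable break (RB) given the talk time so far (Tsf) and the break time so far (Bsf)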

Application of Models

• We now have a good idea of the likelihood of a removable break with respect to the two parameters

• How do we obtain the parameters in real time?
  – Use counters

• If the normalized amplitude A is above 0.06
  – Add 1 to the talk time counter
  – If the counter exceeds 50, reset the break time counter

• If A is below 0.06
  – Add 1 to the break time counter
  – If the counter exceeds fs × 0.2 seconds (the number of samples in 200 ms at sampling rate fs), reset the talk time counter
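The counter rules above can be sketched as a small per-sample update. This is a minimal sketch under stated assumptions: the amplitude is taken as the absolute value of a normalized sample, "the counter" in each rule is read as the counter just incremented, and the class and constant names are hypothetical.

```python
AMPLITUDE_THRESHOLD = 0.06      # from the slide
TALK_CONFIRM_SAMPLES = 50       # from the slide
REMOVABLE_BREAK_SECONDS = 0.2   # from the slide (200 ms)

class ActivityCounters:
    """Tracks talk time and break time in samples."""

    def __init__(self, fs):
        self.fs = fs    # sampling rate in Hz
        self.talk = 0   # talk time counter (samples)
        self.brk = 0    # break time counter (samples)

    def update(self, sample):
        """Process one normalized sample and return (talk, brk)."""
        if abs(sample) > AMPLITUDE_THRESHOLD:
            self.talk += 1
            # Enough talking has accumulated: the break is over.
            if self.talk > TALK_CONFIRM_SAMPLES:
                self.brk = 0
        else:
            self.brk += 1
            # Silence has lasted longer than 200 ms: the talk spurt is over.
            if self.brk > self.fs * REMOVABLE_BREAK_SECONDS:
                self.talk = 0
        return self.talk, self.brk
```

Dividing `talk` by `fs` gives the talk time so far in seconds, and `1000 * brk / fs` gives the break time so far in milliseconds.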

Application of Models

• When do we stop and resume transmission?
  – Break up talk time into intervals, because of the exponential distribution
    • Example: 0-1 seconds, 1-3 seconds, etc.
  – For each interval, pick a threshold break time based on the desired probability of error
    • Example: if the talk time up to that point is between 1 and 3 seconds and the person goes silent for 100 ms, turn off
  – Once off, pick a threshold number of samples such that if the signal stays high for that long, turn on
    • The goal is to turn on only for resumed talking, not for spikes of noise
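The stop and resume rules above can be sketched as a two-state gate. Only the 100 ms threshold for 1-3 seconds of talk time comes from the slide. The other thresholds, the third interval, the 40-sample resume count, and all names are hypothetical placeholders. The counters are also simplified here: any loud sample resets the silence count.

```python
import bisect

TALK_INTERVAL_EDGES_S = [1.0, 3.0]          # intervals: [0,1), [1,3), [3,inf)
BREAK_THRESHOLDS_MS = [150.0, 100.0, 60.0]  # 100 ms is from the slide; rest assumed
RESUME_SAMPLES = 40                         # assumed run length that means "talking"

def break_threshold_ms(talk_time_s):
    """Threshold break time for the interval containing talk_time_s."""
    index = bisect.bisect_right(TALK_INTERVAL_EDGES_S, talk_time_s)
    return BREAK_THRESHOLDS_MS[index]

class TransmissionGate:
    """Decides, sample by sample, whether transmission is on."""

    def __init__(self, fs, amplitude_threshold=0.06):
        self.fs = fs
        self.amp = amplitude_threshold
        self.on = True
        self.talk_samples = 0
        self.silent_samples = 0
        self.high_run = 0

    def step(self, sample):
        loud = abs(sample) > self.amp
        if self.on:
            if loud:
                self.talk_samples += 1
                self.silent_samples = 0
            else:
                self.silent_samples += 1
                silent_ms = 1000.0 * self.silent_samples / self.fs
                talk_s = self.talk_samples / self.fs
                if silent_ms >= break_threshold_ms(talk_s):
                    self.on = False   # predicted removable break
                    self.high_run = 0
        else:
            # Require a sustained run of loud samples, so that a short
            # noise spike does not resume transmission.
            self.high_run = self.high_run + 1 if loud else 0
            if self.high_run >= RESUME_SAMPLES:
                self.on = True
                self.talk_samples = self.high_run
                self.silent_samples = 0
        return self.on
```

Calling `step` once per incoming sample yields a boolean stream; samples for which it returns False would simply not be transmitted.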

Final Algorithm

• The designer chooses an acceptable probability of error, and the thresholds are set accordingly
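One possible way to turn a chosen probability of error into a break-time threshold is sketched below. The slides do not give the actual mapping, so this is an assumption throughout: it uses only the linear break-time model, ignores talk time, and treats the probability of error at break time t as 1 - Pr(real break | t). The function name and default parameters are hypothetical.

```python
def threshold_for_error(p_error, slope=1.0 / 200.0, intercept=0.0, max_ms=200.0):
    """Shortest silence (ms) at which the modeled Pr(real break) reaches 1 - p_error.

    slope and intercept describe an assumed linear model CDF that rises
    from 0 at 0 ms to 1 at 200 ms.
    """
    t = (1.0 - p_error - intercept) / slope
    return min(max_ms, max(0.0, t))
```

Under these assumptions, accepting a higher probability of error gives a shorter threshold, so transmission turns off sooner and more silence is removed.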

Picking a Probability of Error

[Figure: Savings vs Probability of Error with Conceptual Sound Quality. Axes: Probability of Error (Pr(off when no real break occurs)), 0 to 1, vs. Percent of Information Removed, 20 to 65. Legend: Savings, Sound Quality.]

• Data from 10 individual speakers, with no listening time

• The sound quality curve is purely conceptual

Testing and Results

[Figure: two panels, Voice Data of Random Speaker and Voice Data of Random Speaker With Algorithm Applied. Axes: Time (seconds), 0 to 45, vs. Amplitude, -1 to 1.]

• Single speaker: over 40% of the data was removed without a noticeable change in sound quality

Testing and Results

• Conversation data

[Figure: voice data for the two speakers of a sample conversation, two panels. Axes: Time (seconds), 0 to 55, vs. Amplitude, -1 to 1.]

Testing and Results

• Most widely used current method

– If one person is talking and the other is silent, turn off transmission of silent user

– When silent user goes active, resume transmission

– Maximum savings in transmitted information of 50%
  • Reached when one user talks the whole time and the other user never talks
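The 50% ceiling follows directly from the rule: at any instant at most one of the two channels can be switched off. A minimal sketch, assuming each speaker's activity is given as an equal-length, non-empty list of booleans (a hypothetical representation), makes this concrete.

```python
def baseline_savings(active_1, active_2):
    """Fraction of all channel samples not transmitted under the current method.

    A channel is switched off only while its user is silent and the
    other user is talking.
    """
    removed = 0
    for a1, a2 in zip(active_1, active_2):
        if a1 != a2:
            removed += 1  # exactly one user is silent, so one channel is off
    return removed / (2 * len(active_1))
```

When both users are silent, or both are talking, nothing is removed, which is why the savings can never exceed one half.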

Testing and Results

• Conversation data with current method applied

[Figure: two panels, Voice Data of Conversation Speaker 1 with Phone Company Algorithm Applied and Voice Data of Conversation Speaker 2 with Phone Company Algorithm Applied. Axes: Time (seconds), 0 to 55, vs. Amplitude, -1 to 1.]

Testing and Results

• Conversation data with new algorithm applied

[Figure: two panels, Voice Data of Conversation Speaker 1 with Algorithm Applied and Voice Data of Conversation Speaker 2 with Algorithm Applied. Axes: Time (seconds), 0 to 55, vs. Amplitude, -1 to 1.]

Significance of Results

• Sample of ten one-minute conversations

• Current method
  – Average removed information = 47.901%

• New method
  – Average removed information = 73.748%

• It is assumed that the average removed information for both cases is drawn from a normal distribution

Significance of Results

• Comparison of two population means, α = 0.05

  – H0: μsavings-new ≤ μsavings-current

  – H1: μsavings-new > μsavings-current

  – Result: reject the null hypothesis, p-value = 7.0844 × 10^-16

  – Strongly supports the alternative hypothesis
    • Supports the claim that the new algorithm performs better, on average, than the most widely used current method
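The slides do not say which two-sample test produced this p-value. As one common choice, a Welch t statistic for the difference in mean savings can be sketched as follows. The function name is hypothetical, and the per-conversation savings are not reproduced in the slides, so no data is shown.

```python
import math
import statistics

def welch_t(sample_new, sample_current):
    """Welch t statistic and approximate degrees of freedom.

    Each argument is a list of per-conversation savings (percent removed).
    """
    n1, n2 = len(sample_new), len(sample_current)
    m1, m2 = statistics.mean(sample_new), statistics.mean(sample_current)
    v1, v2 = statistics.variance(sample_new), statistics.variance(sample_current)
    se2 = v1 / n1 + v2 / n2  # squared standard error of the difference
    t = (m1 - m2) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    dof = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, dof
```

For the one-sided alternative stated above, a large positive t favors the new method. The p-value would come from the upper tail of a t distribution with `dof` degrees of freedom, which is not computed here.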

Discussion of Results

• Surprises
  – A high probability of error still resulted in good sound quality
    • Predicting removable breaks
    • Algorithm never turns off during speaking
  – The new algorithm depends only on a single speaker
    • Current method depends on both speakers
    • Could possibly be programmed directly into a device such as a wireless phone

Conclusions

• The new algorithm performs significantly better than the most widely used current method

• There is no significant reduction in sound quality

• There is no loss of comprehension by the listener

• Application of the algorithm would result in a large decrease in the number of active users on a communications channel at any given time

Questions?
