A Study on the Video Scene Retrieving System
DESCRIPTION
Recently, a variety of video data are being generated, stored, and accessed with advances in computer technology and the Internet. To search for a video, or a video scene, quickly from such data, an efficient and effective technique is needed. I therefore propose a video scene retrieval system based on speech recognition using HMMs (Hidden Markov Models). The proposed system is applied to scene retrieval experiments that evaluate the recognition rate for 457 short words. The experimental results show an average detection accuracy of 68%.

TRANSCRIPT
A Study on the Video Scene Retrieving System
with a Speech Recognizer
May 14, 2013
Yoshika OSAWA
Kohno Lab.
Outline
1. Introduction
2. Aim of Study
3. Composition of System
i. Voice Divide Section
ii. Speech Recognize Section
iii. Scene Retrieve Section
4. Evaluation Experiment
5. Conclusion
1. Introduction
• A variety of video data are being generated, stored, and accessed with advances in the Internet.
• To search for a video scene quickly from such data, an efficient technique is needed.
1. Introduction
• Multimedia Annotations
o Nagao (2001)
1. Introduction
• A Subtitling System for Broadcast Programs with a Speech Recognizer
o Ando et al. (2001)
1. Introduction
• Extract voices from the video.
• Advantages of voice:
o Easy to make into text.
o Simple association with scenes.
→ Apply speech recognition to scene retrieving.
Outline
1. Introduction
2. Aim of Study
3. Composition of System
i. Voice Divide Section
ii. Speech Recognize Section
iii. Scene Retrieve Section
4. Evaluation Experiment
5. Conclusion
2. Aim of Study
Implement a scene retrieving system, then verify its accuracy and check its operation.
Make annotations automatically with speech recognition.
Outline
1. Introduction
2. Aim of Study
3. Composition of System
i. Voice Divide Section
ii. Speech Recognize Section
iii. Scene Retrieve Section
4. Evaluation Experiment
5. Conclusion
3. Composition of System
Start → Select a Video → Voice Divide Section → Speech Recognize Section → Input a Keyword → Scene Retrieve Section → Output the Result → End
i. Voice Divide Section
• Focus on the amplitude.
o Use signals while they exceed the amplitude threshold.
o Reject segments that are too short to recognize.
o Thresholds derived from experiments:

Axis      | Threshold
Amplitude | 10 [%]
Time      | 1000 [ms]
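The amplitude-based segmentation above can be sketched as follows. This is a minimal illustration using the thresholds from the table; the function name and the frame-based envelope are assumptions, not the actual implementation:

```python
import numpy as np

def find_voice_segments(signal, sr, amp_ratio=0.10, min_ms=1000, frame_ms=20):
    """Return (start, end) sample indices of stretches whose amplitude
    envelope exceeds amp_ratio of the peak for at least min_ms."""
    frame = int(sr * frame_ms / 1000)
    n = len(signal) // frame
    # Per-frame peak amplitude as a simple envelope (avoids dropouts
    # at zero crossings of the raw waveform).
    env = np.abs(signal[:n * frame]).reshape(n, frame).max(axis=1)
    active = env > amp_ratio * env.max()          # amplitude threshold: 10%
    min_frames = min_ms // frame_ms               # duration threshold: 1000 ms
    segments, start = [], None
    for i, on in enumerate(active):
        if on and start is None:
            start = i                             # segment begins
        elif not on and start is not None:
            if i - start >= min_frames:           # reject too-short segments
                segments.append((start * frame, i * frame))
            start = None
    if start is not None and n - start >= min_frames:
        segments.append((start * frame, n * frame))
    return segments

# Demo: 1 s silence, 2 s of a 200 Hz tone, 1 s silence, at 16 kHz.
sr = 16000
tone = 0.5 * np.sin(2 * np.pi * 200 * np.arange(2 * sr) / sr)
sig = np.concatenate([np.zeros(sr), tone, np.zeros(sr)])
print(find_voice_segments(sig, sr))  # → [(16000, 48000)]
```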
ii. Speech Recognize Section
(1) Pre-Processing Unit
• Digitization
o Sampling frequency: 16 kHz
o Quantization: 16 bit
• Noise Reduction
o Additive noise: subtract the spectrum estimated from silent intervals.
o Multiplicative noise: subtract in the log-spectral domain.
(Figure: frequency characteristics of the SM57 microphone)
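The additive-noise step can be sketched as spectral subtraction: estimate the noise magnitude spectrum from silent frames and subtract it from each speech frame. The multiplicative (channel) component would analogously be subtracted in the log-spectral or cepstral domain. A minimal sketch; the names and details are illustrative, not the system's actual pre-processing:

```python
import numpy as np

def spectral_subtract(frames, noise_frames):
    """Subtract the average noise magnitude spectrum (estimated from
    silent frames) from each frame; floor at zero to avoid negatives."""
    noise_mag = np.mean(np.abs(np.fft.rfft(noise_frames, axis=1)), axis=0)
    spec = np.fft.rfft(frames, axis=1)
    mag = np.maximum(np.abs(spec) - noise_mag, 0.0)   # additive noise removal
    phase = np.angle(spec)                            # keep the noisy phase
    return np.fft.irfft(mag * np.exp(1j * phase), n=frames.shape[1], axis=1)

# Demo: denoise sine frames corrupted by Gaussian noise (seeded for repeatability).
rng = np.random.default_rng(0)
noise_only = 0.05 * rng.standard_normal((20, 256))
noisy = np.sin(2 * np.pi * 8 * np.arange(256) / 256) + 0.05 * rng.standard_normal((20, 256))
denoised = spectral_subtract(noisy, noise_only)
print(denoised.shape)  # → (20, 256)
```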
(2) Feature Extraction Unit
Resonant frequencies are effective as feature values.
• Resolution of human hearing
o Higher sensitivity at lower frequencies
• Filter that matches human hearing
→ Mel frequency
(2) Feature Extraction Unit
• Inverse Fourier transform on the Mel-frequency axis
o New axis: cepstrum
o Separates the voice pitch and the resonant frequencies
• MFCC (Mel-Frequency Cepstrum Coefficients)
o Information about vowels
• ΔMFCC
o Information about consonants
• Feature vector
o (Average power, MFCC, ΔMFCC)
(2) Feature Extraction Unit
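The chain above (power spectrum → mel filter bank → log → inverse transform to the cepstrum) can be sketched in plain NumPy. The filter and coefficient counts are typical textbook values, not necessarily those used in this system; ΔMFCC would then be the frame-to-frame difference of these coefficients:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel axis (denser at low Hz,
    matching the ear's higher sensitivity at lower frequencies)."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        for b in range(lo, c):
            fb[i, b] = (b - lo) / max(c - lo, 1)   # rising edge
        for b in range(c, hi):
            fb[i, b] = (hi - b) / max(hi - c, 1)   # falling edge
    return fb

def mfcc(frame, sr, n_filters=24, n_coeffs=12):
    """MFCC of one windowed frame: power spectrum -> mel filter bank ->
    log -> DCT (the inverse transform on the mel axis -> cepstrum)."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    mel_energies = np.log(mel_filterbank(n_filters, len(frame), sr) @ power + 1e-10)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * n + 1) / (2 * n_filters))
    return dct @ mel_energies

# Demo on a 25 ms Hamming-windowed 440 Hz tone at 16 kHz.
sr = 16000
t = np.arange(int(0.025 * sr))
frame = np.hamming(len(t)) * np.sin(2 * np.pi * 440 * t / sr)
print(mfcc(frame, sr).shape)  # → (12,)
```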
(3) Identification Unit
From Bayes' theorem, the most likely word sequence W for observed features X is
  W* = argmax_W P(W|X) = argmax_W P(X|W) P(W) / P(X)
where P(X) is constant for a given input and can be ignored.
(3) Identification Unit
Speech waveform: observable
Character information: not directly observable
→ Estimate the character information from the waveform using an HMM (Hidden Markov Model).
Maximum-likelihood calculation: Viterbi algorithm
Machine learning (training): Baum-Welch algorithm
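The Viterbi step can be illustrated on a toy discrete-observation HMM. A real recognizer's acoustic models are far larger, with continuous emission densities, so this is only a sketch of the algorithm itself:

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely hidden-state path for an observation sequence.
    pi: initial state probs, A[i, j]: transition i->j, B[i, k]: emission."""
    n_states, T = A.shape[0], len(obs)
    logd = np.log(pi) + np.log(B[:, obs[0]])      # log-probs avoid underflow
    back = np.zeros((T, n_states), dtype=int)     # best predecessor per state
    for t in range(1, T):
        scores = logd[:, None] + np.log(A)        # scores[i, j]: best path via i -> j
        back[t] = np.argmax(scores, axis=0)
        logd = scores[back[t], np.arange(n_states)] + np.log(B[:, obs[t]])
    path = [int(np.argmax(logd))]                 # backtrack from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy example: 2 hidden states, 2 observable symbols.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi([0, 0, 1, 1], pi, A, B))  # → [0, 0, 1, 1]
```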
iii. Scene Retrieve Section
• Matching keyword and text
1. Input a keyword.
2. Match the keyword against the recognized text by string searching.
3. Extract the scenes in which the keyword was spoken.
4. Output a thumbnail.
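Steps 1–4 can be sketched as a plain string search over per-segment transcripts. The segment data below is made up for illustration; the actual system additionally renders a thumbnail for each hit:

```python
# Each transcript entry is (start_sec, end_sec, recognized_text) for one
# voice segment produced by the earlier sections (hypothetical data).
def retrieve_scenes(transcripts, keyword):
    """Return (start, end) times of segments whose text contains the keyword."""
    return [(start, end) for start, end, text in transcripts
            if keyword in text]                   # plain string search

transcripts = [
    (0.0, 4.2, "good evening here is the news"),
    (4.2, 9.0, "the weather tomorrow will be sunny"),
    (9.0, 12.5, "more news after the break"),
]
print(retrieve_scenes(transcripts, "news"))  # → [(0.0, 4.2), (9.0, 12.5)]
```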
Outline
1. Introduction
2. Aim of Study
3. Composition of System
i. Voice Divide Section
ii. Speech Recognize Section
iii. Scene Retrieve Section
4. Evaluation Experiment
5. Conclusion
4. Evaluation Experiment
1. Compare the recognition results with the words actually heard.
2. Calculate the recognition rate.
3. Evaluate it by the number of characters per word.

Sample data
Video:  NHK news
Time:   3 minutes
Number: 30 videos
Words:  457 words
Engine: Julius
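The evaluation procedure (steps 1–3) can be sketched as follows. The word pairs below are invented examples, not the actual NHK data:

```python
def recognition_rates(pairs):
    """pairs: list of (reference_word, recognized_word).
    Returns (overall_rate, {word_length_in_chars: rate})."""
    correct = sum(ref == hyp for ref, hyp in pairs)
    by_len = {}
    for ref, hyp in pairs:                        # group by word length
        n_ok, n_all = by_len.get(len(ref), (0, 0))
        by_len[len(ref)] = (n_ok + (ref == hyp), n_all + 1)
    return correct / len(pairs), {k: ok / n for k, (ok, n) in by_len.items()}

# Invented reference/recognized pairs for illustration only.
pairs = [("news", "news"), ("sunny", "sunny"), ("tomorrow", "borrow"), ("rain", "rain")]
overall, by_len = recognition_rates(pairs)
print(overall)  # → 0.75
```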
4. Evaluation Experiment
The total average recognition rate is 68%.

Word length [chars]:  1    2    3    4    5    6
Recognition rate:     67%  73%  69%  46%  45%  40%
4. Evaluation Experiment
• Verify the correspondence between the keyword and the seek destination.
o Select a thumbnail and play from that scene.
o Check whether the keyword was actually spoken.
4. Evaluation Experiment
• The recognition rate decreases as the number of characters increases.
• The retrieved scenes correspond to the keyword.
• Recognition errors occur in weak consonant parts.
o The Voice Divide Section needs improvement.
o Recognition accuracy must also be improved.
Outline
1. Introduction
2. Aim of Study
3. Composition of System
i. Voice Divide Section
ii. Speech Recognize Section
iii. Scene Retrieve Section
4. Evaluation Experiment
5. Conclusion
5. Conclusion
• A system for watching videos efficiently
o Uses speech recognition
o Makes annotations automatically
• Future work
o Adopt the zero-crossing rate in the Voice Divide Section.
o Adopt the latest speech recognition technology.
o Incorporate image recognition.
Thank you for your attention!