A Study on the Video Scene Retrieving System
DESCRIPTION
Recently, a variety of video data are being generated, stored, and accessed with advances in computer technology and the Internet. To search for a video, or a video scene, quickly from such data, an efficient and effective technique is needed. I therefore propose a video scene retrieval system based on speech recognition using HMMs (Hidden Markov Models). The proposed system is applied to scene retrieval experiments that evaluate the recognition rate for 457 short words. The experimental results show an average detection accuracy of 68%.

TRANSCRIPT
A Study on the Video Scene Retrieving System
with a Speech Recognizer
May 14, 2013
Yoshika OSAWA
Kohno Lab.
Outline
1. Introduction
2. Aim of Study
3. Composition of System
i. Voice Divide Section
ii. Speech Recognize Section
iii. Scene Retrieve Section
4. Evaluation Experiment
5. Conclusion
1. Introduction
• A variety of video data are being generated, stored, and accessed with advances in the Internet.
• To search for a video scene quickly from such data, an efficient technique is needed.
1. Introduction
• Multimedia Annotations
o Nagao (2001)
1. Introduction
• A Subtitling System for Broadcast Programs with a Speech Recognizer
o Ando et al. (2001)
1. Introduction
• Extract voices from the video.
• Advantages of voice:
o Easy to make into text.
o Simple association with scenes.
→ Apply speech recognition to scene retrieving.
Outline
1. Introduction
2. Aim of Study
3. Composition of System
i. Voice Divide Section
ii. Speech Recognize Section
iii. Scene Retrieve Section
4. Evaluation Experiment
5. Conclusion
2. Aim of Study
Implement a scene retrieving system, then verify its accuracy and check its operation.
Make annotations automatically with speech recognition.
Outline
1. Introduction
2. Aim of Study
3. Composition of System
i. Voice Divide Section
ii. Speech Recognize Section
iii. Scene Retrieve Section
4. Evaluation Experiment
5. Conclusion
3. Composition of System
Start → Select a Video → Voice Divide Section → Speech Recognize Section → Input a Keyword → Scene Retrieve Section → Output the Result → End
i. Voice Divide Section
• Focus on the amplitude.
o Use signals while they exceed the amplitude threshold.
o Reject segments that are too short to recognize.
o Thresholds derived from experiments:

Axis      | Threshold
Amplitude | 10 [%]
Time      | 1000 [ms]
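The amplitude-based segmentation above can be sketched as follows. This is a minimal illustration using the thresholds from the table; the function name and the frame-based envelope are assumptions, not the actual implementation:

```python
import numpy as np

def find_voice_segments(signal, sr, amp_ratio=0.10, min_ms=1000, frame_ms=20):
    """Return (start, end) sample indices of stretches whose amplitude
    envelope exceeds amp_ratio of the peak for at least min_ms."""
    frame = int(sr * frame_ms / 1000)
    n = len(signal) // frame
    # Per-frame peak amplitude as a simple envelope (avoids dropouts
    # at zero crossings of the raw waveform).
    env = np.abs(signal[:n * frame]).reshape(n, frame).max(axis=1)
    active = env > amp_ratio * env.max()          # amplitude threshold: 10%
    min_frames = min_ms // frame_ms               # duration threshold: 1000 ms
    segments, start = [], None
    for i, on in enumerate(active):
        if on and start is None:
            start = i                             # segment begins
        elif not on and start is not None:
            if i - start >= min_frames:           # reject too-short segments
                segments.append((start * frame, i * frame))
            start = None
    if start is not None and n - start >= min_frames:
        segments.append((start * frame, n * frame))
    return segments

# Demo: 1 s silence, 2 s of a 200 Hz tone, 1 s silence, at 16 kHz.
sr = 16000
tone = 0.5 * np.sin(2 * np.pi * 200 * np.arange(2 * sr) / sr)
sig = np.concatenate([np.zeros(sr), tone, np.zeros(sr)])
print(find_voice_segments(sig, sr))  # → [(16000, 48000)]
```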
ii. Speech Recognize Section
(1) Pre-Processing Unit
• Digitization
o Sampling frequency: 16 kHz
o Quantization: 16 bit
• Noise Reduction
o Additive noise: subtract the spectrum estimated from silent intervals.
o Multiplicative noise: subtract in the log-spectral domain.
(Figure: frequency characteristics of the SM57 microphone)
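The additive-noise step can be sketched as spectral subtraction: estimate the noise magnitude spectrum from silent frames and subtract it from each speech frame. The multiplicative (channel) component would analogously be subtracted in the log-spectral or cepstral domain. A minimal sketch; the names and details are illustrative, not the system's actual pre-processing:

```python
import numpy as np

def spectral_subtract(frames, noise_frames):
    """Subtract the average noise magnitude spectrum (estimated from
    silent frames) from each frame; floor at zero to avoid negatives."""
    noise_mag = np.mean(np.abs(np.fft.rfft(noise_frames, axis=1)), axis=0)
    spec = np.fft.rfft(frames, axis=1)
    mag = np.maximum(np.abs(spec) - noise_mag, 0.0)   # additive noise removal
    phase = np.angle(spec)                            # keep the noisy phase
    return np.fft.irfft(mag * np.exp(1j * phase), n=frames.shape[1], axis=1)

# Demo: denoise sine frames corrupted by Gaussian noise (seeded for repeatability).
rng = np.random.default_rng(0)
noise_only = 0.05 * rng.standard_normal((20, 256))
noisy = np.sin(2 * np.pi * 8 * np.arange(256) / 256) + 0.05 * rng.standard_normal((20, 256))
denoised = spectral_subtract(noisy, noise_only)
print(denoised.shape)  # → (20, 256)
```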
(2) Feature Extraction Unit
Resonant frequencies are effective as feature values.
• Resolution of human hearing
o Higher sensitivity at lower frequencies
• Filter that matches human hearing
→ Mel frequency
(2) Feature Extraction Unit
• Inverse Fourier transform on the Mel-frequency axis
o New axis: cepstrum
o Separates the voice pitch and the resonant frequencies
• MFCC (Mel-Frequency Cepstrum Coefficients)
o Information about vowels
• ΔMFCC
o Information about consonants
• Feature vector
o (Average power, MFCC, ΔMFCC)
(2) Feature Extraction Unit
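The chain above (power spectrum → mel filter bank → log → inverse transform to the cepstrum) can be sketched in plain NumPy. The filter and coefficient counts are typical textbook values, not necessarily those used in this system; ΔMFCC would then be the frame-to-frame difference of these coefficients:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel axis (denser at low Hz,
    matching the ear's higher sensitivity at lower frequencies)."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        for b in range(lo, c):
            fb[i, b] = (b - lo) / max(c - lo, 1)   # rising edge
        for b in range(c, hi):
            fb[i, b] = (hi - b) / max(hi - c, 1)   # falling edge
    return fb

def mfcc(frame, sr, n_filters=24, n_coeffs=12):
    """MFCC of one windowed frame: power spectrum -> mel filter bank ->
    log -> DCT (the inverse transform on the mel axis -> cepstrum)."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    mel_energies = np.log(mel_filterbank(n_filters, len(frame), sr) @ power + 1e-10)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * n + 1) / (2 * n_filters))
    return dct @ mel_energies

# Demo on a 25 ms Hamming-windowed 440 Hz tone at 16 kHz.
sr = 16000
t = np.arange(int(0.025 * sr))
frame = np.hamming(len(t)) * np.sin(2 * np.pi * 440 * t / sr)
print(mfcc(frame, sr).shape)  # → (12,)
```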
(3) Identification Unit
From Bayes' theorem, the most likely word sequence W for observed features X is
  W* = argmax_W P(W|X) = argmax_W P(X|W) P(W) / P(X)
where P(X) is constant for a given input and can be ignored.
(3) Identification Unit
Speech waveform: observable
Character information: not directly observable
→ Estimate the character information from the waveform using an HMM (Hidden Markov Model).
Maximum-likelihood calculation: Viterbi algorithm
Machine learning (training): Baum-Welch algorithm
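The Viterbi step can be illustrated on a toy discrete-observation HMM. A real recognizer's acoustic models are far larger, with continuous emission densities, so this is only a sketch of the algorithm itself:

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely hidden-state path for an observation sequence.
    pi: initial state probs, A[i, j]: transition i->j, B[i, k]: emission."""
    n_states, T = A.shape[0], len(obs)
    logd = np.log(pi) + np.log(B[:, obs[0]])      # log-probs avoid underflow
    back = np.zeros((T, n_states), dtype=int)     # best predecessor per state
    for t in range(1, T):
        scores = logd[:, None] + np.log(A)        # scores[i, j]: best path via i -> j
        back[t] = np.argmax(scores, axis=0)
        logd = scores[back[t], np.arange(n_states)] + np.log(B[:, obs[t]])
    path = [int(np.argmax(logd))]                 # backtrack from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy example: 2 hidden states, 2 observable symbols.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi([0, 0, 1, 1], pi, A, B))  # → [0, 0, 1, 1]
```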
iii. Scene Retrieve Section
• Matching keyword and text
1. Input a keyword.
2. Match the keyword against the recognized text by string searching.
3. Extract the scenes in which the keyword was spoken.
4. Output a thumbnail.
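Steps 1–4 can be sketched as a plain string search over per-segment transcripts. The segment data below is made up for illustration; the actual system additionally renders a thumbnail for each hit:

```python
# Each transcript entry is (start_sec, end_sec, recognized_text) for one
# voice segment produced by the earlier sections (hypothetical data).
def retrieve_scenes(transcripts, keyword):
    """Return (start, end) times of segments whose text contains the keyword."""
    return [(start, end) for start, end, text in transcripts
            if keyword in text]                   # plain string search

transcripts = [
    (0.0, 4.2, "good evening here is the news"),
    (4.2, 9.0, "the weather tomorrow will be sunny"),
    (9.0, 12.5, "more news after the break"),
]
print(retrieve_scenes(transcripts, "news"))  # → [(0.0, 4.2), (9.0, 12.5)]
```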
Outline
1. Introduction
2. Aim of Study
3. Composition of System
i. Voice Divide Section
ii. Speech Recognize Section
iii. Scene Retrieve Section
4. Evaluation Experiment
5. Conclusion
4. Evaluation Experiment
1. Compare the recognition results with the words actually heard.
2. Calculate the recognition rate.
3. Evaluate it by the number of characters per word.

Sample data
Video:  NHK news
Time:   3 minutes
Number: 30 videos
Words:  457 words
Engine: Julius
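The evaluation procedure (steps 1–3) can be sketched as follows. The word pairs below are invented examples, not the actual NHK data:

```python
def recognition_rates(pairs):
    """pairs: list of (reference_word, recognized_word).
    Returns (overall_rate, {word_length_in_chars: rate})."""
    correct = sum(ref == hyp for ref, hyp in pairs)
    by_len = {}
    for ref, hyp in pairs:                        # group by word length
        n_ok, n_all = by_len.get(len(ref), (0, 0))
        by_len[len(ref)] = (n_ok + (ref == hyp), n_all + 1)
    return correct / len(pairs), {k: ok / n for k, (ok, n) in by_len.items()}

# Invented reference/recognized pairs for illustration only.
pairs = [("news", "news"), ("sunny", "sunny"), ("tomorrow", "borrow"), ("rain", "rain")]
overall, by_len = recognition_rates(pairs)
print(overall)  # → 0.75
```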
4. Evaluation Experiment
The total average recognition rate is 68%.

Word length [chars]:  1    2    3    4    5    6
Recognition rate:     67%  73%  69%  46%  45%  40%
4. Evaluation Experiment
• Verify the correspondence between the keyword and the seek destination.
o Select a thumbnail and play from that scene.
o Check whether the keyword was actually spoken.
4. Evaluation Experiment
• The recognition rate decreases as the number of characters increases.
• The retrieved scenes correspond to the keyword.
• Recognition errors occur in weak consonant parts.
o The Voice Divide Section needs improvement.
o Recognition accuracy must also be improved.
Outline
1. Introduction
2. Aim of Study
3. Composition of System
i. Voice Divide Section
ii. Speech Recognize Section
iii. Scene Retrieve Section
4. Evaluation Experiment
5. Conclusion
5. Conclusion
• A system for watching videos efficiently
o Uses speech recognition
o Makes annotations automatically
• Future work
o Adopt the zero-crossing rate in the Voice Divide Section.
o Adopt the latest speech recognition technology.
o Incorporate image recognition.
Thank you for your attention!