speech recognition techniques

Acoustic Speech Recognition Techniques

Audio Signal Recognized Text

Sonu Kumar Mishra

BE Comp 2015-16

Introduction Historical Survey Motivation General Model of ASR Feature Extraction Hidden Markov Model Existing Systems Developing ASR Systems Revolution in ASR

Human

Computer

Speech Speech

Text Text

Meaning

Input Output

UnderstandingGeneration

The first speech recognition system (Audrey) was developed at bell laboratories in 1952. It could recognise numbers spoken by one person.

In 1970s Carnegie Mellon came out with HARPY system which could recognise 1011 words with different pronunciation.

In 1980s new systems based on Hidden Markov Model was introduced. HMM was statistical approach and more robust than the earlier technology.

Language is the fundamental mode of communication. Communicating with machines in natural language effectively is a challenge.

We can use ASR systems to control machines, find contents online and contribute to generate contents.

Most of the speech recognition systems and contents (about 80 %) are available for 10 major languages. Hence, there is a need to expand the system for local languages.

If we can interact with machines in our own local language then it would be greater achievement for modern era.

Speech Input

Analog to digital

Feature Extraction

(Generate speech fingerprint)

Compare and Select wordsProb

Matrices updated

Compare and Select Sentence level match

Pick most probable word

Output

Training algorithm

Word fingerprint template

Sentence fingerprint template

HMM

“Dividing sound waves, extract phonemes & represent using some parameters”

LPCC MFCC RASTA-PLP

Low resource, High popularity,Easy implementation,Single speaker, single language, Below 300 words

Moderate resource, High popularity, Easy-moderate impl,Multi speaker, Multi-language, Moderate vocabulary

High resource, Low popularity, Modrate-hard impl, Multi-speaker, Multi-language, Large vocabulary

Power Spectral Analysis (FFT)

First Order Derivative (DELTA)

Energy Normalization

Outdated Techniques

DNN

Multiple User

Multiple Language

Large Vocabulary

Abundant Resources

Phonetic Dictionary (TTS Synthesizer)

Z1

XnX2X1

Z2 Zn

Observed Data

Hidden or latent data

“Markov Chain”

Why HMM ?

Simple for sequential and temporal data. Handle real world applications. It works on the principle of New State = ʄ (old state, noise)

Initial Probability

Transition Probability

Observed Probability

Applications :

Speech Recognition. Facial Expression Recognition. Handwriting Recognition. Bioinformatics : Analyzing biological

data.

Large systems like Siri, Google voice and Cortana are based on neural network.

High computing processors.

AI algorithms Parallel processing.

Time

Money

ScientistsComputing Power

Engineers

Using built-in supportIn order to build from scratch

Kaldi is a toolkit for speech recognition written in C++ and licensed under the Apache License v2.0. Kaldi is intended for use by speech recognition researchers.

An open source toolkit for speech recognition, which includes a recognizer library written in C; an adjustable, modifiable recognizer written in Java. Acoustic model, language model, Input source, Dictionary

Use of ASR systems to interact with the devices used in daily life.

ASR systems working in local languages.

Developing Neural network based ASR systems working in all major languages.

speech recognition techniques

Engineering