superhuman speech recognition jul 2 2008

25
IBM Research  © 2006 IBM Corporation Superhuman Speech Recognition: Technology Challenges & Market Adoption David Nahamoo IBM Fellow Speech CTO, IBM Research July 2, 2008  

Upload: ofershochet

Post on 31-May-2018

226 views

Category:

Documents


0 download

TRANSCRIPT

8/14/2019 SuperHuman Speech Recognition Jul 2 2008

http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 1/25

IBM Research

  © 2006 IBM Corporation

Superhuman Speech Recognition:Technology Challenges & Market Adoption

David NahamooIBM FellowSpeech CTO, IBM Research

July 2, 2008  

8/14/2019 SuperHuman Speech Recognition Jul 2 2008

http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 2/25

IBM Research

© 2006 IBM Corporation2

•This chart represents all revenue for speech related ecosystem activity.

•Revenue exceeded $1B for the 1st  time in 2006 

•Note also that hosted services will represent ½ of speech related revenue in 2011

WW Voice-Driven Conversation Access Technology Forecast 

*Opus Research 02_2007 

Overall Speech Market Opportunity

8/14/2019 SuperHuman Speech Recognition Jul 2 2008

http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 3/25

IBM Research

© 2006 IBM Corporation3

Market UsageNeed-based SegmentationSpeech Segments

Mobile Internet, Yellow Pages

SMS, IM, email

Embedded - Automotive, Mobile

Devices, Appliances, Entertainment

Contact Centers, Tourism, Global

Digital Communities, Media (XCast)

Media, Medical, Legal, Education,

Government, Unified Messaging

Contact Centers, Government

Contact Centers, Media,

Government

Contact Centers

Information Search & Retrieval

Command & Control

Multilingual Communication

Information Access and provision

Security

Intelligence

Transaction/Problem Solving

Speech Search & Messaging

Speech Control

Speech Translation

Speech Transcription

Speech Biometrics

Speech Analytics

Speech Self Service

Speech Market Segments

• Improved accuracy• Much larger vocabulary speech recognition system

8/14/2019 SuperHuman Speech Recognition Jul 2 2008

http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 4/25

IBM Research

© 2006 IBM Corporation4

New Opportunity Areas

Contact Centers Analytics

 – Quality Assurance, Real Time Alerts, Compliance

Media Transcription

 – Closed captioning

Accessibility

 – Government, Lectures

Content Analytics

 – Audio-indexing, cross-lingual information retrieval, multi-media mining

Dictation

 – Medical, Legal, Insurance, Education

Unified Communication

 – Voicemail, Conference calls, email and SMS on hand held

8/14/2019 SuperHuman Speech Recognition Jul 2 2008

http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 5/25

IBM Research

© 2006 IBM Corporation5

Human

Baseline for 

conversations

Target zone

8/14/2019 SuperHuman Speech Recognition Jul 2 2008

http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 6/25

IBM Research

© 2006 IBM Corporation6

Performance Results (2004 DARPA EARS Evaluation)

14

15

16

17

18

19

20

0.1 1 10 100

xRT

    W    E

    R

IBM

V3

V2

V1-V4

V4

IBM: Best Speed-Accuracy Tradeoff 

(Last public evaluation of English Telephony Transcription)

8/14/2019 SuperHuman Speech Recognition Jul 2 2008

http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 7/25

IBM Research

© 2006 IBM Corporation7

MALACH

Multilingual Access to Large Spoken Ar CHives

• Funded by NSF, 5-year project (Started in Oct. 2001)

Project Participants

 – IBM, Visual History Foundation, Johns Hopkins University, University of Maryland,

Charles University and University of West Bohemia

Objective

 – Improve access to large multilingual collections of spontaneous speech by

advancing the state-of-the-art in technologies that work together to achieve

this objective: Automatic Speech Recognition, Computer-Assisted Translation

, Natural Language Processing and Information Retrieval

8/14/2019 SuperHuman Speech Recognition Jul 2 2008

http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 8/25

IBM Research

© 2006 IBM Corporation8

MALACH: A challenging speech corpus

Emotional speech

• young man they ripped his teeth and beard out they beat him

Disfluencies

• A- a- a- a- band with on- our- on- our- arm

Multimedia digital archive: 116,000 hours of interviews with over 52,000 survivors,liberators, rescuers and witnesses of the Nazi Holocaust, recorded in 32

languages.

Goal: improved access to large multilingual spoken archives

Challenges:

Frequent interruptions:

• CHURCH TWO DAYS these were the people who were to go to

march TO MARCH and your brother smuggled himself SMUGGLED

IN IN IN IN

8/14/2019 SuperHuman Speech Recognition Jul 2 2008

http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 9/25

IBM Research

© 2006 IBM Corporation9

Word Error Rates

20

30

40

50

60

70

80

90

Jan. '02 Oct. '02 Oct. '03 June '04 Nov. '04

Effects of Customization (MALACH Data)

State-of-the-art ASR system

trained on SWB data (8KHz)

MALACH Training data seen

by AM and LM

fMPE, MPE, Consensus

decoding

8/14/2019 SuperHuman Speech Recognition Jul 2 2008

http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 10/25

IBM Research

© 2006 IBM Corporation10

Improvement in Word Error Rate for IBM embedded ViaVoice

0

1

2

3

4

5

6

WER across 3 car speeds and 4 grammars

8/14/2019 SuperHuman Speech Recognition Jul 2 2008

http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 11/25

IBM Research

© 2006 IBM Corporation11

Progress in Word Error Rate – IBM WebSphere Voice Server 

Grammar Tasks over Telephone

2001 - 2006

0

1

2

3

4

5

6

WVS

Word Error Rate %

45% relative improvement in WER the last 2.5 years

20% relative improvement in speed in the last 1.5 years

8/14/2019 SuperHuman Speech Recognition Jul 2 2008

http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 12/25

IBM Research

© 2006 IBM Corporation12

Multi-Talker Speech Separation Task

male and female speaker at 0dB

Lay white at X 8 soon

Bin Green with F 7 now

8/14/2019 SuperHuman Speech Recognition Jul 2 2008

http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 13/25

IBM Research

© 2006 IBM Corporation13

Two Talker Speech Separation Challenge Results

   R  e  c

  o  g  n   i   t   i  o  n   E  r  r  o  r

Examples:

Mixture

8/14/2019 SuperHuman Speech Recognition Jul 2 2008

http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 14/25

IBM Research

© 2006 IBM Corporation14

1992 1993 1994 1995 1996 1997 1998 1999 2000

1

10

100

Voicemail

SWITCHBOARD

BROADCAST NEWS

BROADCAST-HUMAN

SWITCHBOARD-HUMAN

Comparison of Human & Machine Speech Recognition

WSJ Broadcast Conv Tel Vmail SWB Call center Meeting0

10

20

30

40

50

60

70

Clean Speech Spontaneous Speech

Human-Machine Human-Human

                                                                                                                                          W                                                                                                     o                                                                                                       r                                                                                                                                         d  

                                                                                                                                         E                                                                                                     r                                                                                                      r                                                                                                     o                                                                                                        r

                                                                                                                                         R                                                                                                     a                                                                                                                                        t                                                                                                       e  

8/14/2019 SuperHuman Speech Recognition Jul 2 2008

http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 15/25

IBM Research

© 2006 IBM Corporation15

IBM’s Superhuman Speech Recognition

Universal Recognizer 

• Any accent

• Any topic

• Any noise conditions

• Broadcast, phone, in car, or live

• Multiple languages

• Conversational

8/14/2019 SuperHuman Speech Recognition Jul 2 2008

http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 16/25

8/14/2019 SuperHuman Speech Recognition Jul 2 2008

http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 17/25

IBM Research

© 2006 IBM Corporation17

Acoustic Modeling Today

Approach: Hidden Markov Models

 – Observation densities (GMM) for P( feature | class )

• Mature mathematical framework, easy to combine with linguistic information

• However, does not directly model what we want i.e., P( words | acoustics )

Training: Use transcribed speech data

 – Maximum Likelihood

 – Various discriminative criteria

Handling Training/Test Mismatches:

 – Avoid mismatches by collecting “custom” data

 – Adaptation & adaptive training algorithms

Significantly worse than humans for tasks with little or no linguistic

information - e.g., digits/letters recognition Human performance extremely robust to acoustic variations 

 – due to speaker, speaking style, microphone, channel, noise, accent, & dialect variations

Steady progress over the years

Continued progress using current methodology very likely in thefuture

8/14/2019 SuperHuman Speech Recognition Jul 2 2008

http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 18/25

IBM Research

© 2006 IBM Corporation18

Towards a Non-Parametric Approach to Acoustics General Idea: Back to pattern recognition basics!

 – Break test utterance into sequence of larger segments (phone, syllable, word,phrase)

 – Match segments to closest ones in training corpus using some metric (possiblyusing long distance models)

 – Helps to get it right if you’ve heard it before

Why prefer this approach over HMMs?

 – HMMs compress training by x1000; too many modeling assumptions

• 1000hrs ~ 30Gb; State-of-the-art acoustic models ~ 30Mb• Relaxing assumptions have been key to all recent improvements in acoustic modeling

How can we accomplish this? – Store & index training data for rapid access of training segments close to test segments

 – Develop a metric D( train_seq, test_seq): obvious candidate is DTW with appropriate metric and warping rules

Back to the Future?

 – Reminiscent of DTW & Segmental models from late 80’s – ME was missing

 – Limited by computational resources (storage/cpu/data) then & so HMMs won

Implications:

 – Need 100x more data for handling larger units (hence 100x morecomputing resources)

 – Better performance with more data – likely to have “heard it before”

8/14/2019 SuperHuman Speech Recognition Jul 2 2008

http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 19/25

IBM Research

© 2006 IBM Corporation19

Utilizing Linguistic Information in ASRUtilizing Linguistic Information in ASR

Today’s standard ASR does not explicitly use linguistic information

 – But recent work at JHU, SRI and IBM all show promise – Semantic structured LM improves ASR significantly for limited domains

Reduces WER by 25% across many tasks (Air Travel, Medical)

A large amount of linguistic knowledge sources now available, but not usedfor ASR Inside IBM

WWW text: Raw text: 50 million pages ~25 billion words, ~10% useful after cleanup News text: 3-4 billion words, broadcast or newswires Name entity annotated text: 2 million words tagged Ontologies Linguistic knowledge used in rule-based MT system

External WordNet, FrameNet, Cyc ontologies PennTreeBank, Brown corpus (syntactic & semantic annotated) Online dictionaries and thesaurus Google

8/14/2019 SuperHuman Speech Recognition Jul 2 2008

http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 20/25

IBM Research

© 2006 IBM Corporation20

•Acoustic Confusability: LM should be optimized to distinguish between acoustic confusable

sets, rather than based on N-gram counts

•Automatic LM adaptation at different levels: discourse, semantic structure, and phrase

levels

Coherence:

semantic,syntactic,pragmaticWash your clothes withsoap/ soup.

David and his/her fatherwalked into the room.

I ate a one/ nine poundsteak.

SemanticParserDialogue

State

Documen

t Type

Speaker(turn,

gender, ID) Syntactic

Parser

WordClass

Embedded

Grammar

NamedEntity

WorldKnowledge

W

1

, ..., WN

Super Structured LM for LVCSRSuper Structured LM for LVCSR

8/14/2019 SuperHuman Speech Recognition Jul 2 2008

http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 21/25

IBM Research

© 2006 IBM Corporation21

Combination Decoders “ROVER” is used in all current systems

 – NIST tool that combines multiple system outputs through voting Individual systems currently designed in an ad-hoc manner 

Only 5 or so systems possible

“I feel shine today”

“I veal fine today”

“I feel fine toady”

“I feel fine today”

An army (“Million”) of simple decoders

• Each makes uncorrelated errors

8/14/2019 SuperHuman Speech Recognition Jul 2 2008

http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 22/25

IBM Research

© 2006 IBM Corporation22

• Feature definition is key challenge• Maximum entropy model used to compute word probabilities.• Information sources combined in unified theoretical framework.

• Long-span segmental analysis inherently robust to both

stationary and transient noise

Million Feature Paradigm: Acoustic information for ASR 

Discard transient noise;

Global adaptation for 

stationary noise

Segmental analysis

Broadband features

 Narrowband features Onset features

Trajectory features

Information Sources Noise Sources

8/14/2019 SuperHuman Speech Recognition Jul 2 2008

http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 23/25

IBM R h

8/14/2019 SuperHuman Speech Recognition Jul 2 2008

http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 24/25

IBM Research

© 2006 IBM Corporation24

Generalization Dilemma

   P  e  r   f  o  r  m  a  n  c

  e

In-Domain Out-of-Domain

Test Conditions

Complex model:

brute force learning

Simple

model

Want to get here:

Model combination:

Can we at least get best

of both worlds?

Correct

complex

model(simple model

on the right

manifold)

The Gutter of Data Addiction

8/14/2019 SuperHuman Speech Recognition Jul 2 2008

http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 25/25