


Design and Development of sEMG-based Silent Speech Recognition

Serge H. Roy 1, Geoffrey S. Meltzner 2, James T. Heaton 3, Yunbin Deng 4, Bhawna Shiwani 1, Gianluca De Luca 1, Joshua C. Kline 1

1 Delsys, Inc and Altec Inc, Natick, USA; 2 VocaliD, Inc., Belmont, USA; 3 Harvard Medical School Department of Surgery, MGH, Boston, USA; 4 BAE Systems, Burlington, USA

WEARABLE SENSORS FOR MOVEMENT SCIENCES

Acknowledgements & Support

Final sEMG SSR System – Algorithms, Sensors and Mobile Deployment

Objective

We set out to design and develop an SSR system based on recordings of the surface electromyographic (sEMG) signal from articulator muscles of the face and neck during silently mouthed (subvocal) speech.

3) Initial Data Processing

• Separating speech from non-speech sEMG activity

• Finite multi-channel state machine

• Robust against noise
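The separation step above can be illustrated with a minimal sketch. This is not the published algorithm; the thresholds, frame counts, and the any-channel energy rule are illustrative assumptions. The idea is that a small state machine requires a sustained run of active frames before declaring speech onset (and of inactive frames before declaring offset), which makes it robust against isolated noise spikes.

```python
# Minimal sketch of a multi-channel speech/non-speech state machine.
# Thresholds and frame counts are hypothetical, not the published values.

def frame_energy(frame):
    """Mean squared amplitude of one analysis frame."""
    return sum(x * x for x in frame) / len(frame)

def detect_speech(channels, threshold=0.01, onset_frames=3, offset_frames=5):
    """Label each frame index True (speech) or False (non-speech).

    channels: list of per-channel frame lists; channels[ch][t] is one frame.
    A frame is 'active' if any channel's energy exceeds the threshold; the
    state machine waits for a run of active (or inactive) frames before
    switching state, suppressing brief noise bursts.
    """
    n_frames = len(channels[0])
    active = [
        any(frame_energy(ch[t]) > threshold for ch in channels)
        for t in range(n_frames)
    ]
    labels, state, run = [], "SILENCE", 0
    for t in range(n_frames):
        if state == "SILENCE":
            run = run + 1 if active[t] else 0
            if run >= onset_frames:
                state, run = "SPEECH", 0
                # retroactively mark the onset run as speech
                for k in range(onset_frames - 1):
                    labels[-1 - k] = True
        else:
            run = run + 1 if not active[t] else 0
            if run >= offset_frames:
                state, run = "SILENCE", 0
                for k in range(offset_frames - 1):
                    labels[-1 - k] = False
        labels.append(state == "SPEECH")
    return labels
```

A short burst of one or two noisy frames never reaches the onset count, so it is labelled non-speech; likewise a brief dip inside a phrase does not end the speech segment.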

Algorithm Development – Strategic evolution of Hidden Markov Models (HMMs) for SSR

• Our SSR system was able to recognize subvocal speech with 8.9% WER on a 2,200-word vocabulary spanning 1,200 continuous phrases, including previously unseen words.

• The miniaturized sensors provide a robust and unencumbering facial interface that can transmit data via a custom wireless or Bluetooth protocol for portable integration with a mobile device.

• These results demonstrate the viability of our SSR system as a silent modality of speech communication that can be developed further for persons with speech impairments (Meltzner et al., 2017), military personnel, or consumer applications.

Background

Methods

Conclusion

• VocaliD, Inc., Belmont, USA

• MGH, Boston, USA

• BAE Systems, Inc. Burlington, USA

(R44DC014870)

1) Experiment Setup

References

1. Meltzner et al. Silent Speech Recognition as an Alternative Communication Device for Persons With Laryngectomy. IEEE Trans. on ASLP, 2017.

• Mel Frequency Cepstral Coefficients (MFCCs) provided the lowest word error rate (WER) of 9.6% across all combinations of features tested.
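As a concrete illustration of the MFCC-style front end, here is a hedged sketch of the generic pipeline (Hamming window, DFT magnitude, mel filterbank, log, DCT). The sampling rate, frame size, and filter counts below are illustrative assumptions, not the study's parameters, and a pure-Python DFT stands in for the FFT a real system would use.

```python
import math

# Illustrative MFCC-style feature extraction for a single signal frame.
# Pipeline: Hamming window -> DFT magnitude -> mel filterbank -> log -> DCT.
# Sampling rate, filter count, and coefficient count are example values.

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def dft_magnitude(frame):
    """Naive O(n^2) DFT magnitude spectrum (bins 0..n/2)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        re = sum(frame[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = -sum(frame[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        mags.append(math.hypot(re, im))
    return mags

def mfcc(frame, sample_rate=2000, n_filters=12, n_coeffs=6):
    n = len(frame)
    # Hamming window reduces spectral leakage at frame edges
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * t / (n - 1)))
                for t, x in enumerate(frame)]
    spectrum = dft_magnitude(windowed)
    # Triangular filters spaced evenly on the mel scale
    mel_points = [i * hz_to_mel(sample_rate / 2) / (n_filters + 1)
                  for i in range(n_filters + 2)]
    bins = [int(mel_to_hz(m) * n / sample_rate) for m in mel_points]
    energies = []
    for i in range(1, n_filters + 1):
        lo, mid, hi = bins[i - 1], bins[i], bins[i + 1]
        e = 0.0
        for k in range(lo, min(hi + 1, len(spectrum))):
            w = (k - lo) / max(mid - lo, 1) if k <= mid else (hi - k) / max(hi - mid, 1)
            e += w * spectrum[k] ** 2
        energies.append(math.log(e + 1e-10))
    # DCT-II decorrelates the log filterbank energies into cepstral coefficients
    return [sum(energies[j] * math.cos(math.pi * c * (j + 0.5) / n_filters)
                for j in range(n_filters)) for c in range(n_coeffs)]
```

In practice these per-frame coefficients would be stacked with delta features across frames before being fed to the HMMs.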

SSR Configuration:

Data Corpus          Subjects         Vocabulary/Phrases
Isolated Words       Controls (n=9)   65 words and digits
Sequences of Words   Controls (n=4)   202 words, 1,200 sequences
Continuous Speech    Controls (n=5)   2,200 words, 1,200 phrases

Speech provides an attractive modality for human-machine interface (HMI) through automatic speech recognition (ASR). But ASR suffers from three primary limitations:

1) Performance degradation in presence of ambient noise

2) Limited ability for privacy/secrecy

3) Poor accessibility for those with speech disorders.

These limitations have motivated the need for alternative non-acoustic modalities of subvocal or silent speech recognition (SSR).

[Figure: Eight channels of sEMG (Ch1–Ch8, mV) recorded over 0–5 s while the phrase “I’m not ready yet” was silently mouthed, with the detected speech-activity intervals overlaid on each sEMG signal.]

2) Data Collection

• Subjects – n=18 healthy (11 males, 8 females, age range 18-42 y.o.)

• sEMG Sensors – 8 DE 2.1 sensors and Trigno™ Mini sensors (Delsys, Natick, USA)

• Protocol – Subjects silently mouthed words while sEMG activity was recorded from muscles of the face and neck.

Challenge 1

Discriminating isolated words using sEMG-based speech-related features

[Chart: Tests for Combinations of sEMG-based Features — WER (%) for each feature combination tested; the best combination is marked.]

Challenge 2

Tracking sequences of words from patterns of sEMG signals using grammatical context

[Chart: Tests for Different Grammar Models — WER (%) for Sentence Construction, Unrestricted NATO, NL Equivalence, and Unigram grammars (Subjects 1–3 and mean); the best grammar is marked.]

Challenge 3

Recognizing a large vocabulary of untrained words using phoneme-based models
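The reason phoneme-based models generalize to untrained words can be shown with a small sketch: scores come from a shared phoneme inventory plus a pronouncing dictionary, so any word with a known pronunciation can be scored even if it never appeared in training. A real system (like this one) models each phoneme in context with triphone HMMs; here each phoneme gets a single toy acoustic log-score, an assumption made purely for illustration.

```python
# Why phoneme models handle unseen words: a word's score is assembled
# from per-phoneme acoustic scores via a pronouncing dictionary, so no
# word-level training examples are required. Toy scores, not real models.

def word_logscore(word, phone_scores, lexicon):
    """Score a word as the sum of its phonemes' acoustic log-scores."""
    return sum(phone_scores[p] for p in lexicon[word])

def recognize(candidates, phone_scores, lexicon):
    """Pick the candidate word whose pronunciation best fits the signal."""
    return max(candidates, key=lambda w: word_logscore(w, phone_scores, lexicon))
```

A word absent from training (say, a new vocabulary item added to the lexicon) is scored exactly like a trained one, which is what lets the continuous-speech system handle a 2,200-word vocabulary with unseen words.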

[Chart: Tests for Gaussian Mixture Model Configurations — WER (%) vs. number of Gaussian mixtures (4, 8, 16, 32) for Subjects 1–3 and mean.]

[Chart: Tests for Different Feature Dimensions — WER (%) vs. HLDA feature dimension (20, 30, 40, 54, 154) for Subjects 1–3 and mean.]

• Natural Language (NL) Equivalence Grammar provided a larger range of linguistically correct English sentences and had the second lowest WER of 6.8%.

• Increasing to 16 Gaussian mixtures to model each HMM state decreased the WER to 11.3%.

• Reducing the HLDA feature dimension to 40 decreased the WER to 7.3%.
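The "16 Gaussian mixtures per HMM state" result refers to the state emission model: each state scores a feature vector under a weighted mixture of Gaussians. Below is a hedged sketch of evaluating that log-likelihood with diagonal covariances; the weights, means, and variances are toy values, not trained parameters.

```python
import math

# Sketch: log-likelihood of a feature vector under one HMM state's
# diagonal-covariance Gaussian mixture (the final system used 16
# mixtures per state). All parameters here are toy values.

def log_gauss_diag(x, mean, var):
    """Log-density of x under a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_loglik(x, weights, means, variances):
    """log( sum_k w_k * N(x; mu_k, Sigma_k) ), computed stably."""
    terms = [math.log(w) + log_gauss_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    top = max(terms)  # log-sum-exp trick avoids floating-point underflow
    return top + math.log(sum(math.exp(t - top) for t in terms))
```

More mixtures let a state fit multimodal feature distributions (e.g. articulation variability across repetitions), which is consistent with the WER dropping as mixtures increase, until the model overfits.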

Subject   Digits   Text Messages   Special Operations   Common Phrases   Mean WER
1         2.7      0.9             2.0                  0.0              1.4
2         15.4     15.9            8.0                  15.4             13.9
3         20.7     12.1            13.6                 5.2              12.9
4         18.2     10.3            10.6                 3.7              10.7
5         12.2     5.6             3.4                  1.2              5.6
Mean      13.8     9.0             7.5                  5.1              8.9
SD        7.0      5.8             4.9                  6.1              5.3

SSR System Performance (WER): see table above.

SSR Sensors: Trigno™ Quattro Facial Array

• Wireless/Bluetooth communication for mobile use

• Sensors – sEMG 4-sensor array (under development) worn on the face and neck

• Features – MFCCs

• Grammar – NL Equivalence

• Recognition Toolkit – KALDI

• Model – triphone HMMs, HLDA feature reduction, maximum likelihood linear regression (MLLR), subspace Gaussian mixture modelling (SGMM)
