guidedsourceseparationmeetsastrongasr backend: hitachi ... · guidedsourceseparationmeetsastrongasr...

Post on 25-Jul-2020

11 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

NT

Guided Source Separation Meets a Strong ASRBackend: Hitachi/Paderborn University Joint

Investigation for Dinner Party ASR

Naoyuki Kanda1, Christoph Boeddeker2, Jens Heitkaemper2,Yusuke Fujita1, Shota Horiguchi1, Kenji Nagamatsu1,

Reinhold Haeb-Umbach2

1Hitachi Ltd., Japan2Paderborn University, Germany

11:00, September 17, 2019

NT

Table of contents

1 Introduction/Motivation

2 Front-end: Guided Source Separation (GSS)

3 Back-end: Automatic Speech Recognition (ASR)

4 Simulation results

5 Conclusion

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 1 / 11

NT

CHiME-5 Dataset: Dinner party automatic speech recognition

Kitchen> 30min

Dining> 30min

Living> 30min

• 16+2+2 sessions with ca. 2 h• 4 participants in each scenario• 6 Kinect microphone arrays

Difficulties• Natural conversation• Overlap• No simulated data, only in-earmicrophone signals

• Realistic recording (e.g. devicefailure, lost samples)

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 2 / 11

NT

Motivation

Baseline81.1% (DEV) 73.3% (EVAL)

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 3 / 11

NT

Motivation

Baseline81.1% (DEV) 73.3% (EVAL)

Paderborn (GSS)61.7% (DEV) 69.0% (EVAL)Pure frontend contribution

Challeng

e

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 3 / 11

NT

Motivation

Baseline81.1% (DEV) 73.3% (EVAL)

Paderborn (GSS)61.7% (DEV) 69.0% (EVAL)Pure frontend contribution

Hitachi-JHU (ASR)54.0% (DEV) 50.6% (EVAL)

Main gain from speech recognizer

Challeng

e Challenge

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 3 / 11

NT

Motivation

Baseline81.1% (DEV) 73.3% (EVAL)

Paderborn (GSS)61.7% (DEV) 69.0% (EVAL)Pure frontend contribution

Hitachi-JHU (ASR)54.0% (DEV) 50.6% (EVAL)

Main gain from speech recognizer

Hitachi-PaderbornGSS Meets a Strong ASR Backend

Interaction: Do they profit from each other?Can we use external data to boost the performance?

Can we tune the system?

Challeng

e Challenge

Here Here

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 3 / 11

NT

Guided Source Separation (GSS) – Spatial Mixture Model

STFT

Speaker 1

Speaker 2

Noise

TF MasksE-Step

Who is activeat (t, f ) bin

M-StepCalculate speaker

directions

TFsource

activ

ity

Source

directions

Spatial Mixture Modelunsupervised on one utterance

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 4 / 11

NT

Guided Source Separation (GSS) – Spatial Mixture Model

STFT

Speaker 1

Speaker 2

Noise

TF MasksE-Step

Who is activeat (t, f ) bin

M-StepCalculate speaker

directions

TFsource

activ

ity

Source

directions

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 4 / 11

NT

Guided Source Separation (GSS) – Spatial Mixture Model

STFT

Speaker 1

Speaker 2

Noise

TF MasksE-Step

Who is activeat (t, f ) bin

M-StepCalculate speaker

directions

TFsource

activ

ity

needinitializationand canbe guided

Source

directions

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 4 / 11

NT

Guided Source Separation (GSS)

Speaker P05Speaker P06Speaker P07Speaker P08

Array 1

Array 6

...

24Ch

annels

STFT&

Dereverb.

SpatialMixtureModel

Beamformer

ASR

Init

N

Mask

Enhanced

AlignmentTime annotations

Guide

You thought I spent forty fivebucks on these? Mm- mm.

Context 0 s

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 5 / 11

NT

Guided Source Separation (GSS)

Speaker P05Speaker P06Speaker P07Speaker P08

Array 1

Array 6

...

24Ch

annels

STFT&

Dereverb.

SpatialMixtureModel

Beamformer

ASR

Init

N

Mask

Enhanced

AlignmentTime annotations

Guide

You thought I spent forty fivebucks on these? Mm- mm.

Context 0 s

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 5 / 11

NT

Guided Source Separation (GSS)

Speaker P05Speaker P06Speaker P07Speaker P08

Array 1

Array 6

...

24Ch

annels

STFT&

Dereverb.

SpatialMixtureModel

Beamformer

ASR

Init

N

Mask

Enhanced

AlignmentTime annotations

Guide

You thought I spent forty fivebucks on these? Mm- mm.

Context 0 s

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 5 / 11

NT

Guided Source Separation (GSS)

Speaker P05Speaker P06Speaker P07Speaker P08

Array 1

Array 6

...

24Ch

annels

STFT&

Dereverb.

SpatialMixtureModel

Beamformer

ASR

Init

N

Mask

Enhanced

AlignmentTime annotations

Guide

You thought I spent forty fivebucks on these? Mm- mm.

Context ±15 s Context 0 s

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 5 / 11

NT

Guided Source Separation (GSS)

Speaker P05Speaker P06Speaker P07Speaker P08

Array 1

Array 6

...

24Ch

annels

STFT&

Dereverb.

SpatialMixtureModel

Beamformer

ASR

Init

N

Mask

Enhanced

AlignmentTime annotations

Guide

You thought I spent forty fivebucks on these? Mm- mm.

Context ±15 s Context 0 s

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 5 / 11

NT

Guided Source Separation (GSS)

Speaker P05Speaker P06Speaker P07Speaker P08

Array 1

Array 6

...

24Ch

annels

STFT&

Dereverb.

SpatialMixtureModel

Beamformer

ASR

Init

N

Mask

Enhanced

AlignmentTime annotations

Guide

You thought I spent forty fivebucks on these? Mm- mm.

Context ±15 s Context 0 s

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 5 / 11

NT

Guided Source Separation (GSS)

Speaker P05Speaker P06Speaker P07Speaker P08

Array 1

Array 6

...

24Ch

annels

STFT&

Dereverb.

SpatialMixtureModel

Beamformer

ASR

Init

N

Mask

Enhanced

AlignmentTime annotations

Guide

You thought I spent forty fivebucks on these? Mm- mm.

Context ±15 s Context 0 s

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 5 / 11

NT

Automatic Speech Recognition (ASR)

STFT&

Dereverb.

Speaker-adaptedNeural

Beamformer

FBANK

MFCC

i-vector

1ch features

1ch- & 4chAcoustic Model

16.8%

log spec&

phase diff

4ch features

Decoder N-bestROVER

8.1%Hypothesis

Deduplication

1.3%

3-gram LM ForewardRNN-LM

BackwardRNN-LM

1.6%

4chaudiosignal

pseudolog-likelihood

N-best hypotheses

Hypotheses using different combination ofspeech enhancement, AM, and microphone arrays

HypothesisResult

Number with %indicates absolute WERimprovement by thatmodule for DEV set

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 6 / 11

NT

Automatic Speech Recognition (ASR)

STFT&

Dereverb.

Speaker-adaptedNeural

Beamformer

FBANK

MFCC

i-vector

1ch features

1ch- & 4chAcoustic Model

16.8%

log spec&

phase diff

4ch features

Decoder N-bestROVER

8.1%Hypothesis

Deduplication

1.3%

3-gram LM ForewardRNN-LM

BackwardRNN-LM

1.6%

4chaudiosignal

pseudolog-likelihood

N-best hypotheses

Hypotheses using different combination ofspeech enhancement, AM, and microphone arrays

HypothesisResult

Number with %indicates absolute WERimprovement by thatmodule for DEV set

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 6 / 11

NT

Automatic Speech Recognition (ASR)

STFT&

Dereverb.

Speaker-adaptedNeural

Beamformer

FBANK

MFCC

i-vector

1ch features

1ch- & 4chAcoustic Model

16.8%

log spec&

phase diff

4ch features

Decoder N-bestROVER

8.1%Hypothesis

Deduplication

1.3%

3-gram LM ForewardRNN-LM

BackwardRNN-LM

1.6%

4chaudiosignal

pseudolog-likelihood

N-best hypotheses

Hypotheses using different combination ofspeech enhancement, AM, and microphone arrays

HypothesisResult

Number with %indicates absolute WERimprovement by thatmodule for DEV set

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 6 / 11

NT

Automatic Speech Recognition (ASR)

STFT&

Dereverb.

Speaker-adaptedNeural

Beamformer

FBANK

MFCC

i-vector

1ch features

1ch- & 4chAcoustic Model

16.8%

log spec&

phase diff

4ch features

Decoder N-bestROVER

8.1%Hypothesis

Deduplication

1.3%

3-gram LM ForewardRNN-LM

BackwardRNN-LM

1.6%

4chaudiosignal

pseudolog-likelihood

N-best hypotheses

Hypotheses using different combination ofspeech enhancement, AM, and microphone arrays

HypothesisResult

Number with %indicates absolute WERimprovement by thatmodule for DEV set

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 6 / 11

NT

Automatic Speech Recognition (ASR)

STFT&

Dereverb.

Speaker-adaptedNeural

Beamformer

FBANK

MFCC

i-vector

1ch features

1ch- & 4chAcoustic Model

16.8%

log spec&

phase diff

4ch features

Decoder N-bestROVER

8.1%Hypothesis

Deduplication

1.3%

3-gram LM ForewardRNN-LM

BackwardRNN-LM

1.6%

4chaudiosignal

pseudolog-likelihood

N-best hypotheses

Hypotheses using different combination ofspeech enhancement, AM, and microphone arrays

HypothesisResult

Number with %indicates absolute WERimprovement by thatmodule for DEV set

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 6 / 11

NT

Automatic Speech Recognition (ASR)

STFT&

Dereverb.

Speaker-adaptedNeural

Beamformer

FBANK

MFCC

i-vector

1ch features

1ch- & 4chAcoustic Model

16.8%

log spec&

phase diff

4ch features

Decoder N-bestROVER

8.1%Hypothesis

Deduplication

1.3%

3-gram LM ForewardRNN-LM

BackwardRNN-LM

1.6%

4chaudiosignal

pseudolog-likelihood

N-best hypotheses

Hypotheses using different combination ofspeech enhancement, AM, and microphone arrays

HypothesisResult

Number with %indicates absolute WERimprovement by thatmodule for DEV set

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 6 / 11

NT

Automatic Speech Recognition (ASR)

STFT&

Dereverb.

Speaker-adaptedNeural

Beamformer

FBANK

MFCC

i-vector

1ch features

1ch- & 4chAcoustic Model

16.8%

log spec&

phase diff

4ch features

Decoder N-bestROVER

8.1%Hypothesis

Deduplication

1.3%

3-gram LM ForewardRNN-LM

BackwardRNN-LM

1.6%

4chaudiosignal

GSSpseudolog-likelihood

N-best hypotheses

Hypotheses using different combination ofspeech enhancement, AM, and microphone arrays

HypothesisResult

Number with %indicates absolute WERimprovement by thatmodule for DEV set

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 6 / 11

NT

ASR – Hypothesis Deduplication (HD)

Same words were sometimes recognized for overlapped utterances

um yeah

can I help with um yeah that looks goodcan I help with�����XXXXXum yeah that looks good

1. Compare confidencescores one-by-one

2. Delete lowerconfident words

Hypothesis deduplication:• Compensates weak source separation system• Indicator which source separation worked better

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 7 / 11

NT

ASR – Hypothesis Deduplication (HD)

Same words were sometimes recognized for overlapped utterances

um yeah

can I help with um yeah that looks goodcan I help with�����XXXXXum yeah that looks good

1. Compare confidencescores one-by-one

2. Delete lowerconfident words

Hypothesis deduplication:• Compensates weak source separation system• Indicator which source separation worked better

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 7 / 11

NT

ASR – Hypothesis Deduplication (HD)

Same words were sometimes recognized for overlapped utterances

um yeah

can I help with um yeah that looks goodcan I help with�����XXXXXum yeah that looks good

1. Compare confidencescores one-by-one

2. Delete lowerconfident words

Hypothesis deduplication:• Compensates weak source separation system• Indicator which source separation worked better

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 7 / 11

NT

Simulation results – WER – Challenge conform

DEV (%) EVAL (%)Baseline (single) 81.1 73.3USTC/iFlytek 45.0 46.1Proposed 39.94 41.64

Speech enhancement (multi array – 24 channels) DEV (%) EVAL (%)Hitachi-JHU 57.50Paderborn 49.21Paderborn + BF w/o context 46.54 51.99Paderborn + BF w/o context + ASR feedback 45.14 47.29

Model combination RNN-LM HD DEV (%) EVAL (%)45.14 47.29

X 41.67 43.70X X 39.94 41.64X X X 40.26 42.00

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 8 / 11

NT

Simulation results – WER – Challenge conform

DEV (%) EVAL (%)Baseline (single) 81.1 73.3USTC/iFlytek 45.0 46.1Proposed 39.94 41.64

Speech enhancement (multi array – 24 channels) DEV (%) EVAL (%)Hitachi-JHU 57.50Paderborn 49.21Paderborn + BF w/o context 46.54 51.99Paderborn + BF w/o context + ASR feedback 45.14 47.29

Model combination RNN-LM HD DEV (%) EVAL (%)45.14 47.29

X 41.67 43.70X X 39.94 41.64X X X 40.26 42.00

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 8 / 11

NT

Simulation results – WER – Challenge conform

DEV (%) EVAL (%)Baseline (single) 81.1 73.3USTC/iFlytek 45.0 46.1Proposed 39.94 41.64

Speech enhancement (multi array – 24 channels) DEV (%) EVAL (%)Hitachi-JHU 57.50Paderborn 49.21Paderborn + BF w/o context 46.54 51.99Paderborn + BF w/o context + ASR feedback 45.14 47.29

Model combination RNN-LM HD DEV (%) EVAL (%)45.14 47.29

X 41.67 43.70X X 39.94 41.64X X X 40.26 42.00

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 8 / 11

NT

Simulation results – WER – External data

ASR difficulty on CHiME-5: training data has only 40 h

AM Training Data DEV (%)CNN-TDNN-LSTM LibriSpeech (960h) 62.09Baseline TDNN CHiME-5 (40h) 58.39

CNN-TDNN-RBiLSTM CHiME-5 (40h) 45.14

Naive use of 960 hours does not improve performancePossible causes:

• Conversational/spontaneous speech• Enhanced speech not close enough to clean speech

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 9 / 11

NT

Simulation results – WER – External data

3-gram LM Training Data # of Words DEVPPL WER (%)

CHiME-5 (Baseline) 0.4M 155 45.14CHiME-5 + AMI 1.2M 140 45.10CHiME-5 + LibriSpeech 9.8M 134 44.49CHiME-5 + AMI + LibriSpeech 10.6M 131 44.21

Larger gain expected.

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 10 / 11

NT

Conclusion

+ GSS front-end can boost second best challenge system(Hitachi-JHU)

+ CHiME-5: New best WER (DEV 39.94% and EVAL 41.64%)+ Annotation fine-tuning with ASR+ Dropping context for beamforming◦ Taking more data for AM and LM needs more investigations

Thank you for listening!

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 11 / 11

NT

Conclusion

+ GSS front-end can boost second best challenge system(Hitachi-JHU)

+ CHiME-5: New best WER (DEV 39.94% and EVAL 41.64%)+ Annotation fine-tuning with ASR+ Dropping context for beamforming◦ Taking more data for AM and LM needs more investigations

Thank you for listening!

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 11 / 11

NT

CHiME 5 result table

Track Session Kitchen Dining Living Overall

SingleDev S02 62.33 52.82 44.62 52.07S09 51.87 54.02 48.09

Eval S01 60.07 40.88 60.94 47.31S21 49.09 38.14 42.67

MultipleDev S02 46.66 45.07 36.19 39.94S09 36.40 39.43 35.33

Eval S01 53.93 35.66 49.78 41.64S21 46.43 34.53 36.64

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 1 / 2

NT

Single array to multi array

Arrays Context in BFOn Off

1 58.05 58.133 52.30 48.816 49.21 46.54

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 2 / 2

top related