guidedsourceseparationmeetsastrongasr backend: hitachi ... · guidedsourceseparationmeetsastrongasr...

36
NT Guided Source Separation Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR Naoyuki Kanda 1 , Christoph Boeddeker 2 , Jens Heitkaemper 2 , Yusuke Fujita 1 , Shota Horiguchi 1 , Kenji Nagamatsu 1 , Reinhold Haeb-Umbach 2 1 Hitachi Ltd., Japan 2 Paderborn University, Germany 11:00, September 17, 2019

Upload: others

Post on 25-Jul-2020

11 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi ... · GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi/PaderbornUniversityJoint InvestigationforDinnerPartyASR NaoyukiKanda1,ChristophBoeddeker2,JensHeitkaemper2

NT

Guided Source Separation Meets a Strong ASRBackend: Hitachi/Paderborn University Joint

Investigation for Dinner Party ASR

Naoyuki Kanda1, Christoph Boeddeker2, Jens Heitkaemper2,Yusuke Fujita1, Shota Horiguchi1, Kenji Nagamatsu1,

Reinhold Haeb-Umbach2

1Hitachi Ltd., Japan2Paderborn University, Germany

11:00, September 17, 2019

Page 2: GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi ... · GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi/PaderbornUniversityJoint InvestigationforDinnerPartyASR NaoyukiKanda1,ChristophBoeddeker2,JensHeitkaemper2

NT

Table of contents

1 Introduction/Motivation

2 Front-end: Guided Source Separation (GSS)

3 Back-end: Automatic Speech Recognition (ASR)

4 Simulation results

5 Conclusion

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 1 / 11

Page 3: GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi ... · GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi/PaderbornUniversityJoint InvestigationforDinnerPartyASR NaoyukiKanda1,ChristophBoeddeker2,JensHeitkaemper2

NT

CHiME-5 Dataset: Dinner party automatic speech recognition

Kitchen> 30min

Dining> 30min

Living> 30min

• 16+2+2 sessions with ca. 2 h• 4 participants in each scenario• 6 Kinect microphone arrays

Difficulties• Natural conversation• Overlap• No simulated data, only in-earmicrophone signals

• Realistic recording (e.g. devicefailure, lost samples)

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 2 / 11

Page 4: GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi ... · GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi/PaderbornUniversityJoint InvestigationforDinnerPartyASR NaoyukiKanda1,ChristophBoeddeker2,JensHeitkaemper2

NT

Motivation

Baseline81.1% (DEV) 73.3% (EVAL)

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 3 / 11

Page 5: GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi ... · GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi/PaderbornUniversityJoint InvestigationforDinnerPartyASR NaoyukiKanda1,ChristophBoeddeker2,JensHeitkaemper2

NT

Motivation

Baseline81.1% (DEV) 73.3% (EVAL)

Paderborn (GSS)61.7% (DEV) 69.0% (EVAL)Pure frontend contribution

Challeng

e

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 3 / 11

Page 6: GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi ... · GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi/PaderbornUniversityJoint InvestigationforDinnerPartyASR NaoyukiKanda1,ChristophBoeddeker2,JensHeitkaemper2

NT

Motivation

Baseline81.1% (DEV) 73.3% (EVAL)

Paderborn (GSS)61.7% (DEV) 69.0% (EVAL)Pure frontend contribution

Hitachi-JHU (ASR)54.0% (DEV) 50.6% (EVAL)

Main gain from speech recognizer

Challeng

e Challenge

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 3 / 11

Page 7: GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi ... · GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi/PaderbornUniversityJoint InvestigationforDinnerPartyASR NaoyukiKanda1,ChristophBoeddeker2,JensHeitkaemper2

NT

Motivation

Baseline81.1% (DEV) 73.3% (EVAL)

Paderborn (GSS)61.7% (DEV) 69.0% (EVAL)Pure frontend contribution

Hitachi-JHU (ASR)54.0% (DEV) 50.6% (EVAL)

Main gain from speech recognizer

Hitachi-PaderbornGSS Meets a Strong ASR Backend

Interaction: Do they profit from each other?Can we use external data to boost the performance?

Can we tune the system?

Challeng

e Challenge

Here Here

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 3 / 11

Page 8: GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi ... · GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi/PaderbornUniversityJoint InvestigationforDinnerPartyASR NaoyukiKanda1,ChristophBoeddeker2,JensHeitkaemper2

NT

Guided Source Separation (GSS) – Spatial Mixture Model

STFT

Speaker 1

Speaker 2

Noise

TF MasksE-Step

Who is activeat (t, f ) bin

M-StepCalculate speaker

directions

TFsource

activ

ity

Source

directions

Spatial Mixture Modelunsupervised on one utterance

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 4 / 11

Page 9: GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi ... · GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi/PaderbornUniversityJoint InvestigationforDinnerPartyASR NaoyukiKanda1,ChristophBoeddeker2,JensHeitkaemper2

NT

Guided Source Separation (GSS) – Spatial Mixture Model

STFT

Speaker 1

Speaker 2

Noise

TF MasksE-Step

Who is activeat (t, f ) bin

M-StepCalculate speaker

directions

TFsource

activ

ity

Source

directions

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 4 / 11

Page 10: GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi ... · GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi/PaderbornUniversityJoint InvestigationforDinnerPartyASR NaoyukiKanda1,ChristophBoeddeker2,JensHeitkaemper2

NT

Guided Source Separation (GSS) – Spatial Mixture Model

STFT

Speaker 1

Speaker 2

Noise

TF MasksE-Step

Who is activeat (t, f ) bin

M-StepCalculate speaker

directions

TFsource

activ

ity

needinitializationand canbe guided

Source

directions

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 4 / 11

Page 11: GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi ... · GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi/PaderbornUniversityJoint InvestigationforDinnerPartyASR NaoyukiKanda1,ChristophBoeddeker2,JensHeitkaemper2

NT

Guided Source Separation (GSS)

Speaker P05Speaker P06Speaker P07Speaker P08

Array 1

Array 6

...

24Ch

annels

STFT&

Dereverb.

SpatialMixtureModel

Beamformer

ASR

Init

N

Mask

Enhanced

AlignmentTime annotations

Guide

You thought I spent forty fivebucks on these? Mm- mm.

Context 0 s

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 5 / 11

Page 12: GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi ... · GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi/PaderbornUniversityJoint InvestigationforDinnerPartyASR NaoyukiKanda1,ChristophBoeddeker2,JensHeitkaemper2

NT

Guided Source Separation (GSS)

Speaker P05Speaker P06Speaker P07Speaker P08

Array 1

Array 6

...

24Ch

annels

STFT&

Dereverb.

SpatialMixtureModel

Beamformer

ASR

Init

N

Mask

Enhanced

AlignmentTime annotations

Guide

You thought I spent forty fivebucks on these? Mm- mm.

Context 0 s

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 5 / 11

Page 13: GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi ... · GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi/PaderbornUniversityJoint InvestigationforDinnerPartyASR NaoyukiKanda1,ChristophBoeddeker2,JensHeitkaemper2

NT

Guided Source Separation (GSS)

Speaker P05Speaker P06Speaker P07Speaker P08

Array 1

Array 6

...

24Ch

annels

STFT&

Dereverb.

SpatialMixtureModel

Beamformer

ASR

Init

N

Mask

Enhanced

AlignmentTime annotations

Guide

You thought I spent forty fivebucks on these? Mm- mm.

Context 0 s

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 5 / 11

Page 14: GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi ... · GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi/PaderbornUniversityJoint InvestigationforDinnerPartyASR NaoyukiKanda1,ChristophBoeddeker2,JensHeitkaemper2

NT

Guided Source Separation (GSS)

Speaker P05Speaker P06Speaker P07Speaker P08

Array 1

Array 6

...

24Ch

annels

STFT&

Dereverb.

SpatialMixtureModel

Beamformer

ASR

Init

N

Mask

Enhanced

AlignmentTime annotations

Guide

You thought I spent forty fivebucks on these? Mm- mm.

Context ±15 s Context 0 s

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 5 / 11

Page 15: GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi ... · GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi/PaderbornUniversityJoint InvestigationforDinnerPartyASR NaoyukiKanda1,ChristophBoeddeker2,JensHeitkaemper2

NT

Guided Source Separation (GSS)

Speaker P05Speaker P06Speaker P07Speaker P08

Array 1

Array 6

...

24Ch

annels

STFT&

Dereverb.

SpatialMixtureModel

Beamformer

ASR

Init

N

Mask

Enhanced

AlignmentTime annotations

Guide

You thought I spent forty fivebucks on these? Mm- mm.

Context ±15 s Context 0 s

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 5 / 11

Page 16: GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi ... · GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi/PaderbornUniversityJoint InvestigationforDinnerPartyASR NaoyukiKanda1,ChristophBoeddeker2,JensHeitkaemper2

NT

Guided Source Separation (GSS)

Speaker P05Speaker P06Speaker P07Speaker P08

Array 1

Array 6

...

24Ch

annels

STFT&

Dereverb.

SpatialMixtureModel

Beamformer

ASR

Init

N

Mask

Enhanced

AlignmentTime annotations

Guide

You thought I spent forty fivebucks on these? Mm- mm.

Context ±15 s Context 0 s

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 5 / 11

Page 17: GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi ... · GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi/PaderbornUniversityJoint InvestigationforDinnerPartyASR NaoyukiKanda1,ChristophBoeddeker2,JensHeitkaemper2

NT

Guided Source Separation (GSS)

Speaker P05Speaker P06Speaker P07Speaker P08

Array 1

Array 6

...

24Ch

annels

STFT&

Dereverb.

SpatialMixtureModel

Beamformer

ASR

Init

N

Mask

Enhanced

AlignmentTime annotations

Guide

You thought I spent forty fivebucks on these? Mm- mm.

Context ±15 s Context 0 s

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 5 / 11

Page 18: GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi ... · GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi/PaderbornUniversityJoint InvestigationforDinnerPartyASR NaoyukiKanda1,ChristophBoeddeker2,JensHeitkaemper2

NT

Automatic Speech Recognition (ASR)

STFT&

Dereverb.

Speaker-adaptedNeural

Beamformer

FBANK

MFCC

i-vector

1ch features

1ch- & 4chAcoustic Model

16.8%

log spec&

phase diff

4ch features

Decoder N-bestROVER

8.1%Hypothesis

Deduplication

1.3%

3-gram LM ForewardRNN-LM

BackwardRNN-LM

1.6%

4chaudiosignal

pseudolog-likelihood

N-best hypotheses

Hypotheses using different combination ofspeech enhancement, AM, and microphone arrays

HypothesisResult

Number with %indicates absolute WERimprovement by thatmodule for DEV set

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 6 / 11

Page 19: GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi ... · GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi/PaderbornUniversityJoint InvestigationforDinnerPartyASR NaoyukiKanda1,ChristophBoeddeker2,JensHeitkaemper2

NT

Automatic Speech Recognition (ASR)

STFT&

Dereverb.

Speaker-adaptedNeural

Beamformer

FBANK

MFCC

i-vector

1ch features

1ch- & 4chAcoustic Model

16.8%

log spec&

phase diff

4ch features

Decoder N-bestROVER

8.1%Hypothesis

Deduplication

1.3%

3-gram LM ForewardRNN-LM

BackwardRNN-LM

1.6%

4chaudiosignal

pseudolog-likelihood

N-best hypotheses

Hypotheses using different combination ofspeech enhancement, AM, and microphone arrays

HypothesisResult

Number with %indicates absolute WERimprovement by thatmodule for DEV set

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 6 / 11

Page 20: GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi ... · GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi/PaderbornUniversityJoint InvestigationforDinnerPartyASR NaoyukiKanda1,ChristophBoeddeker2,JensHeitkaemper2

NT

Automatic Speech Recognition (ASR)

STFT&

Dereverb.

Speaker-adaptedNeural

Beamformer

FBANK

MFCC

i-vector

1ch features

1ch- & 4chAcoustic Model

16.8%

log spec&

phase diff

4ch features

Decoder N-bestROVER

8.1%Hypothesis

Deduplication

1.3%

3-gram LM ForewardRNN-LM

BackwardRNN-LM

1.6%

4chaudiosignal

pseudolog-likelihood

N-best hypotheses

Hypotheses using different combination ofspeech enhancement, AM, and microphone arrays

HypothesisResult

Number with %indicates absolute WERimprovement by thatmodule for DEV set

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 6 / 11

Page 21: GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi ... · GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi/PaderbornUniversityJoint InvestigationforDinnerPartyASR NaoyukiKanda1,ChristophBoeddeker2,JensHeitkaemper2

NT

Automatic Speech Recognition (ASR)

STFT&

Dereverb.

Speaker-adaptedNeural

Beamformer

FBANK

MFCC

i-vector

1ch features

1ch- & 4chAcoustic Model

16.8%

log spec&

phase diff

4ch features

Decoder N-bestROVER

8.1%Hypothesis

Deduplication

1.3%

3-gram LM ForewardRNN-LM

BackwardRNN-LM

1.6%

4chaudiosignal

pseudolog-likelihood

N-best hypotheses

Hypotheses using different combination ofspeech enhancement, AM, and microphone arrays

HypothesisResult

Number with %indicates absolute WERimprovement by thatmodule for DEV set

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 6 / 11

Page 22: GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi ... · GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi/PaderbornUniversityJoint InvestigationforDinnerPartyASR NaoyukiKanda1,ChristophBoeddeker2,JensHeitkaemper2

NT

Automatic Speech Recognition (ASR)

STFT&

Dereverb.

Speaker-adaptedNeural

Beamformer

FBANK

MFCC

i-vector

1ch features

1ch- & 4chAcoustic Model

16.8%

log spec&

phase diff

4ch features

Decoder N-bestROVER

8.1%Hypothesis

Deduplication

1.3%

3-gram LM ForewardRNN-LM

BackwardRNN-LM

1.6%

4chaudiosignal

pseudolog-likelihood

N-best hypotheses

Hypotheses using different combination ofspeech enhancement, AM, and microphone arrays

HypothesisResult

Number with %indicates absolute WERimprovement by thatmodule for DEV set

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 6 / 11

Page 23: GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi ... · GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi/PaderbornUniversityJoint InvestigationforDinnerPartyASR NaoyukiKanda1,ChristophBoeddeker2,JensHeitkaemper2

NT

Automatic Speech Recognition (ASR)

STFT&

Dereverb.

Speaker-adaptedNeural

Beamformer

FBANK

MFCC

i-vector

1ch features

1ch- & 4chAcoustic Model

16.8%

log spec&

phase diff

4ch features

Decoder N-bestROVER

8.1%Hypothesis

Deduplication

1.3%

3-gram LM ForewardRNN-LM

BackwardRNN-LM

1.6%

4chaudiosignal

pseudolog-likelihood

N-best hypotheses

Hypotheses using different combination ofspeech enhancement, AM, and microphone arrays

HypothesisResult

Number with %indicates absolute WERimprovement by thatmodule for DEV set

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 6 / 11

Page 24: GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi ... · GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi/PaderbornUniversityJoint InvestigationforDinnerPartyASR NaoyukiKanda1,ChristophBoeddeker2,JensHeitkaemper2

NT

Automatic Speech Recognition (ASR)

STFT&

Dereverb.

Speaker-adaptedNeural

Beamformer

FBANK

MFCC

i-vector

1ch features

1ch- & 4chAcoustic Model

16.8%

log spec&

phase diff

4ch features

Decoder N-bestROVER

8.1%Hypothesis

Deduplication

1.3%

3-gram LM ForewardRNN-LM

BackwardRNN-LM

1.6%

4chaudiosignal

GSSpseudolog-likelihood

N-best hypotheses

Hypotheses using different combination ofspeech enhancement, AM, and microphone arrays

HypothesisResult

Number with %indicates absolute WERimprovement by thatmodule for DEV set

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 6 / 11

Page 25: GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi ... · GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi/PaderbornUniversityJoint InvestigationforDinnerPartyASR NaoyukiKanda1,ChristophBoeddeker2,JensHeitkaemper2

NT

ASR – Hypothesis Deduplication (HD)

Same words were sometimes recognized for overlapped utterances

um yeah

can I help with um yeah that looks goodcan I help with�����XXXXXum yeah that looks good

1. Compare confidencescores one-by-one

2. Delete lowerconfident words

Hypothesis deduplication:• Compensates weak source separation system• Indicator which source separation worked better

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 7 / 11

Page 26: GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi ... · GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi/PaderbornUniversityJoint InvestigationforDinnerPartyASR NaoyukiKanda1,ChristophBoeddeker2,JensHeitkaemper2

NT

ASR – Hypothesis Deduplication (HD)

Same words were sometimes recognized for overlapped utterances

um yeah

can I help with um yeah that looks goodcan I help with�����XXXXXum yeah that looks good

1. Compare confidencescores one-by-one

2. Delete lowerconfident words

Hypothesis deduplication:• Compensates weak source separation system• Indicator which source separation worked better

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 7 / 11

Page 27: GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi ... · GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi/PaderbornUniversityJoint InvestigationforDinnerPartyASR NaoyukiKanda1,ChristophBoeddeker2,JensHeitkaemper2

NT

ASR – Hypothesis Deduplication (HD)

Same words were sometimes recognized for overlapped utterances

um yeah

can I help with um yeah that looks goodcan I help with�����XXXXXum yeah that looks good

1. Compare confidencescores one-by-one

2. Delete lowerconfident words

Hypothesis deduplication:• Compensates weak source separation system• Indicator which source separation worked better

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 7 / 11

Page 28: GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi ... · GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi/PaderbornUniversityJoint InvestigationforDinnerPartyASR NaoyukiKanda1,ChristophBoeddeker2,JensHeitkaemper2

NT

Simulation results – WER – Challenge conform

DEV (%) EVAL (%)Baseline (single) 81.1 73.3USTC/iFlytek 45.0 46.1Proposed 39.94 41.64

Speech enhancement (multi array – 24 channels) DEV (%) EVAL (%)Hitachi-JHU 57.50Paderborn 49.21Paderborn + BF w/o context 46.54 51.99Paderborn + BF w/o context + ASR feedback 45.14 47.29

Model combination RNN-LM HD DEV (%) EVAL (%)45.14 47.29

X 41.67 43.70X X 39.94 41.64X X X 40.26 42.00

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 8 / 11

Page 29: GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi ... · GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi/PaderbornUniversityJoint InvestigationforDinnerPartyASR NaoyukiKanda1,ChristophBoeddeker2,JensHeitkaemper2

NT

Simulation results – WER – Challenge conform

DEV (%) EVAL (%)Baseline (single) 81.1 73.3USTC/iFlytek 45.0 46.1Proposed 39.94 41.64

Speech enhancement (multi array – 24 channels) DEV (%) EVAL (%)Hitachi-JHU 57.50Paderborn 49.21Paderborn + BF w/o context 46.54 51.99Paderborn + BF w/o context + ASR feedback 45.14 47.29

Model combination RNN-LM HD DEV (%) EVAL (%)45.14 47.29

X 41.67 43.70X X 39.94 41.64X X X 40.26 42.00

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 8 / 11

Page 30: GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi ... · GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi/PaderbornUniversityJoint InvestigationforDinnerPartyASR NaoyukiKanda1,ChristophBoeddeker2,JensHeitkaemper2

NT

Simulation results – WER – Challenge conform

DEV (%) EVAL (%)Baseline (single) 81.1 73.3USTC/iFlytek 45.0 46.1Proposed 39.94 41.64

Speech enhancement (multi array – 24 channels) DEV (%) EVAL (%)Hitachi-JHU 57.50Paderborn 49.21Paderborn + BF w/o context 46.54 51.99Paderborn + BF w/o context + ASR feedback 45.14 47.29

Model combination RNN-LM HD DEV (%) EVAL (%)45.14 47.29

X 41.67 43.70X X 39.94 41.64X X X 40.26 42.00

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 8 / 11

Page 31: GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi ... · GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi/PaderbornUniversityJoint InvestigationforDinnerPartyASR NaoyukiKanda1,ChristophBoeddeker2,JensHeitkaemper2

NT

Simulation results – WER – External data

ASR difficulty on CHiME-5: training data has only 40 h

AM Training Data DEV (%)CNN-TDNN-LSTM LibriSpeech (960h) 62.09Baseline TDNN CHiME-5 (40h) 58.39

CNN-TDNN-RBiLSTM CHiME-5 (40h) 45.14

Naive use of 960 hours does not improve performancePossible causes:

• Conversational/spontaneous speech• Enhanced speech not close enough to clean speech

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 9 / 11

Page 32: GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi ... · GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi/PaderbornUniversityJoint InvestigationforDinnerPartyASR NaoyukiKanda1,ChristophBoeddeker2,JensHeitkaemper2

NT

Simulation results – WER – External data

3-gram LM Training Data # of Words DEVPPL WER (%)

CHiME-5 (Baseline) 0.4M 155 45.14CHiME-5 + AMI 1.2M 140 45.10CHiME-5 + LibriSpeech 9.8M 134 44.49CHiME-5 + AMI + LibriSpeech 10.6M 131 44.21

Larger gain expected.

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 10 / 11

Page 33: GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi ... · GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi/PaderbornUniversityJoint InvestigationforDinnerPartyASR NaoyukiKanda1,ChristophBoeddeker2,JensHeitkaemper2

NT

Conclusion

+ GSS front-end can boost second best challenge system(Hitachi-JHU)

+ CHiME-5: New best WER (DEV 39.94% and EVAL 41.64%)+ Annotation fine-tuning with ASR+ Dropping context for beamforming◦ Taking more data for AM and LM needs more investigations

Thank you for listening!

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 11 / 11

Page 34: GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi ... · GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi/PaderbornUniversityJoint InvestigationforDinnerPartyASR NaoyukiKanda1,ChristophBoeddeker2,JensHeitkaemper2

NT

Conclusion

+ GSS front-end can boost second best challenge system(Hitachi-JHU)

+ CHiME-5: New best WER (DEV 39.94% and EVAL 41.64%)+ Annotation fine-tuning with ASR+ Dropping context for beamforming◦ Taking more data for AM and LM needs more investigations

Thank you for listening!

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 11 / 11

Page 35: GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi ... · GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi/PaderbornUniversityJoint InvestigationforDinnerPartyASR NaoyukiKanda1,ChristophBoeddeker2,JensHeitkaemper2

NT

CHiME 5 result table

Track Session Kitchen Dining Living Overall

SingleDev S02 62.33 52.82 44.62 52.07S09 51.87 54.02 48.09

Eval S01 60.07 40.88 60.94 47.31S21 49.09 38.14 42.67

MultipleDev S02 46.66 45.07 36.19 39.94S09 36.40 39.43 35.33

Eval S01 53.93 35.66 49.78 41.64S21 46.43 34.53 36.64

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 1 / 2

Page 36: GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi ... · GuidedSourceSeparationMeetsaStrongASR Backend: Hitachi/PaderbornUniversityJoint InvestigationforDinnerPartyASR NaoyukiKanda1,ChristophBoeddeker2,JensHeitkaemper2

NT

Single array to multi array

Arrays Context in BFOn Off

1 58.05 58.133 52.30 48.816 49.21 46.54

Kanda, Boeddeker, Heitkaemper, ... GSS Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR 2 / 2