towards machine comprehension of spoken contentaliensunmin.github.io/aii_workshop/1st/speech and...
TRANSCRIPT
![Page 1: Towards Machine Comprehension of Spoken Contentaliensunmin.github.io/aii_workshop/1st/speech and deep.pdf · Machine Comprehension of Spoken Content 300 hrs multimedia is uploaded](https://reader035.vdocument.in/reader035/viewer/2022081522/5f028d407e708231d404d402/html5/thumbnails/1.jpg)
Towards Machine Comprehension
of Spoken ContentHung-yi Lee
李宏毅
![Page 2: Towards Machine Comprehension of Spoken Contentaliensunmin.github.io/aii_workshop/1st/speech and deep.pdf · Machine Comprehension of Spoken Content 300 hrs multimedia is uploaded](https://reader035.vdocument.in/reader035/viewer/2022081522/5f028d407e708231d404d402/html5/thumbnails/2.jpg)
Machine Comprehension of Spoken Content
300 hrs multimedia is uploaded per minute.(2015.01)
2163 courses on Coursera(today)
➢ We need machine to listen to the audio data,
understand it, and extract useful information for
humans.
➢ In these multimedia, the spoken part carries
very important information about the content.
➢ Nobody is able to go through the data.
![Page 3: Towards Machine Comprehension of Spoken Contentaliensunmin.github.io/aii_workshop/1st/speech and deep.pdf · Machine Comprehension of Spoken Content 300 hrs multimedia is uploaded](https://reader035.vdocument.in/reader035/viewer/2022081522/5f028d407e708231d404d402/html5/thumbnails/3.jpg)
Overview
Spoken Content
Text
SpeechRecognition
Spoken Content Retrieval
Key Term Extraction
Summarizationuser
Interaction
QuestionAnswering
Audio wordto vector
Talk to Humans
Help Humans
interactive spoken content retrieval
![Page 4: Towards Machine Comprehension of Spoken Contentaliensunmin.github.io/aii_workshop/1st/speech and deep.pdf · Machine Comprehension of Spoken Content 300 hrs multimedia is uploaded](https://reader035.vdocument.in/reader035/viewer/2022081522/5f028d407e708231d404d402/html5/thumbnails/4.jpg)
Speech Recognition
Spoken Content
Text
SpeechRecognition
[Liao, et al., ASRU 15]
hidden layer h1
hidden layer h2
W1
W2
F2(x, y; θ
2)
WL
speech signal
F1(x, y; θ
1)
y (phoneme label sequence)
(a ) u s e DNN p h o n e p o s te rio r a s a c o u s tic ve c to r
(b ) s tru c tu re d S VM (c ) s tru c tu re d DNN
Ψ(x,y)
hidden layer hL-1
hidden layer h1
hidden layer hL
W0,0
output layer
input layer
W0,L
feature extraction
a c b a
x (acoustic vector sequence)
[Ko, et al., ICASSP 17]
Acoustic Models Language Models
![Page 5: Towards Machine Comprehension of Spoken Contentaliensunmin.github.io/aii_workshop/1st/speech and deep.pdf · Machine Comprehension of Spoken Content 300 hrs multimedia is uploaded](https://reader035.vdocument.in/reader035/viewer/2022081522/5f028d407e708231d404d402/html5/thumbnails/5.jpg)
Summarization
Spoken Content
Text
SpeechRecognition
Summarization
[Yu, et al., SLT 16][Lu, et al., Interspeech 17]
![Page 6: Towards Machine Comprehension of Spoken Contentaliensunmin.github.io/aii_workshop/1st/speech and deep.pdf · Machine Comprehension of Spoken Content 300 hrs multimedia is uploaded](https://reader035.vdocument.in/reader035/viewer/2022081522/5f028d407e708231d404d402/html5/thumbnails/6.jpg)
Summarization
Audio File
to be summarized
This is the summary.
➢ Select the most informative segments to form a compact version
Extractive Summaries
…… deep learning is powerful …… …… ……
[Lee, et al., Interspeech 12][Lee, et al., ICASSP 13][Shiang, et al., Interspeech 13]
➢ Machine does not write summaries in its own words
![Page 7: Towards Machine Comprehension of Spoken Contentaliensunmin.github.io/aii_workshop/1st/speech and deep.pdf · Machine Comprehension of Spoken Content 300 hrs multimedia is uploaded](https://reader035.vdocument.in/reader035/viewer/2022081522/5f028d407e708231d404d402/html5/thumbnails/7.jpg)
Abstractive Summarization
• Now machine can do abstractive summary (write summaries in its own words)• Title generation: abstractive summary with one
sentence
Title 1
Title 2
Title 3
Training Data
title generated by machine
without hand-crafted rules
(in its own words)
![Page 8: Towards Machine Comprehension of Spoken Contentaliensunmin.github.io/aii_workshop/1st/speech and deep.pdf · Machine Comprehension of Spoken Content 300 hrs multimedia is uploaded](https://reader035.vdocument.in/reader035/viewer/2022081522/5f028d407e708231d404d402/html5/thumbnails/8.jpg)
Sequence-to-sequence
• Input: transcriptions of audio, output: title
ℎ1 ℎ2 ℎ3 ℎ4
RNN Encoder: read through the input
w1 w4w2 w3
transcriptions of audio from automatic speech recognition (ASR)
𝑧1 𝑧2 ……
……wA wB
RNN generator
![Page 9: Towards Machine Comprehension of Spoken Contentaliensunmin.github.io/aii_workshop/1st/speech and deep.pdf · Machine Comprehension of Spoken Content 300 hrs multimedia is uploaded](https://reader035.vdocument.in/reader035/viewer/2022081522/5f028d407e708231d404d402/html5/thumbnails/9.jpg)
Demo
•作者:葉政杰、周儒杰
• http://140.112.30.37:2401/
• https://www.youtube.com/watch?v=X3BapMl7Wv8
• From SONG TUYEN NEWS: https://www.youtube.com/channel/UC-P4mEcWZVrFfdZIuiODiTg
![Page 10: Towards Machine Comprehension of Spoken Contentaliensunmin.github.io/aii_workshop/1st/speech and deep.pdf · Machine Comprehension of Spoken Content 300 hrs multimedia is uploaded](https://reader035.vdocument.in/reader035/viewer/2022081522/5f028d407e708231d404d402/html5/thumbnails/10.jpg)
Key Term Extraction
Spoken Content
Text
SpeechRecognition
Key Term Extraction
Summarization
[Shen & Lee, Interspeech 16]
![Page 11: Towards Machine Comprehension of Spoken Contentaliensunmin.github.io/aii_workshop/1st/speech and deep.pdf · Machine Comprehension of Spoken Content 300 hrs multimedia is uploaded](https://reader035.vdocument.in/reader035/viewer/2022081522/5f028d407e708231d404d402/html5/thumbnails/11.jpg)
Speech Question Answering
Spoken Content
Text
SpeechRecognition
Key Term Extraction
Summarization
QuestionAnswering
![Page 12: Towards Machine Comprehension of Spoken Contentaliensunmin.github.io/aii_workshop/1st/speech and deep.pdf · Machine Comprehension of Spoken Content 300 hrs multimedia is uploaded](https://reader035.vdocument.in/reader035/viewer/2022081522/5f028d407e708231d404d402/html5/thumbnails/12.jpg)
Speech Question Answering
What is a possible origin of Venus’ clouds?
Speech Question Answering: Machine answers questions based on the information in spoken content
Gases released as a result of volcanic activity
![Page 13: Towards Machine Comprehension of Spoken Contentaliensunmin.github.io/aii_workshop/1st/speech and deep.pdf · Machine Comprehension of Spoken Content 300 hrs multimedia is uploaded](https://reader035.vdocument.in/reader035/viewer/2022081522/5f028d407e708231d404d402/html5/thumbnails/13.jpg)
New task for Machine Comprehension of Spoken Content
• TOEFL Listening Comprehension Test by Machine
Question: “ What is a possible origin of Venus’ clouds? ”
Audio Story:
Choices:
(A) gases released as a result of volcanic activity
(B) chemical reactions caused by high surface temperatures
(C) bursts of radio energy from the plane's surface
(D) strong winds that blow dust into the atmosphere
(The original story is 5 min long.)
![Page 14: Towards Machine Comprehension of Spoken Contentaliensunmin.github.io/aii_workshop/1st/speech and deep.pdf · Machine Comprehension of Spoken Content 300 hrs multimedia is uploaded](https://reader035.vdocument.in/reader035/viewer/2022081522/5f028d407e708231d404d402/html5/thumbnails/14.jpg)
New task for Machine Comprehension of Spoken Content
• TOEFL Listening Comprehension Test by Machine
“what is a possible origin of Venus‘ clouds?"
Question:
Audio Story:Neural
Network
4 Choices
e.g. (A)
answer
Using previous exams to train the network
ASR transcriptions
![Page 15: Towards Machine Comprehension of Spoken Contentaliensunmin.github.io/aii_workshop/1st/speech and deep.pdf · Machine Comprehension of Spoken Content 300 hrs multimedia is uploaded](https://reader035.vdocument.in/reader035/viewer/2022081522/5f028d407e708231d404d402/html5/thumbnails/15.jpg)
Model Architecture
“what is a possible origin of Venus‘ clouds?"
Question:
Question Semantics
…… It be quite possible that this be due to volcanic eruption because volcanic eruption often emit gas. If that be the case volcanism could very well be the root cause of Venus 's thick cloud cover. And also we have observe burst of radio energy from the planet 's surface. These burst be similar to what we see when volcano ……
Audio Story:
Speech Recognition
Semantic Analysis
Semantic Analysis
Attention
Answer
Select the choice most similar to the answer
Attention
The whole model learned end-to-end.
![Page 16: Towards Machine Comprehension of Spoken Contentaliensunmin.github.io/aii_workshop/1st/speech and deep.pdf · Machine Comprehension of Spoken Content 300 hrs multimedia is uploaded](https://reader035.vdocument.in/reader035/viewer/2022081522/5f028d407e708231d404d402/html5/thumbnails/16.jpg)
More Details
(A)
(A) (A) (A) (A)
(B) (B) (B)
![Page 17: Towards Machine Comprehension of Spoken Contentaliensunmin.github.io/aii_workshop/1st/speech and deep.pdf · Machine Comprehension of Spoken Content 300 hrs multimedia is uploaded](https://reader035.vdocument.in/reader035/viewer/2022081522/5f028d407e708231d404d402/html5/thumbnails/17.jpg)
Experimental ResultsA
ccu
racy
(%
)
random
Naïve approaches
Example Naïve approach:1. Find the paragraph containing most key terms in
the question.2. Select the choice containing most key terms in
that paragraph.
![Page 18: Towards Machine Comprehension of Spoken Contentaliensunmin.github.io/aii_workshop/1st/speech and deep.pdf · Machine Comprehension of Spoken Content 300 hrs multimedia is uploaded](https://reader035.vdocument.in/reader035/viewer/2022081522/5f028d407e708231d404d402/html5/thumbnails/18.jpg)
Experimental ResultsA
ccu
racy
(%
)
random
42.2% [Tseng, Shen, Lee, Lee, Interspeech’16]
Naïve approaches
48.8% [Fan, Hsu, Lee, Lee, SLT’16]
![Page 19: Towards Machine Comprehension of Spoken Contentaliensunmin.github.io/aii_workshop/1st/speech and deep.pdf · Machine Comprehension of Spoken Content 300 hrs multimedia is uploaded](https://reader035.vdocument.in/reader035/viewer/2022081522/5f028d407e708231d404d402/html5/thumbnails/19.jpg)
Analysis
• There are three types of questions
Type 3: Connecting Information➢Understanding Organization➢Connecting Content➢Making Inferences
![Page 20: Towards Machine Comprehension of Spoken Contentaliensunmin.github.io/aii_workshop/1st/speech and deep.pdf · Machine Comprehension of Spoken Content 300 hrs multimedia is uploaded](https://reader035.vdocument.in/reader035/viewer/2022081522/5f028d407e708231d404d402/html5/thumbnails/20.jpg)
Analysis
• There are three types of questions
Type 3: Pragmatic Understanding➢Understanding the Function of What Is Said➢Understanding the Speaker’s Attitude
![Page 21: Towards Machine Comprehension of Spoken Contentaliensunmin.github.io/aii_workshop/1st/speech and deep.pdf · Machine Comprehension of Spoken Content 300 hrs multimedia is uploaded](https://reader035.vdocument.in/reader035/viewer/2022081522/5f028d407e708231d404d402/html5/thumbnails/21.jpg)
Talk to Humans
Spoken Content
Text
SpeechRecognition
Key Term Extraction
Summarizationuser
Interaction
QuestionAnswering
Talk to Humans
![Page 22: Towards Machine Comprehension of Spoken Contentaliensunmin.github.io/aii_workshop/1st/speech and deep.pdf · Machine Comprehension of Spoken Content 300 hrs multimedia is uploaded](https://reader035.vdocument.in/reader035/viewer/2022081522/5f028d407e708231d404d402/html5/thumbnails/22.jpg)
Chat-bot
On-going project:➢Training by reinforcement learning➢Training by generative adversarial network (GAN)
Sequence-to-sequence learning from human conversation without hand-crafted rules.
電視影集、電影台詞Source of image: https://github.com/farizrahman4u/seq2seq
Positive/negative
Different answers
![Page 23: Towards Machine Comprehension of Spoken Contentaliensunmin.github.io/aii_workshop/1st/speech and deep.pdf · Machine Comprehension of Spoken Content 300 hrs multimedia is uploaded](https://reader035.vdocument.in/reader035/viewer/2022081522/5f028d407e708231d404d402/html5/thumbnails/23.jpg)
Demo - Towards Characterization
•作者:王耀賢
• https://github.com/yaushian/simple_sentiment_dialogue
• https://github.com/yaushian/personal-dialogue
![Page 24: Towards Machine Comprehension of Spoken Contentaliensunmin.github.io/aii_workshop/1st/speech and deep.pdf · Machine Comprehension of Spoken Content 300 hrs multimedia is uploaded](https://reader035.vdocument.in/reader035/viewer/2022081522/5f028d407e708231d404d402/html5/thumbnails/24.jpg)
Audio Word to Vector
Spoken Content
Text
SpeechRecognition
Spoken Content Retrieval
Key Term Extraction
Summarizationuser
Interaction
QuestionAnswering
Audio wordto vector
Talk to Humans
Help Humans
interactive spoken content retrieval
![Page 25: Towards Machine Comprehension of Spoken Contentaliensunmin.github.io/aii_workshop/1st/speech and deep.pdf · Machine Comprehension of Spoken Content 300 hrs multimedia is uploaded](https://reader035.vdocument.in/reader035/viewer/2022081522/5f028d407e708231d404d402/html5/thumbnails/25.jpg)
Typical Word to Vector
• Machine represents each word by a vector representing its meaning
• Learning from lots of text without supervision
dog
cat
rabbit
jumprun
flower
tree
![Page 26: Towards Machine Comprehension of Spoken Contentaliensunmin.github.io/aii_workshop/1st/speech and deep.pdf · Machine Comprehension of Spoken Content 300 hrs multimedia is uploaded](https://reader035.vdocument.in/reader035/viewer/2022081522/5f028d407e708231d404d402/html5/thumbnails/26.jpg)
Audio Word to Vector
• Machine represents each audio segment also by a vector
audio segment
vector
(word-level)
Learn from lots of audio without supervision
[Chung, et al., Interspeech 16)
![Page 27: Towards Machine Comprehension of Spoken Contentaliensunmin.github.io/aii_workshop/1st/speech and deep.pdf · Machine Comprehension of Spoken Content 300 hrs multimedia is uploaded](https://reader035.vdocument.in/reader035/viewer/2022081522/5f028d407e708231d404d402/html5/thumbnails/27.jpg)
Sequence-to-sequence Auto-encoder
audio segment
acoustic featuresx1 x2 x3 x4
RNN Encoder
vector
The vector we want
We use sequence-to-sequence auto-encoder here
The training is unsupervised.
![Page 28: Towards Machine Comprehension of Spoken Contentaliensunmin.github.io/aii_workshop/1st/speech and deep.pdf · Machine Comprehension of Spoken Content 300 hrs multimedia is uploaded](https://reader035.vdocument.in/reader035/viewer/2022081522/5f028d407e708231d404d402/html5/thumbnails/28.jpg)
Sequence-to-sequence Auto-encoder
RNN Generatorx1 x2 x3 x4
y1 y2 y3 y4
x1 x2 x3x4
RNN Encoder
audio segment
acoustic features
The RNN encoder and generator are jointly trained.
Input acoustic features
![Page 29: Towards Machine Comprehension of Spoken Contentaliensunmin.github.io/aii_workshop/1st/speech and deep.pdf · Machine Comprehension of Spoken Content 300 hrs multimedia is uploaded](https://reader035.vdocument.in/reader035/viewer/2022081522/5f028d407e708231d404d402/html5/thumbnails/29.jpg)
What does machine learn?
• Typical word to vector:
• Audio word to vector (phonetic information)
𝑉 𝑅𝑜𝑚𝑒 − 𝑉 𝐼𝑡𝑎𝑙𝑦 + 𝑉 𝐺𝑒𝑟𝑚𝑎𝑛𝑦 ≈ 𝑉 𝐵𝑒𝑟𝑙𝑖𝑛
𝑉 𝑘𝑖𝑛𝑔 − 𝑉 𝑞𝑢𝑒𝑒𝑛 + 𝑉 𝑎𝑢𝑛𝑡 ≈ 𝑉 𝑢𝑛𝑐𝑙𝑒
V( ) - V( ) + V( ) = V( )
GIRL PEARL PEARLS
V( ) - V( ) + V( ) = V( )
GIRLS
![Page 30: Towards Machine Comprehension of Spoken Contentaliensunmin.github.io/aii_workshop/1st/speech and deep.pdf · Machine Comprehension of Spoken Content 300 hrs multimedia is uploaded](https://reader035.vdocument.in/reader035/viewer/2022081522/5f028d407e708231d404d402/html5/thumbnails/30.jpg)
Demo
![Page 31: Towards Machine Comprehension of Spoken Contentaliensunmin.github.io/aii_workshop/1st/speech and deep.pdf · Machine Comprehension of Spoken Content 300 hrs multimedia is uploaded](https://reader035.vdocument.in/reader035/viewer/2022081522/5f028d407e708231d404d402/html5/thumbnails/31.jpg)
Application:Video Caption Generation
Visual + Audio
A girl is running.
A group of people is walking in the forest.
A group of people is knocked by a tree.
![Page 32: Towards Machine Comprehension of Spoken Contentaliensunmin.github.io/aii_workshop/1st/speech and deep.pdf · Machine Comprehension of Spoken Content 300 hrs multimedia is uploaded](https://reader035.vdocument.in/reader035/viewer/2022081522/5f028d407e708231d404d402/html5/thumbnails/32.jpg)
Demo
• Can machine describe what it see from video?
•作者: 莊舜博、楊棋宇、黃邦齊、萬家宏
MTK產學大聯盟
![Page 33: Towards Machine Comprehension of Spoken Contentaliensunmin.github.io/aii_workshop/1st/speech and deep.pdf · Machine Comprehension of Spoken Content 300 hrs multimedia is uploaded](https://reader035.vdocument.in/reader035/viewer/2022081522/5f028d407e708231d404d402/html5/thumbnails/33.jpg)
Next Step ……
• Audio word to vector with semantics
flower tree
dog
cat
cats
walk
walked
run
One day we can build all spoken language understanding applications directly from audio word to vector.
![Page 34: Towards Machine Comprehension of Spoken Contentaliensunmin.github.io/aii_workshop/1st/speech and deep.pdf · Machine Comprehension of Spoken Content 300 hrs multimedia is uploaded](https://reader035.vdocument.in/reader035/viewer/2022081522/5f028d407e708231d404d402/html5/thumbnails/34.jpg)
Concluding Remarks
Spoken Content
Text
SpeechRecognition
Spoken Content Retrieval
Key Term Extraction
Summarizationuser
Interaction
QuestionAnswering
Audio wordto vector
Talk to Humans
Help Humans
interactive spoken content retrieval