
Page 1:

www.nr.no

ASR, NLU, and chatbots

Pierre Lison

IN4080: Natural Language Processing (Fall 2019)

07.11.2019

Page 2:

Plan for today

► Automatic Speech Recognition (ASR)

► Natural Language Understanding (NLU)

► Chatbot models

Page 3:

Plan for today

► Automatic Speech Recognition (ASR)

► Natural Language Understanding (NLU)

► Chatbot models

Page 4:

Speech recognition

The spoken dialogue system pipeline:

User input speech signal (user utterance)
→ Speech recognition → recognition hypotheses
→ Language understanding → interpreted dialogue act
→ Dialogue management → intended response
→ Generation → utterance to synthesise
→ Speech synthesis → output speech signal (machine utterance), back to the user

Page 5:

A difficult problem!


Page 6:

The speech chain


[Denes and Pinson (1993), «The speech chain»]

Page 7:

Speech production

► Sounds are variations in air pressure

► How are they produced?

▪ An air supply: the lungs (we usually speak by breathing out)

▪ A sound source setting the air in motion (e.g. vibrating) in ways relevant to speech production: the larynx, in which the vocal folds are located

▪ A set of 3 filters modulating the sound: the pharynx, the oral tract (teeth, tongue, palate, lips, etc.) & the nasal tract

Page 8:

Speech production

Visualisation of the vocal tract via magnetic resonance imaging (MRI)

NB: A few languages also rely on sounds not produced by vibration of the vocal folds, such as click languages (e.g. the Khoisan family in southern Africa)

Page 9:

Speech perception

► A (speech) sound is a variation of air pressure

▪ This variation originates from the speaker's speech organs

▪ We can plot a wave showing the changes in air pressure over time (the zero value being the normal air pressure)

Example: zooming in on the part of a waveform between 1.126 and 1.157 s, we see about 4 cycles, which means a frequency of about 4/0.031 ≈ 129 Hz

Page 10:

Important measures

1. The fundamental frequency F0: the lowest frequency of the sound wave, corresponding to the speed of vibration of the vocal folds (between 85-180 Hz for male voices and 165-255 Hz for female voices)

2. The intensity: the signal power normalised to the human auditory threshold, measured in dB (decibels):

Intensity = 10 · log10( (1 / (N · P0²)) · Σᵢ x(tᵢ)² )

for a sample of N time points t1, ..., tN, where x(tᵢ) is the air pressure at time tᵢ and P0 is the human auditory threshold, 2 × 10⁻⁵ Pa

Note: the dB scale is logarithmic, not linear!
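The intensity formula above can be sketched in a few lines of Python (a minimal sketch; the function name and the toy sample values are invented for illustration):

```python
import math

def intensity_db(samples, p0=2e-5):
    """Intensity in dB: signal power normalised to the human
    auditory threshold p0 (2 x 10^-5 Pa), on a logarithmic scale."""
    n = len(samples)
    power = sum(x * x for x in samples) / n
    return 10 * math.log10(power / (p0 ** 2))

# A signal oscillating exactly at the auditory threshold sits at 0 dB:
print(intensity_db([2e-5, -2e-5, 2e-5, -2e-5]))  # → 0.0
```

A signal with ten times the threshold amplitude has a hundred times the power, i.e. 20 dB, which is one way to see the logarithmic character of the scale.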

Page 11:

Why are F0 and the intensity important?

► F0 correlates with the pitch of the voice, and the pitch movement over an utterance gives us its intonation

Example: «The ball is red» vs. «Is the ball red?»: an interrogative utterance typically has a rising intonation at the end

Page 12:

Why are F0 and the intensity important?

► The signal intensity corresponds to the loudness of the speech sound

[Figure: two waveforms, one with low intensity and one with high intensity]

Page 13:

Speech recognition task

► Given a sequence of acoustic observations O = o1, o2, o3, ..., om (e.g. one every 10 milliseconds)

► We wish to determine the sequence of words W = w1, w2, w3, ..., wn

Page 14:

Speech recognition task

► Goal: map the speech signal into a sequence of linguistic symbols (words or characters):

▪ O is a sequence of acoustic observations (e.g. one every 10 milliseconds)

▪ W is a (hidden) sequence of symbols

► Many sources of variation: speaker voice (and style), accents, ambient noise, etc.

Page 15:

Classical model

Using Bayes' rule, we can rewrite the most likely word sequence Ŵ as:

Ŵ = argmax_W P(W|O)
  = argmax_W P(O|W) P(W) / P(O)    (Bayes)
  = argmax_W P(O|W) P(W)           (P(O) constant for all W)

► Acoustic model P(O|W): determines the probability of the acoustic inputs O given the word sequence W

► Language model P(W): determines the probability of the word sequence W
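As a toy illustration of this argmax, here is a sketch in Python; the candidate transcriptions and their acoustic/language-model probabilities are invented numbers, not the output of real models:

```python
import math

# Candidate word sequences W with hypothetical model probabilities
# (the numbers below are invented for illustration only)
candidates = {
    "recognise speech": {"acoustic": 0.4, "lm": 0.05},
    "wreck a nice beach": {"acoustic": 0.5, "lm": 0.001},
}

def decode(candidates):
    # argmax_W P(O|W) * P(W), computed in log space to avoid
    # numerical underflow with long sequences
    return max(candidates,
               key=lambda w: math.log(candidates[w]["acoustic"])
                           + math.log(candidates[w]["lm"]))

print(decode(candidates))  # → recognise speech
```

Note how the language model overrules the slightly better acoustic score of the implausible word sequence.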

Page 16:

Modern neural models

► The best performing ASR systems are deep, end-to-end neural architectures

▪ Less dependent on external resources (such as pronunciation dictionaries)

▪ No need to preprocess acoustic inputs

► Too complex / time-demanding to review them in this course

▪ But they rely on the same building blocks as other NNs: convolutions, recurrence, (self-)attention, etc.

Page 17:

ASR Performance

[Figure from Bhuvana Ramabhadran's presentation at Interspeech 2018]

Page 18:

ASR evaluation

► Standard evaluation metric: Word Error Rate (WER)

▪ Measures how much the utterance hypothesis h differs from the «gold standard» transcription t*

► Relies on a minimum edit distance between h and t*, counting the number of word substitutions, insertions and deletions:

WER = (Substitutions + Insertions + Deletions) / (number of words in t*)

Page 19:

ASR evaluation

Gold standard transcription: yes can you now rotate this triangle
ASR hypothesis:              yes can you not rotate this triangle there
→ 1 Sub + 1 Ins

Gold standard transcription: there is five and
ASR hypothesis:              the size and
→ 2 Sub + 1 Del
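The edit-distance computation behind these counts can be sketched with standard dynamic programming (a minimal sketch; production scoring tools additionally output the alignment itself):

```python
def wer(gold, hyp):
    """Word Error Rate: minimum edit distance between the hypothesis
    and the gold transcript (word substitutions, insertions and
    deletions), divided by the number of gold words."""
    g, h = gold.split(), hyp.split()
    # d[i][j] = edit distance between g[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(g) + 1)]
    for i in range(len(g) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(g) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if g[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(g)][len(h)] / len(g)

# First example above: 1 substitution + 1 insertion over 7 gold words
print(wer("yes can you now rotate this triangle",
          "yes can you not rotate this triangle there"))  # → 2/7 ≈ 0.286
```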

Page 20:

Disfluencies

► Speakers construct their utterances «as they go», incrementally

▪ Production leaves a trace in the speech stream

► Presence of multiple disfluencies:

▪ Pauses, fillers («øh», «um», «liksom»)

▪ Fragments

▪ Repetitions («the the ball»), corrections («the ball err mug»), repairs («the bu/ ball»)

Page 21:

Disfluencies


Internal structure of a disfluency:

► reparandum: part of the utterance which is edited out

► interregnum: (optional) filler

► repair: part meant to replace the reparandum

[Shriberg (1994), «Preliminaries to a Theory of Speech Disfluencies», Ph.D. thesis]

Page 22:

Disfluencies

► Repetitions

► Corrections

► Rephrasing/completion

Page 23:

More complex disfluencies


så gikk jeg e flytta vi til Nesøya da begynte jeg på barneskolen der og så har jeg gått på Landøya ungdomsskole # som ligger ## rett over broa nesten # rett med Holmen
(approx. translation: «then I went uh we moved to Nesøya then I started primary school there and then I went to Landøya lower secondary school # which is ## almost right across the bridge # right by Holmen»)

jeg gikk på Bryn e skole som lå rett ved der vi bodde den gangen e barneskole videre på Hauger ungdomsskole
(approx. translation: «I went to Bryn uh school which was right next to where we lived back then uh primary school then on to Hauger lower secondary school»)

da hadde alle hele på skolen skulle liksom # spise julegrøt og det va- det var bare en mandel og da var jeg som fikk den da ble skikkelig sånn " wow # jeg har fått den " ble så glad
(approx. translation: «then everyone the whole school was like # going to eat Christmas porridge and there wa- there was only one almond and then it was me who got it then [I] got really like "wow # I got it" [I] was so happy»)

[«Norske talespråkskorpus - Oslo delen» (NoTa), collected and annotated by the Tekstlaboratoriet]

Page 24:

Disfluency detection

► We can build a neural network to automatically detect disfluencies

▪ Sequence labelling task: determine for each token whether it is disfluent or not

► Open question: what should we do with the result? Edit out the disfluencies?

[Paria Jamshid Lou et al. (2018), «Disfluency detection using auto-correlational neural networks», EMNLP]

Page 25:

Plan for today

► Automatic Speech Recognition (ASR)

► Natural Language Understanding (NLU)

► Chatbot models

Page 26:

NLU

► The goal of NLU is to determine the content (or intent) of an utterance

► The output may be:

▪ A categorical label (or a set of labels)

▪ A list of recognised slots

Example: «Show me morning flights from Boston to San Francisco on Tuesday»

Page 27:

Intent classification

► Can be framed as a text classification task

▪ Requires dialogue data annotated with intents

▪ Categories may be derived from domain-independent taxonomies (e.g. dialogue acts)

▪ ... or from ad-hoc taxonomies for your domain

► Pick your favourite machine learning model

[Figure: an LSTM runs over the tokens «How are you ?», and a softmax layer on top produces the output label]
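As a toy stand-in for the LSTM classifier in the figure, here is a minimal bag-of-words intent classifier in plain Python; the training utterances and the intent labels («greeting», «find_flight») are invented for illustration:

```python
from collections import Counter

# Toy annotated dialogue data (utterances and labels invented)
train = [
    ("hi there", "greeting"),
    ("hello how are you", "greeting"),
    ("show me flights to boston", "find_flight"),
    ("i want a morning flight on tuesday", "find_flight"),
]

# One bag-of-words "centroid" per intent label
centroids = {}
for text, label in train:
    centroids.setdefault(label, Counter()).update(text.split())

def classify(utterance):
    # Score each intent by word overlap with its centroid;
    # Counter returns 0 for unseen words, so no smoothing needed here
    words = utterance.split()
    return max(centroids,
               key=lambda lab: sum(centroids[lab][w] for w in words))

print(classify("show me a flight on tuesday"))  # → find_flight
```

Any stronger model (naive Bayes, logistic regression, the LSTM from the figure) slots into the same train-then-classify interface.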

Page 28:

Intent classification

► How to take the context into account?

▪ Simple approach: prepend the previous dialogue history (up to a limit) to the input, e.g.:

Hi Pierre! <turn> Hi Alex! <turn> How are you?

▪ More advanced: view intent classification as a sequence labelling task (at the utterance level) and find the most likely sequence of intents for a given dialogue

Page 29:

Slot filling

► Goal: find slots with a user-provided value in the utterance

► The slots are domain-specific

▪ And so are the ontologies listing all possible values for each slot

Example: «Show me morning flights from Boston to San Francisco on Tuesday»

Page 30:

Slot filling

Popular approach: use (again) a sequence labelling approach, for instance with BIO tags

[illustration from D. Jurafsky]
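For the example utterance, a BIO encoding could look as follows (a sketch: the slot names DEPART_TIME, FROM_CITY, TO_CITY and DEPART_DATE are illustrative, not a fixed standard):

```python
# BIO labelling of the example utterance: B- opens a slot span,
# I- continues it, O marks tokens outside any slot
tokens = ("Show me morning flights from Boston "
          "to San Francisco on Tuesday").split()
tags = ["O", "O", "B-DEPART_TIME", "O", "O", "B-FROM_CITY",
        "O", "B-TO_CITY", "I-TO_CITY", "O", "B-DEPART_DATE"]

assert len(tokens) == len(tags)  # exactly one tag per token
for tok, tag in zip(tokens, tags):
    print(f"{tok}\t{tag}")
```

The B/I distinction lets the tagger keep multi-word values such as «San Francisco» together as a single slot.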

Page 31:

Reference resolution

► Another important NLU task (especially in situated systems) is reference resolution

▪ e.g. «Pick up the box on your left»

► Reference resolution is the process of finding which entities are referred to by specific linguistic expressions

Page 32:

Reference resolution

Some terminology:

► A linguistic expression used to perform reference is called a referring expression

► The entity that is referred to is called the referent

Example: «Pierre» and «the IN4080 teacher» are two referring expressions for the same referent (coreference); a name like «Pierre» may also match several possible referents (referential ambiguity)

Page 33:

Reference resolution

► Reference resolution usually relies on a discourse model containing the set of entities that can be referred to

▪ As well as their relationships with one another

▪ The discourse model continuously changes during the interaction (entities come and go, become more or less focused, etc.)

► In situated systems, the discourse model also contains objects or events in the shared environment

Page 34:

Reference resolution

► Various features can be used to resolve references:

▪ Grammatical agreement (number, person, gender)

▪ Saliency (recency of mention, visual salience, etc.)

▪ Semantic constraints

► Based on these features and annotated training data, one can then train a classifier

▪ Binary classification: given a referring expression A and a candidate referent B, the classifier determines whether A refers to B

▪ Any supervised learning algorithm will do

Page 35:

Final remarks on NLU

► How to extract meaning out of each utterance is often highly domain-dependent

▪ Which information is relevant for your system?

▪ What kind of data do you have?

► Rule-based approaches are still quite popular

▪ As are hybrid approaches combining rules with ML

► Alternatively: use neural models to generate utterance-level embeddings instead of predefined categories (intents, slots, etc.)

Page 36:

Plan for today

► Automatic Speech Recognition (ASR)

► Natural Language Understanding (NLU)

► Chatbot models

Page 37:

Chatbots: main approaches

► Rule-based models

▪ Pro: fine-grained control over the interaction

▪ Con: difficult to build, scale and maintain

► Corpus-based models (Information-Retrieval models and Sequence-to-Sequence models)

▪ Pro: better coverage & robustness

▪ Con: need training data!

Page 38:

Rule-based models

► Pattern-action rules

► For instance:

[example from D. Jurafsky]
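A minimal sketch of such pattern-action rules in Python (the patterns and response templates below are invented, in the style of ELIZA):

```python
import re

# Each rule pairs a regex pattern with a response template;
# \1 in the template is filled with the matched group
rules = [
    (re.compile(r"i need (.*)", re.I), r"Why do you need \1?"),
    (re.compile(r"i am (.*)", re.I), r"How long have you been \1?"),
]

def respond(utterance, default="Please go on."):
    for pattern, template in rules:
        m = pattern.search(utterance)
        if m:
            return m.expand(template)
    return default  # fallback when no pattern fires

print(respond("I need a vacation"))  # → Why do you need a vacation?
```

Real rule-based systems stack hundreds of such rules, with priorities and pronoun-swapping, which is exactly what makes them hard to scale and maintain.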

Page 39:

IR models

► Alternatively, one can adopt a data-driven approach and learn how to respond to the user based on a dialogue corpus

► Key idea:

▪ Given a user input q, find the utterance t in the dialogue corpus that is most similar to q

▪ Then return as response the utterance r following t in the corpus

Page 40:

IR models

► How to determine which utterance is «most similar» to the actual user utterance?

▪ Cosine similarity over some vectors

▪ The vectors can be TF-IDF weighted words

▪ Or utterance-level embeddings
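A minimal IR-chatbot sketch using cosine similarity over raw word counts (TF-IDF weighting is omitted for brevity, and the toy dialogue corpus is invented):

```python
import math
from collections import Counter

# Toy dialogue corpus: a list of consecutive utterances
corpus = [
    "hello there",
    "hi how can i help you",
    "what time do you open",
    "we open at nine in the morning",
]

def vec(text):
    return Counter(text.split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def respond(query):
    # Find the corpus utterance t most similar to the query q ...
    best = max(range(len(corpus) - 1),
               key=lambda i: cosine(vec(query), vec(corpus[i])))
    # ... and return the utterance r that follows t in the corpus
    return corpus[best + 1]

print(respond("when do you open"))  # → we open at nine in the morning
```

Swapping the count vectors for TF-IDF weights or utterance embeddings changes only `vec`, not the retrieval logic.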

Page 41:

Dual encoders

► Training data: (input, response) pairs associated with 1 (correct response) or 0 (wrong response)

► Scoring: dot product between input embeddings and response embeddings (after a linear transform), followed by a sigmoid to get a score in [0, 1]

► At runtime, search for the response with the maximum output score

[Lison & Bibauw (2017), «Not all dialogues are created equal: Instance Weighting for neural conversational models», SIGDIAL]
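The scoring step can be sketched as follows (a toy illustration with hand-picked 2-dimensional embeddings and an identity transform; a real dual encoder learns both the embeddings and the transform from data):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def score(input_emb, response_emb, M):
    # Linear transform of the response embedding: M @ response_emb
    transformed = [sum(M[i][j] * response_emb[j]
                       for j in range(len(response_emb)))
                   for i in range(len(M))]
    # Dot product with the input embedding, squashed to [0, 1]
    dot = sum(a * b for a, b in zip(input_emb, transformed))
    return sigmoid(dot)

M = [[1.0, 0.0], [0.0, 1.0]]  # identity transform for the toy example
print(score([1.0, 0.5], [0.9, 0.4], M) > 0.5)    # aligned pair → True
print(score([1.0, 0.5], [-0.9, -0.4], M) < 0.5)  # misaligned → True
```

At runtime, this score is computed against every candidate response and the maximum wins, as on the slide.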

Page 42:

Seq2seq models

► Sequence-to-sequence models generate a response token by token

▪ Akin to machine translation

▪ Advantage: can generate «creative» responses not observed in the corpus

► Two steps:

▪ First «encode» the input with e.g. an LSTM

▪ Then «decode» the output token by token

Page 43:

Seq2seq models

[Image borrowed from «Deep Learning for Chatbots: Part 1»]

NB: state-of-the-art seq2seq models use an attention mechanism (not shown here) above the recurrent layer

Page 44:

Seq2seq models

► Interesting models for dialogue research

► But:

▪ Difficult to «control» (hard to know in advance what the system may generate)

▪ Lack of diversity in the responses (they often stick to generic answers: «I don't know», etc.)

▪ Getting a seq2seq model that works reasonably well takes a lot of time (and data)

[Li, Jiwei, et al. (2015), «A diversity-promoting objective function for neural conversation models», ACL]

Page 45:

Plan for today

► Automatic Speech Recognition (ASR)

► Natural Language Understanding (NLU)

► Chatbot models

► Short recap

Page 46:

Summary

► ASR maps acoustic observations O = o1, o2, o3, ..., om to a recognition hypothesis W = w1, w2, w3, ..., wn

► Deep NNs have boosted ASR performance

▪ But ASR is not yet a «solved problem», especially for resource-poor languages and non-standard voices/acoustic environments

▪ The Word Error Rate metric is used for evaluation

► Disfluencies abound in spoken language

Page 47:

Summary

► Natural language understanding (NLU) is an umbrella term for models designed to extract «content» from utterances:

▪ Intent recognition

▪ Slot filling

▪ Reference resolution

► Rule-based or ML-based approaches

▪ Or a combination of both!

Page 48:

Summary

► Chatbots can be either:

▪ Handcrafted, with pattern-response rules

▪ Data-driven, based on a dialogue corpus

► IR models: find the «best» response among the ones in the corpus

► Seq2seq models: generate a response token by token