
Page 1:

www.nr.no

ASR, NLU, and chatbots

Pierre Lison

IN4080: Natural Language Processing (Fall 2019)

07.11.2019

Page 2:

Plan for today

► Automatic Speech Recognition (ASR)

► Natural Language Understanding (NLU)

► Chatbot models

Page 3:

Plan for today

► Automatic Speech Recognition (ASR)

► Natural Language Understanding (NLU)

► Chatbot models

Page 4:

Speech recognition

The spoken dialogue system pipeline:

User input speech signal (user utterance)
→ Speech recognition → recognition hypotheses
→ Language understanding → interpreted dialogue act
→ Dialogue management → intended response
→ Generation → utterance to synthesise
→ Speech synthesis → output speech signal (machine utterance), back to the user

Page 5:

A difficult problem!


Page 6:

The speech chain


[Denes and Pinson (1993), «The speech chain»]

Page 7:

Speech production

► Sounds are variations in air pressure

► How are they produced?

▪ An air supply: the lungs (we usually speak by breathing out)

▪ A sound source setting the air in motion (e.g. vibrating) in ways relevant to speech production: the larynx, in which the vocal folds are located

▪ A set of 3 filters modulating the sound: the pharynx, the oral tract (teeth, tongue, palate, lips, etc.) & the nasal tract

Page 8:

Speech production

Visualisation of the vocal tract via magnetic resonance imaging (MRI)

NB: A few languages also rely on sounds not produced by vibration of the vocal folds, such as click languages (e.g. the Khoisan family in southern Africa)

Page 9:

Speech perception

► A (speech) sound is a variation of air pressure

▪ This variation originates from the speaker's speech organs

▪ We can plot a wave showing the changes in air pressure over time (the zero value being the normal air pressure)

Example: zooming in on the part of a waveform between 1.126 and 1.157 s, we see about 4 cycles, which means a frequency of about 4/0.031 ≈ 129 Hz

Page 10:

Important measures

1. The fundamental frequency F0: the lowest frequency of the sound wave, corresponding to the speed of vibration of the vocal folds (between 85-180 Hz for male voices and 165-255 Hz for female voices)

2. The intensity: the signal power normalised to the human auditory threshold, measured in dB (decibels):

Intensity = 10 · log10( (1 / (N · P0²)) · Σᵢ x(tᵢ)² )

for a sample of N time points t1, ..., tN, where x(tᵢ) is the air pressure at time tᵢ and P0 is the human auditory threshold, 2 × 10⁻⁵ Pa

Note: the dB scale is logarithmic, not linear!
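The intensity formula above can be sketched in a few lines of Python (a minimal sketch; the function name and the toy sample values are invented for illustration):

```python
import math

def intensity_db(samples, p0=2e-5):
    """Intensity in dB: signal power normalised to the human
    auditory threshold p0 (2 x 10^-5 Pa), on a logarithmic scale."""
    n = len(samples)
    power = sum(x * x for x in samples) / n
    return 10 * math.log10(power / (p0 ** 2))

# A signal oscillating exactly at the auditory threshold sits at 0 dB:
print(intensity_db([2e-5, -2e-5, 2e-5, -2e-5]))  # → 0.0
```

A signal with ten times the threshold amplitude has a hundred times the power, i.e. 20 dB, which is one way to see the logarithmic character of the scale.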

Page 11:

Why are F0 and the intensity important?

► F0 correlates with the pitch of the voice, and the pitch movement over an utterance gives us its intonation

Example: «The ball is red» vs. «Is the ball red?»: an interrogative utterance typically has a rising intonation at the end

Page 12:

Why are F0 and the intensity important?

► The signal intensity corresponds to the loudness of the speech sound

[Figure: two waveforms, one with low intensity and one with high intensity]

Page 13:

Speech recognition task

► Given a sequence of acoustic observations O = o1, o2, o3, ..., om (e.g. one every 10 milliseconds)

► We wish to determine the sequence of words W = w1, w2, w3, ..., wn

Page 14:

Speech recognition task

► Goal: map the speech signal into a sequence of linguistic symbols (words or characters):

▪ O is a sequence of acoustic observations (e.g. one every 10 milliseconds)

▪ W is a (hidden) sequence of symbols

► Many sources of variation: speaker voice (and style), accents, ambient noise, etc.

Page 15:

Classical model

Using Bayes' rule, we can rewrite the most likely word sequence Ŵ as:

Ŵ = argmax_W P(W|O)
  = argmax_W P(O|W) P(W) / P(O)    (Bayes)
  = argmax_W P(O|W) P(W)           (P(O) constant for all W)

► Acoustic model P(O|W): determines the probability of the acoustic inputs O given the word sequence W

► Language model P(W): determines the probability of the word sequence W
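As a toy illustration of this argmax, here is a sketch in Python; the candidate transcriptions and their acoustic/language-model probabilities are invented numbers, not the output of real models:

```python
import math

# Candidate word sequences W with hypothetical model probabilities
# (the numbers below are invented for illustration only)
candidates = {
    "recognise speech": {"acoustic": 0.4, "lm": 0.05},
    "wreck a nice beach": {"acoustic": 0.5, "lm": 0.001},
}

def decode(candidates):
    # argmax_W P(O|W) * P(W), computed in log space to avoid
    # numerical underflow with long sequences
    return max(candidates,
               key=lambda w: math.log(candidates[w]["acoustic"])
                           + math.log(candidates[w]["lm"]))

print(decode(candidates))  # → recognise speech
```

Note how the language model overrules the slightly better acoustic score of the implausible word sequence.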

Page 16:

Modern neural models

► The best performing ASR systems are deep, end-to-end neural architectures

▪ Less dependent on external resources (such as pronunciation dictionaries)

▪ No need to preprocess acoustic inputs

► Too complex / time-demanding to review them in this course

▪ But they rely on the same building blocks as other NNs: convolutions, recurrence, (self-)attention, etc.

Page 17:

ASR Performance

[Figure from Bhuvana Ramabhadran's presentation at Interspeech 2018]

Page 18:

ASR evaluation

► Standard evaluation metric: Word Error Rate (WER)

▪ Measures how much the utterance hypothesis h differs from the «gold standard» transcription t*

► Relies on a minimum edit distance between h and t*, counting the number of word substitutions, insertions and deletions:

WER = (Substitutions + Insertions + Deletions) / (number of words in t*)

Page 19:

ASR evaluation

Gold standard transcription: yes can you now rotate this triangle
ASR hypothesis:              yes can you not rotate this triangle there
→ 1 Sub + 1 Ins

Gold standard transcription: there is five and
ASR hypothesis:              the size and
→ 2 Sub + 1 Del
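The edit-distance computation behind these counts can be sketched with standard dynamic programming (a minimal sketch; production scoring tools additionally output the alignment itself):

```python
def wer(gold, hyp):
    """Word Error Rate: minimum edit distance between the hypothesis
    and the gold transcript (word substitutions, insertions and
    deletions), divided by the number of gold words."""
    g, h = gold.split(), hyp.split()
    # d[i][j] = edit distance between g[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(g) + 1)]
    for i in range(len(g) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(g) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if g[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(g)][len(h)] / len(g)

# First example above: 1 substitution + 1 insertion over 7 gold words
print(wer("yes can you now rotate this triangle",
          "yes can you not rotate this triangle there"))  # → 2/7 ≈ 0.286
```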

Page 20:

Disfluencies

► Speakers construct their utterances «as they go», incrementally

▪ Production leaves a trace in the speech stream

► Presence of multiple disfluencies:

▪ Pauses, fillers («øh», «um», «liksom»)

▪ Fragments

▪ Repetitions («the the ball»), corrections («the ball err mug»), repairs («the bu/ ball»)

Page 21:

Disfluencies


Internal structure of a disfluency:

► reparandum: part of the utterance which is edited out

► interregnum: (optional) filler

► repair: part meant to replace the reparandum

[Shriberg (1994), «Preliminaries to a Theory of Speech Disfluencies», Ph.D. thesis]

Page 22:

Disfluencies

► Repetitions

► Corrections

► Rephrasing/completion

Page 23:

More complex disfluencies


så gikk jeg e flytta vi til Nesøya da begynte jeg på barneskolen der og så har jeg gått på Landøya ungdomsskole # som ligger ## rett over broa nesten # rett med Holmen
(approx. translation: «then I went uh we moved to Nesøya then I started primary school there and then I went to Landøya lower secondary school # which is ## almost right across the bridge # right by Holmen»)

jeg gikk på Bryn e skole som lå rett ved der vi bodde den gangen e barneskole videre på Hauger ungdomsskole
(approx. translation: «I went to Bryn uh school which was right next to where we lived back then uh primary school then on to Hauger lower secondary school»)

da hadde alle hele på skolen skulle liksom # spise julegrøt og det va- det var bare en mandel og da var jeg som fikk den da ble skikkelig sånn " wow # jeg har fått den " ble så glad
(approx. translation: «then everyone the whole school was like # going to eat Christmas porridge and there wa- there was only one almond and then it was me who got it then [I] got really like "wow # I got it" [I] was so happy»)

[«Norske talespråkskorpus - Oslo delen» (NoTa), collected and annotated by the Tekstlaboratoriet]

Page 24:

Disfluency detection

► We can build a neural network to automatically detect disfluencies

▪ Sequence labelling task: determine for each token whether it is disfluent or not

► Open question: what should we do with the result? Edit out the disfluencies?

[Paria Jamshid Lou et al. (2018), «Disfluency detection using auto-correlational neural networks», EMNLP]

Page 25:

Plan for today

► Automatic Speech Recognition (ASR)

► Natural Language Understanding (NLU)

► Chatbot models

Page 26:

NLU

► The goal of NLU is to determine the content (or intent) of an utterance

► The output may be:

▪ A categorical label (or a set of labels)

▪ A list of recognised slots

Example: «Show me morning flights from Boston to San Francisco on Tuesday»

Page 27:

Intent classification

► Can be framed as a text classification task

▪ Requires dialogue data annotated with intents

▪ Categories may be derived from domain-independent taxonomies (e.g. dialogue acts)

▪ ... or from ad-hoc taxonomies for your domain

► Pick your favourite machine learning model

[Figure: an LSTM runs over the tokens «How are you ?», and a softmax layer on top produces the output label]
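As a toy stand-in for the LSTM classifier in the figure, here is a minimal bag-of-words intent classifier in plain Python; the training utterances and the intent labels («greeting», «find_flight») are invented for illustration:

```python
from collections import Counter

# Toy annotated dialogue data (utterances and labels invented)
train = [
    ("hi there", "greeting"),
    ("hello how are you", "greeting"),
    ("show me flights to boston", "find_flight"),
    ("i want a morning flight on tuesday", "find_flight"),
]

# One bag-of-words "centroid" per intent label
centroids = {}
for text, label in train:
    centroids.setdefault(label, Counter()).update(text.split())

def classify(utterance):
    # Score each intent by word overlap with its centroid;
    # Counter returns 0 for unseen words, so no smoothing needed here
    words = utterance.split()
    return max(centroids,
               key=lambda lab: sum(centroids[lab][w] for w in words))

print(classify("show me a flight on tuesday"))  # → find_flight
```

Any stronger model (naive Bayes, logistic regression, the LSTM from the figure) slots into the same train-then-classify interface.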

Page 28:

Intent classification

► How to take the context into account?

▪ Simple approach: prepend the previous dialogue history (up to a limit) to the input, e.g.:

Hi Pierre! <turn> Hi Alex! <turn> How are you?

▪ More advanced: view intent classification as a sequence labelling task (at the utterance level) and find the most likely sequence of intents for a given dialogue

Page 29:

Slot filling

► Goal: find slots with a user-provided value in the utterance

► The slots are domain-specific

▪ And so are the ontologies listing all possible values for each slot

Example: «Show me morning flights from Boston to San Francisco on Tuesday»

Page 30:

Slot filling

Popular approach: use (again) a sequence labelling approach, for instance with BIO tags

[illustration from D. Jurafsky]
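For the example utterance, a BIO encoding could look as follows (a sketch: the slot names DEPART_TIME, FROM_CITY, TO_CITY and DEPART_DATE are illustrative, not a fixed standard):

```python
# BIO labelling of the example utterance: B- opens a slot span,
# I- continues it, O marks tokens outside any slot
tokens = ("Show me morning flights from Boston "
          "to San Francisco on Tuesday").split()
tags = ["O", "O", "B-DEPART_TIME", "O", "O", "B-FROM_CITY",
        "O", "B-TO_CITY", "I-TO_CITY", "O", "B-DEPART_DATE"]

assert len(tokens) == len(tags)  # exactly one tag per token
for tok, tag in zip(tokens, tags):
    print(f"{tok}\t{tag}")
```

The B/I distinction lets the tagger keep multi-word values such as «San Francisco» together as a single slot.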

Page 31:

Reference resolution

► Another important NLU task (especially in situated systems) is reference resolution

▪ e.g. «Pick up the box on your left»

► Reference resolution is the process of finding which entities are referred to by specific linguistic expressions

Page 32:

Reference resolution

Some terminology:

► A linguistic expression used to perform reference is called a referring expression

► The entity that is referred to is called the referent

Example: «Pierre» and «the IN4080 teacher» are two referring expressions for the same referent (coreference); a name like «Pierre» may also match several possible referents (referential ambiguity)

Page 33:

Reference resolution

► Reference resolution usually relies on a discourse model containing the set of entities that can be referred to

▪ As well as their relationships with one another

▪ The discourse model continuously changes during the interaction (entities come and go, become more or less focused, etc.)

► In situated systems, the discourse model also contains objects or events in the shared environment

Page 34:

Reference resolution

► Various features can be used to resolve references:

▪ Grammatical agreement (number, person, gender)

▪ Saliency (recency of mention, visual salience, etc.)

▪ Semantic constraints

► Based on these features and annotated training data, one can then train a classifier

▪ Binary classification: given a referring expression A and a candidate referent B, the classifier determines whether A refers to B

▪ Any supervised learning algorithm will do

Page 35:

Final remarks on NLU

► How to extract meaning out of each utterance is often highly domain-dependent

▪ Which information is relevant for your system?

▪ What kind of data do you have?

► Rule-based approaches are still quite popular

▪ As are hybrid approaches combining rules with ML

► Alternatively: use neural models to generate utterance-level embeddings instead of predefined categories (intents, slots, etc.)

Page 36:

Plan for today

► Automatic Speech Recognition (ASR)

► Natural Language Understanding (NLU)

► Chatbot models

Page 37:

Chatbots: main approaches

► Rule-based models

▪ Pro: fine-grained control over the interaction

▪ Con: difficult to build, scale and maintain

► Corpus-based models (Information-Retrieval models and Sequence-to-Sequence models)

▪ Pro: better coverage & robustness

▪ Con: need training data!

Page 38:

Rule-based models

► Pattern-action rules

► For instance:

[example from D. Jurafsky]
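A minimal sketch of such pattern-action rules in Python (the patterns and response templates below are invented, in the style of ELIZA):

```python
import re

# Each rule pairs a regex pattern with a response template;
# \1 in the template is filled with the matched group
rules = [
    (re.compile(r"i need (.*)", re.I), r"Why do you need \1?"),
    (re.compile(r"i am (.*)", re.I), r"How long have you been \1?"),
]

def respond(utterance, default="Please go on."):
    for pattern, template in rules:
        m = pattern.search(utterance)
        if m:
            return m.expand(template)
    return default  # fallback when no pattern fires

print(respond("I need a vacation"))  # → Why do you need a vacation?
```

Real rule-based systems stack hundreds of such rules, with priorities and pronoun-swapping, which is exactly what makes them hard to scale and maintain.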

Page 39:

IR models

► Alternatively, one can adopt a data-driven approach and learn how to respond to the user based on a dialogue corpus

► Key idea:

▪ Given a user input q, find the utterance t in the dialogue corpus that is most similar to q

▪ Then return as response the utterance r following t in the corpus

Page 40:

IR models

► How to determine which utterance is «most similar» to the actual user utterance?

▪ Cosine similarity over some vectors

▪ The vectors can be TF-IDF weighted words

▪ Or utterance-level embeddings
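A minimal IR-chatbot sketch using cosine similarity over raw word counts (TF-IDF weighting is omitted for brevity, and the toy dialogue corpus is invented):

```python
import math
from collections import Counter

# Toy dialogue corpus: a list of consecutive utterances
corpus = [
    "hello there",
    "hi how can i help you",
    "what time do you open",
    "we open at nine in the morning",
]

def vec(text):
    return Counter(text.split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def respond(query):
    # Find the corpus utterance t most similar to the query q ...
    best = max(range(len(corpus) - 1),
               key=lambda i: cosine(vec(query), vec(corpus[i])))
    # ... and return the utterance r that follows t in the corpus
    return corpus[best + 1]

print(respond("when do you open"))  # → we open at nine in the morning
```

Swapping the count vectors for TF-IDF weights or utterance embeddings changes only `vec`, not the retrieval logic.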

Page 41:

Dual encoders

► Training data: (input, response) pairs associated with 1 (correct response) or 0 (wrong response)

► Scoring: dot product between input embeddings and response embeddings (after a linear transform), followed by a sigmoid to get a score in [0, 1]

► At runtime, search for the response with the maximum output score

[Lison & Bibauw (2017), «Not all dialogues are created equal: Instance Weighting for neural conversational models», SIGDIAL]
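The scoring step can be sketched as follows (a toy illustration with hand-picked 2-dimensional embeddings and an identity transform; a real dual encoder learns both the embeddings and the transform from data):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def score(input_emb, response_emb, M):
    # Linear transform of the response embedding: M @ response_emb
    transformed = [sum(M[i][j] * response_emb[j]
                       for j in range(len(response_emb)))
                   for i in range(len(M))]
    # Dot product with the input embedding, squashed to [0, 1]
    dot = sum(a * b for a, b in zip(input_emb, transformed))
    return sigmoid(dot)

M = [[1.0, 0.0], [0.0, 1.0]]  # identity transform for the toy example
print(score([1.0, 0.5], [0.9, 0.4], M) > 0.5)    # aligned pair → True
print(score([1.0, 0.5], [-0.9, -0.4], M) < 0.5)  # misaligned → True
```

At runtime, this score is computed against every candidate response and the maximum wins, as on the slide.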

Page 42:

Seq2seq models

► Sequence-to-sequence models generate a response token by token

▪ Akin to machine translation

▪ Advantage: can generate «creative» responses not observed in the corpus

► Two steps:

▪ First «encode» the input with e.g. an LSTM

▪ Then «decode» the output token by token

Page 43:

Seq2seq models

[Image borrowed from «Deep Learning for Chatbots: Part 1»]

NB: state-of-the-art seq2seq models use an attention mechanism (not shown here) above the recurrent layer

Page 44:

Seq2seq models

► Interesting models for dialogue research

► But:

▪ Difficult to «control» (hard to know in advance what the system may generate)

▪ Lack of diversity in the responses (they often stick to generic answers: «I don't know», etc.)

▪ Getting a seq2seq model that works reasonably well takes a lot of time (and data)

[Li, Jiwei, et al. (2015), «A diversity-promoting objective function for neural conversation models», ACL]

Page 45:

Plan for today

► Automatic Speech Recognition (ASR)

► Natural Language Understanding (NLU)

► Chatbot models

► Short recap

Page 46:

Summary

► ASR maps acoustic observations O = o1, o2, o3, ..., om to a recognition hypothesis W = w1, w2, w3, ..., wn

► Deep NNs have boosted ASR performance

▪ But ASR is not yet a «solved problem», especially for resource-poor languages and non-standard voices/acoustic environments

▪ The Word Error Rate metric is used for evaluation

► Disfluencies abound in spoken language

Page 47:

Summary

► Natural language understanding (NLU) is an umbrella term for models designed to extract «content» from utterances:

▪ Intent recognition

▪ Slot filling

▪ Reference resolution

► Rule-based or ML-based approaches

▪ Or a combination of both!

Page 48:

Summary

► Chatbots can be either:

▪ Handcrafted, with pattern-response rules

▪ Data-driven, based on a dialogue corpus

► IR models: find the «best» response among the ones in the corpus

► Seq2seq models: generate a response token by token