NYAI #5 - Fun With Neural Nets by Jason Yosinski

MEETUP #5: Neural Nets (Jason Yosinski) & ML for Production (Ken Sanford)


TRANSCRIPT

Page 1: NYAI #5 - Fun With Neural Nets by Jason Yosinski

MEETUP #5: Neural Nets (Jason Yosinski) & ML for Production (Ken Sanford)

Page 2: NYAI #5 - Fun With Neural Nets by Jason Yosinski

Fun with Neural Nets

NYAI meetup, 24 August 2016, Jason Yosinski

Original slides available under Creative Commons Attribution-ShareAlike 3.0

Geometric Intelligence

Page 3: NYAI #5 - Fun With Neural Nets by Jason Yosinski

Neural nets start working

1950 1960 1970 1980 1990 2000 2010 2020 …

Progress in AI

Page 4: NYAI #5 - Fun With Neural Nets by Jason Yosinski

Neural nets start working

1950 1960 1970 1980 1990 2000 2010 2020 …

Progress in AI

Chen et al., 2014

SMALL-FOOTPRINT KEYWORD SPOTTING USING DEEP NEURAL NETWORKS

Guoguo Chen*1, Carolina Parada2, Georg Heigold2

1 Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD; 2 Google Inc., Mountain View, CA

[email protected] [email protected] [email protected]

ABSTRACT

Our application requires a keyword spotting system with a small memory footprint, low computational cost, and high precision. To meet these requirements, we propose a simple approach based on deep neural networks. A deep neural network is trained to directly predict the keyword(s) or subword units of the keyword(s) followed by a posterior handling method producing a final confidence score. Keyword recognition results achieve 45% relative improvement with respect to a competitive Hidden Markov Model-based system, while performance in the presence of babble noise shows 39% relative improvement.

Index Terms— Deep Neural Network, Keyword Spotting, Embedded Speech Recognition

1. INTRODUCTION

Thanks to the rapid development of smartphones and tablets, interacting with technology using voice is becoming commonplace. For example, Google offers the ability to search by voice [1] on Android devices and Apple’s iOS devices are equipped with a conversational assistant named Siri. These products allow a user to tap a device and then speak a query or a command.

We are interested in enabling users to have a fully hands-free experience by developing a system that listens continuously for specific keywords to initiate voice input. This could be especially useful in situations like driving. The proposed system must be highly accurate, low-latency, small-footprint, and run in computationally constrained environments such as modern mobile devices. Running the system on the device avoids the latency and power implications of connecting to the server for recognition.

Keyword Spotting (KWS) aims at detecting predefined keywords in an audio stream, and it is a potential technique to provide the desired hands-free interface. There is an extensive literature in KWS, although most of the proposed methods are not suitable for low-latency applications in computationally constrained environments. For example, several KWS systems [2, 3, 4] assume offline processing of the audio using large vocabulary continuous speech recognition systems (LVCSR) to generate rich lattices. In this case, their task focuses on efficient indexing and search for keywords in the lattices. These systems are often used to search large databases of audio content. We focus instead on detecting keywords in the audio stream without any latency.

A commonly used technique for keyword spotting is the Keyword/Filler Hidden Markov Model (HMM) [5, 6, 7, 8, 9]. Despite being initially proposed over two decades ago, it remains highly competitive. In this generative approach, an HMM is trained

*The author performed the work as a summer intern at Google, MTV.

for each keyword, and a filler HMM is trained from the non-keyword segments of the speech signal (fillers). At runtime, these systems require Viterbi decoding, which can be computationally expensive depending on the HMM topology. Other recent work explores discriminative models for keyword spotting based on large-margin formulation [10, 11] or recurrent neural networks [12, 13]. These systems show improvement over the HMM approach but require processing of the entire utterance to find the optimal keyword region or take information from a long time span to predict the entire keyword, increasing detection latency.

We propose a simple discriminative KWS approach based on deep neural networks that is appropriate for mobile devices. We refer to it as Deep KWS. A deep neural network is trained to directly predict the keyword(s) or subword units of the keyword(s) followed by a posterior handling method producing a final confidence score. In contrast with the HMM approach, this system does not require a sequence search algorithm (decoding), leading to a significantly simpler implementation, reduced runtime computation, and smaller memory footprint. It also makes a decision every 10 ms, minimizing latency. We show that the Deep KWS system outperforms a standard HMM based system on both clean and noisy test sets, even when a smaller amount of data is used for training.

We describe our DNN based KWS framework in Section 2, and the baseline HMM based KWS system in Section 3. The experimental setup, results and some discussion follow in Section 4. Section 5 closes with the conclusions.

2. DEEP KWS SYSTEM

The proposed Deep KWS framework is illustrated in Figure 1. The framework consists of three major components: (i) a feature extraction module, (ii) a deep neural network, and (iii) a posterior handling module. The feature extraction module (i) performs voice-activity detection and generates a vector of features every frame (10 ms). These features are stacked using the left and right context to create …

Fig. 1. Framework of Deep KWS system, components from left to right: (i) Feature Extraction, (ii) Deep Neural Network, (iii) Posterior Handling

Speech recognition, natural language conversation
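
The two stages around the DNN are easy to mock up. Below is a minimal Python sketch of frame stacking and posterior handling, assuming 10 ms feature frames; the context widths, smoothing window, and moving-average scheme are illustrative stand-ins, not the paper's exact settings.

```python
import numpy as np

def stack_context(frames, left=10, right=5):
    """Stack each 10 ms feature frame with `left` past and `right` future
    frames, mirroring the feature-extraction stage described above."""
    T, d = frames.shape
    padded = np.pad(frames, ((left, right), (0, 0)), mode="edge")
    return np.stack([padded[t:t + left + right + 1].ravel() for t in range(T)])

def keyword_confidence(posteriors, smooth=30):
    """Posterior handling: smooth the per-frame keyword posteriors from the
    DNN and emit one running confidence score per 10 ms frame."""
    kernel = np.ones(smooth) / smooth
    smoothed = np.apply_along_axis(
        lambda p: np.convolve(p, kernel, mode="same"), 0, posteriors)
    return smoothed.max(axis=1)  # best smoothed keyword posterior per frame
```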

Page 5: NYAI #5 - Fun With Neural Nets by Jason Yosinski


Page 6: NYAI #5 - Fun With Neural Nets by Jason Yosinski


Reinforcement Learning

Silver et al., 2016

Page 7: NYAI #5 - Fun With Neural Nets by Jason Yosinski


Page 8: NYAI #5 - Fun With Neural Nets by Jason Yosinski


Page 9: NYAI #5 - Fun With Neural Nets by Jason Yosinski

Not just perceiving the world, but also generating…

Page 10: NYAI #5 - Fun With Neural Nets by Jason Yosinski

Robot Gait Discovery

Page 11: NYAI #5 - Fun With Neural Nets by Jason Yosinski

Hand-Coded Gait

Page 12: NYAI #5 - Fun With Neural Nets by Jason Yosinski

Fixed Shallow Topology, Learned Parameters

Page 13: NYAI #5 - Fun With Neural Nets by Jason Yosinski

Learned Deep Topology, Learned Parameters

Page 14: NYAI #5 - Fun With Neural Nets by Jason Yosinski

Learned Deep Topology, Learned Parameters

Page 15: NYAI #5 - Fun With Neural Nets by Jason Yosinski

Learned Deep Topology, Learned Parameters

9x faster than human-designed gait

Page 16: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 17: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 18: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 19: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 20: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 21: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 22: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 23: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 24: NYAI #5 - Fun With Neural Nets by Jason Yosinski

Lion

Krizhevsky et al. 2012

AlexNet

Lion

Recipe for understanding:
• architecture

5 convolutional layers, 3 FC layers

Page 25: NYAI #5 - Fun With Neural Nets by Jason Yosinski


Recipe for understanding:
• architecture
• dataset (big: 250b)


Page 26: NYAI #5 - Fun With Neural Nets by Jason Yosinski


Page 27: NYAI #5 - Fun With Neural Nets by Jason Yosinski


ImageNet, Deng et al. 2009

Page 28: NYAI #5 - Fun With Neural Nets by Jason Yosinski


Example ImageNet classes: jaguar, gibbon, great white shark, water bottle, golden retriever, orangutan, fireboat, bubble, tobacco shop, ambulance, cowboy hat, mixing bowl

Page 29: NYAI #5 - Fun With Neural Nets by Jason Yosinski


Page 30: NYAI #5 - Fun With Neural Nets by Jason Yosinski


Recipe for understanding:
• architecture
• dataset (big: 250b)
• parameters (big: 60m)


? ? ?
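
As a concrete anchor for the recipe, here is a hedged PyTorch sketch of the five-conv, three-FC AlexNet-style architecture the slides name. Channel sizes follow Krizhevsky et al. 2012, and the fully connected layers hold most of the roughly 60M parameters; this is an illustration, not the authors' training code.

```python
import torch.nn as nn

# AlexNet-style stack: 5 convolutional layers, then 3 fully connected layers.
alexnet_like = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),  # FC layers: the bulk of ~60M params
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),                    # 1000 ImageNet classes
)
```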

Page 31: NYAI #5 - Fun With Neural Nets by Jason Yosinski

< DeepVis Toolbox demo >

Code at: http://yosinski.com/

Page 32: NYAI #5 - Fun With Neural Nets by Jason Yosinski

Lion

Recipe for understanding:
• architecture
• dataset (big: 250b)
• parameters (big: 60m)

Page 33: NYAI #5 - Fun With Neural Nets by Jason Yosinski

See also: Erhan et al., 2009; Szegedy et al., 2013.

Recipe for understanding:
• architecture
• dataset (big: 250b)
• parameters (big: 60m)

Page 34: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 35: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 36: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 37: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 38: NYAI #5 - Fun With Neural Nets by Jason Yosinski

[Diagram (similar to this): an image represented as a mapping from pixel coordinates (x, y) to color values (r, g, b)]
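
The diagram shows an image represented as a function from pixel coordinates to colors, the kind of indirect encoding used to synthesize images later in the talk. A minimal sketch of such a coordinate-to-color network, with layer sizes and resolution chosen purely for illustration:

```python
import torch
import torch.nn as nn

# A tiny network mapping a pixel coordinate (x, y) to a color (r, g, b).
net = nn.Sequential(
    nn.Linear(2, 32), nn.Tanh(),
    nn.Linear(32, 32), nn.Tanh(),
    nn.Linear(32, 3), nn.Sigmoid(),  # colors in [0, 1]
)

# Evaluate it on a 64x64 grid of coordinates to render an image.
xs, ys = torch.meshgrid(torch.linspace(-1, 1, 64),
                        torch.linspace(-1, 1, 64), indexing="ij")
coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)  # one (x, y) per pixel
image = net(coords).reshape(64, 64, 3)                 # (r, g, b) per pixel
```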

Page 39: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 40: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 41: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 42: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 43: NYAI #5 - Fun With Neural Nets by Jason Yosinski

Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images

Page 44: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 45: NYAI #5 - Fun With Neural Nets by Jason Yosinski

Simonyan, ICLR ’14 (L2 regularization)

Dai, Lu, Wu, ICLR ’15

Peacock: learned with no regularization

Page 46: NYAI #5 - Fun With Neural Nets by Jason Yosinski

L2 + L1 + spatial

No regularization
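
These class visualizations come from gradient ascent on the input image itself. Below is a hedged sketch of the L2-regularized variant in the spirit of Simonyan et al., ICLR '14; `model`, the step count, and the penalty weight are placeholders, and the extra L1/spatial terms named on this slide would be added to the same loss.

```python
import torch

def visualize_class(model, class_idx, steps=200, lr=1.0, l2=1e-4):
    """Gradient ascent on the input: find an image that maximizes one
    class score, with an L2 penalty keeping pixel values small."""
    x = torch.zeros(1, 3, 227, 227, requires_grad=True)  # start from zeros
    opt = torch.optim.SGD([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        score = model(x)[0, class_idx]       # assumes model returns class logits
        loss = -score + l2 * (x ** 2).sum()  # ascend the score, regularize pixels
        loss.backward()
        opt.step()
    return x.detach()
```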

Page 47: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 48: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 49: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 50: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 51: NYAI #5 - Fun With Neural Nets by Jason Yosinski

Nguyen, Dosovitskiy, Yosinski, Brox, Clune. “Synthesizing the preferred inputs for neurons in neural networks via deep generator networks”

[Figure: a deep generator network (prior) maps a code through upconvolutional layers (u1 … u9) to an image; the image feeds the DNN being visualized (convolutional layers c1 … c5, then fc6, fc7, fc8), whose output units include banana, convertible, and candle; forward and backward passes run through both networks]
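
In code, the setup in the figure amounts to optimizing the latent code rather than the pixels, with gradients flowing back through both networks. A hedged sketch, where `G` (the generator prior), `dnn`, the latent size, and the hyperparameters are all placeholders:

```python
import torch

def synthesize_preferred_input(G, dnn, unit, steps=200, lr=0.5):
    """Optimize a latent code so that G(code) strongly activates one
    output unit of the DNN being visualized."""
    code = torch.zeros(1, 4096, requires_grad=True)  # fc6-like latent code
    opt = torch.optim.SGD([code], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        img = G(code)               # forward pass through the generator prior
        loss = -dnn(img)[0, unit]   # maximize the chosen unit (e.g., "candle")
        loss.backward()             # backward pass through both networks
        opt.step()
    return G(code).detach()
```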

Page 52: NYAI #5 - Fun With Neural Nets by Jason Yosinski


Page 53: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 54: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 55: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 56: NYAI #5 - Fun With Neural Nets by Jason Yosinski

Castle + Candle = [synthesized image]

Fireboat + Candle = [synthesized image]
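
The additions on this slide suggest running the same optimization with a summed objective, so the generator blends two concepts. A hedged variant of the sketch above; the class indices and hyperparameters are again placeholders:

```python
import torch

def synthesize_blend(G, dnn, unit_a, unit_b, steps=200, lr=0.5):
    """Maximize the sum of two class scores (e.g., castle + candle)
    so the synthesized image mixes both concepts."""
    code = torch.zeros(1, 4096, requires_grad=True)
    opt = torch.optim.SGD([code], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        scores = dnn(G(code))[0]
        (-(scores[unit_a] + scores[unit_b])).backward()
        opt.step()
    return G(code).detach()
```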

Page 57: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 58: NYAI #5 - Fun With Neural Nets by Jason Yosinski

“What I cannot create, I do not understand.”

Richard Feynman’s blackboard

Car vs. Engine
Intelligence vs. …

Page 59: NYAI #5 - Fun With Neural Nets by Jason Yosinski

[Chart: AI Progress; ability over time, with curves for computation, data, and scientific understanding]

Page 60: NYAI #5 - Fun With Neural Nets by Jason Yosinski


Waiting for EEs and Internet

New field

“Pseudobiology”? (study of fake life)

Page 61: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 62: NYAI #5 - Fun With Neural Nets by Jason Yosinski

Thanks!

Hod Lipson

Jeff Clune

Yoshua Bengio

Anh Nguyen

Code/etc: http://yosinski.com
Email: [email protected]

( Slides: http://s.yosinski.com/nyai.pdf )

Page 63: NYAI #5 - Fun With Neural Nets by Jason Yosinski

Food & Drinks:

O’Reilly AI Conference Ticket Giveaway

INTERMISSION

Randomly selected by Jason & Ken