NYAI #5 - Fun With Neural Nets by Jason Yosinski

MEETUP #5: Neural Nets (Jason Yosinski) & ML for Production (Ken Sanford)


TRANSCRIPT

Page 1: NYAI #5 - Fun With Neural Nets by Jason Yosinski

MEETUP #5: Neural Nets (Jason Yosinski) & ML for Production (Ken Sanford)

Page 2: NYAI #5 - Fun With Neural Nets by Jason Yosinski

Fun with Neural Nets

NYAI meetup, 24 August 2016, Jason Yosinski

Original slides available under Creative Commons Attribution-ShareAlike 3.0

Geometric Intelligence

Page 3: NYAI #5 - Fun With Neural Nets by Jason Yosinski

Neural nets start working

1950 1960 1970 1980 1990 2000 2010 2020 …

Progress in AI

Page 4: NYAI #5 - Fun With Neural Nets by Jason Yosinski

Neural nets start working

1950 1960 1970 1980 1990 2000 2010 2020 …

Progress in AI

Chen et al., 2014

SMALL-FOOTPRINT KEYWORD SPOTTING USING DEEP NEURAL NETWORKS

Guoguo Chen*1, Carolina Parada2, Georg Heigold2

1 Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD; 2 Google Inc., Mountain View, CA

[email protected] [email protected] [email protected]

ABSTRACT

Our application requires a keyword spotting system with a small memory footprint, low computational cost, and high precision. To meet these requirements, we propose a simple approach based on deep neural networks. A deep neural network is trained to directly predict the keyword(s) or subword units of the keyword(s) followed by a posterior handling method producing a final confidence score. Keyword recognition results achieve 45% relative improvement with respect to a competitive Hidden Markov Model-based system, while performance in the presence of babble noise shows 39% relative improvement.

Index Terms— Deep Neural Network, Keyword Spotting, Embedded Speech Recognition

1. INTRODUCTION

Thanks to the rapid development of smartphones and tablets, interacting with technology using voice is becoming commonplace. For example, Google offers the ability to search by voice [1] on Android devices and Apple’s iOS devices are equipped with a conversational assistant named Siri. These products allow a user to tap a device and then speak a query or a command.

We are interested in enabling users to have a fully hands-free experience by developing a system that listens continuously for specific keywords to initiate voice input. This could be especially useful in situations like driving. The proposed system must be highly accurate, low-latency, small-footprint, and run in computationally constrained environments such as modern mobile devices. Running the system on the device avoids the latency and power implications of connecting to the server for recognition.

Keyword Spotting (KWS) aims at detecting predefined keywords in an audio stream, and it is a potential technique to provide the desired hands-free interface. There is an extensive literature in KWS, although most of the proposed methods are not suitable for low-latency applications in computationally constrained environments. For example, several KWS systems [2, 3, 4] assume offline processing of the audio using large vocabulary continuous speech recognition systems (LVCSR) to generate rich lattices. In this case, their task focuses on efficient indexing and search for keywords in the lattices. These systems are often used to search large databases of audio content. We focus instead on detecting keywords in the audio stream without any latency.

A commonly used technique for keyword spotting is the Keyword/Filler Hidden Markov Model (HMM) [5, 6, 7, 8, 9]. Despite being initially proposed over two decades ago, it remains highly competitive. In this generative approach, an HMM is trained

*The author performed the work as a summer intern at Google, MTV.

for each keyword, and a filler HMM is trained from the non-keyword segments of the speech signal (fillers). At runtime, these systems require Viterbi decoding, which can be computationally expensive depending on the HMM topology. Other recent work explores discriminative models for keyword spotting based on large-margin formulation [10, 11] or recurrent neural networks [12, 13]. These systems show improvement over the HMM approach but require processing of the entire utterance to find the optimal keyword region or take information from a long time span to predict the entire keyword, increasing detection latency.

We propose a simple discriminative KWS approach based on deep neural networks that is appropriate for mobile devices. We refer to it as Deep KWS. A deep neural network is trained to directly predict the keyword(s) or subword units of the keyword(s) followed by a posterior handling method producing a final confidence score. In contrast with the HMM approach, this system does not require a sequence search algorithm (decoding), leading to a significantly simpler implementation, reduced runtime computation, and smaller memory footprint. It also makes a decision every 10 ms, minimizing latency. We show that the Deep KWS system outperforms a standard HMM based system on both clean and noisy test sets, even when a smaller amount of data is used for training.

We describe our DNN based KWS framework in Section 2, and the baseline HMM based KWS system in Section 3. The experimental setup, results and some discussion follow in Section 4. Section 5 closes with the conclusions.

2. DEEP KWS SYSTEM

The proposed Deep KWS framework is illustrated in Figure 1. The framework consists of three major components: (i) a feature extraction module, (ii) a deep neural network, and (iii) a posterior handling module. The feature extraction module (i) performs voice-activity detection and generates a vector of features every frame (10 ms). These features are stacked using the left and right context to create …

Fig. 1. Framework of Deep KWS system, components from left to right: (i) Feature Extraction, (ii) Deep Neural Network, (iii) Posterior Handling

Speech recognition, natural language conversation
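
The two stages around the DNN are easy to mock up. Below is a minimal Python sketch of frame stacking and posterior handling, assuming 10 ms feature frames; the context widths, smoothing window, and moving-average scheme are illustrative stand-ins, not the paper's exact settings.

```python
import numpy as np

def stack_context(frames, left=10, right=5):
    """Stack each 10 ms feature frame with `left` past and `right` future
    frames, mirroring the feature-extraction stage described above."""
    T, d = frames.shape
    padded = np.pad(frames, ((left, right), (0, 0)), mode="edge")
    return np.stack([padded[t:t + left + right + 1].ravel() for t in range(T)])

def keyword_confidence(posteriors, smooth=30):
    """Posterior handling: smooth the per-frame keyword posteriors from the
    DNN and emit one running confidence score per 10 ms frame."""
    kernel = np.ones(smooth) / smooth
    smoothed = np.apply_along_axis(
        lambda p: np.convolve(p, kernel, mode="same"), 0, posteriors)
    return smoothed.max(axis=1)  # best smoothed keyword posterior per frame
```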

Page 5: NYAI #5 - Fun With Neural Nets by Jason Yosinski


Page 6: NYAI #5 - Fun With Neural Nets by Jason Yosinski


Reinforcement Learning

Silver et al., 2016

Page 7: NYAI #5 - Fun With Neural Nets by Jason Yosinski


Page 8: NYAI #5 - Fun With Neural Nets by Jason Yosinski


Page 9: NYAI #5 - Fun With Neural Nets by Jason Yosinski

Not just perceiving the world, but also generating…

Page 10: NYAI #5 - Fun With Neural Nets by Jason Yosinski

Robot Gait Discovery

Page 11: NYAI #5 - Fun With Neural Nets by Jason Yosinski

Hand-Coded Gait

Page 12: NYAI #5 - Fun With Neural Nets by Jason Yosinski

Fixed Shallow Topology, Learned Parameters

Page 13: NYAI #5 - Fun With Neural Nets by Jason Yosinski

Learned Deep Topology, Learned Parameters

Page 14: NYAI #5 - Fun With Neural Nets by Jason Yosinski

Learned Deep Topology, Learned Parameters

Page 15: NYAI #5 - Fun With Neural Nets by Jason Yosinski

Learned Deep Topology, Learned Parameters

9x faster than human-designed gait

Page 16: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 17: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 18: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 19: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 20: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 21: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 22: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 23: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 24: NYAI #5 - Fun With Neural Nets by Jason Yosinski

Lion

Krizhevsky et al. 2012

AlexNet

Lion

Recipe for understanding:
• architecture

5 convolutional layers, 3 FC layers

Page 25: NYAI #5 - Fun With Neural Nets by Jason Yosinski


Recipe for understanding:
• architecture
• dataset (big: 250b)


Page 26: NYAI #5 - Fun With Neural Nets by Jason Yosinski


Page 27: NYAI #5 - Fun With Neural Nets by Jason Yosinski


ImageNet, Deng et al. 2009

Page 28: NYAI #5 - Fun With Neural Nets by Jason Yosinski


Example ImageNet classes: jaguar, gibbon, great white shark, water bottle, golden retriever, orangutan, fireboat, bubble, tobacco shop, ambulance, cowboy hat, mixing bowl

Page 29: NYAI #5 - Fun With Neural Nets by Jason Yosinski


Page 30: NYAI #5 - Fun With Neural Nets by Jason Yosinski


Recipe for understanding:
• architecture
• dataset (big: 250b)
• parameters (big: 60m)


? ? ?
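
As a concrete anchor for the recipe, here is a hedged PyTorch sketch of the five-conv, three-FC AlexNet-style architecture the slides name. Channel sizes follow Krizhevsky et al. 2012, and the fully connected layers hold most of the roughly 60M parameters; this is an illustration, not the authors' training code.

```python
import torch.nn as nn

# AlexNet-style stack: 5 convolutional layers, then 3 fully connected layers.
alexnet_like = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),  # FC layers: the bulk of ~60M params
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),                    # 1000 ImageNet classes
)
```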

Page 31: NYAI #5 - Fun With Neural Nets by Jason Yosinski

< DeepVis Toolbox demo >

Code at: http://yosinski.com/

Page 32: NYAI #5 - Fun With Neural Nets by Jason Yosinski

Lion

Recipe for understanding:
• architecture
• dataset (big: 250b)
• parameters (big: 60m)

Page 33: NYAI #5 - Fun With Neural Nets by Jason Yosinski

See also: Erhan et al., 2009; Szegedy et al., 2013.

Recipe for understanding:
• architecture
• dataset (big: 250b)
• parameters (big: 60m)

Page 34: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 35: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 36: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 37: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 38: NYAI #5 - Fun With Neural Nets by Jason Yosinski

[Diagram (similar to this): an image represented as a mapping from pixel coordinates (x, y) to color values (r, g, b)]
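
The diagram shows an image represented as a function from pixel coordinates to colors, the kind of indirect encoding used to synthesize images later in the talk. A minimal sketch of such a coordinate-to-color network, with layer sizes and resolution chosen purely for illustration:

```python
import torch
import torch.nn as nn

# A tiny network mapping a pixel coordinate (x, y) to a color (r, g, b).
net = nn.Sequential(
    nn.Linear(2, 32), nn.Tanh(),
    nn.Linear(32, 32), nn.Tanh(),
    nn.Linear(32, 3), nn.Sigmoid(),  # colors in [0, 1]
)

# Evaluate it on a 64x64 grid of coordinates to render an image.
xs, ys = torch.meshgrid(torch.linspace(-1, 1, 64),
                        torch.linspace(-1, 1, 64), indexing="ij")
coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)  # one (x, y) per pixel
image = net(coords).reshape(64, 64, 3)                 # (r, g, b) per pixel
```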

Page 39: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 40: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 41: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 42: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 43: NYAI #5 - Fun With Neural Nets by Jason Yosinski

Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images

Page 44: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 45: NYAI #5 - Fun With Neural Nets by Jason Yosinski

Simonyan, ICLR ’14 (L2 regularization)

Dai, Lu, Wu, ICLR ’15

Peacock: learned with no regularization

Page 46: NYAI #5 - Fun With Neural Nets by Jason Yosinski

L2 + L1 + spatial

No regularization
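
These class visualizations come from gradient ascent on the input image itself. Below is a hedged sketch of the L2-regularized variant in the spirit of Simonyan et al., ICLR '14; `model`, the step count, and the penalty weight are placeholders, and the extra L1/spatial terms named on this slide would be added to the same loss.

```python
import torch

def visualize_class(model, class_idx, steps=200, lr=1.0, l2=1e-4):
    """Gradient ascent on the input: find an image that maximizes one
    class score, with an L2 penalty keeping pixel values small."""
    x = torch.zeros(1, 3, 227, 227, requires_grad=True)  # start from zeros
    opt = torch.optim.SGD([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        score = model(x)[0, class_idx]       # assumes model returns class logits
        loss = -score + l2 * (x ** 2).sum()  # ascend the score, regularize pixels
        loss.backward()
        opt.step()
    return x.detach()
```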

Page 47: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 48: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 49: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 50: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 51: NYAI #5 - Fun With Neural Nets by Jason Yosinski

Nguyen, Dosovitskiy, Yosinski, Brox, Clune. “Synthesizing the preferred inputs for neurons in neural networks via deep generator networks”

[Figure: a deep generator network (prior) maps a code through upconvolutional layers (u1 … u9) to an image; the image feeds the DNN being visualized (convolutional layers c1 … c5, then fc6, fc7, fc8), whose output units include banana, convertible, and candle; forward and backward passes run through both networks]
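
In code, the setup in the figure amounts to optimizing the latent code rather than the pixels, with gradients flowing back through both networks. A hedged sketch, where `G` (the generator prior), `dnn`, the latent size, and the hyperparameters are all placeholders:

```python
import torch

def synthesize_preferred_input(G, dnn, unit, steps=200, lr=0.5):
    """Optimize a latent code so that G(code) strongly activates one
    output unit of the DNN being visualized."""
    code = torch.zeros(1, 4096, requires_grad=True)  # fc6-like latent code
    opt = torch.optim.SGD([code], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        img = G(code)               # forward pass through the generator prior
        loss = -dnn(img)[0, unit]   # maximize the chosen unit (e.g., "candle")
        loss.backward()             # backward pass through both networks
        opt.step()
    return G(code).detach()
```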

Page 52: NYAI #5 - Fun With Neural Nets by Jason Yosinski


Page 53: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 54: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 55: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 56: NYAI #5 - Fun With Neural Nets by Jason Yosinski

Castle + Candle = [synthesized image]

Fireboat + Candle = [synthesized image]
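
The additions on this slide suggest running the same optimization with a summed objective, so the generator blends two concepts. A hedged variant of the sketch above; the class indices and hyperparameters are again placeholders:

```python
import torch

def synthesize_blend(G, dnn, unit_a, unit_b, steps=200, lr=0.5):
    """Maximize the sum of two class scores (e.g., castle + candle)
    so the synthesized image mixes both concepts."""
    code = torch.zeros(1, 4096, requires_grad=True)
    opt = torch.optim.SGD([code], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        scores = dnn(G(code))[0]
        (-(scores[unit_a] + scores[unit_b])).backward()
        opt.step()
    return G(code).detach()
```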

Page 57: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 58: NYAI #5 - Fun With Neural Nets by Jason Yosinski

“What I cannot create, I do not understand.”

Richard Feynman’s blackboard

Car vs. Engine
Intelligence vs. …

Page 59: NYAI #5 - Fun With Neural Nets by Jason Yosinski

[Chart: AI Progress; ability over time, with curves for computation, data, and scientific understanding]

Page 60: NYAI #5 - Fun With Neural Nets by Jason Yosinski


Waiting for EEs and Internet

New field

“Pseudobiology”? (study of fake life)

Page 61: NYAI #5 - Fun With Neural Nets by Jason Yosinski
Page 62: NYAI #5 - Fun With Neural Nets by Jason Yosinski

Thanks!

Hod Lipson

Jeff Clune

Yoshua Bengio

Anh Nguyen

Code/etc: http://yosinski.com
Email: [email protected]

( Slides: http://s.yosinski.com/nyai.pdf )

Page 63: NYAI #5 - Fun With Neural Nets by Jason Yosinski

Food & Drinks:

O’Reilly AI Conference Ticket Giveaway

INTERMISSION

Randomly selected by Jason & Ken