
  • A to Z of AI/ML: A Quick Introduction to Artificial Intelligence and

    Machine Learning Capabilities and Tools

    EngCon 2017

    Mark Crowley, Assistant Professor

    Electrical and Computer Engineering, University of Waterloo

    [email protected]

    Sep 23, 2017

  • Introduction

    Outline

    Introduction

    What is AI?

    Neural Networks

    Convolutional Neural Networks

    Do you need AI/ML?


  • Introduction

    My Background

    Waterloo : Assistant Professor, ECE Department since 2015

    PhD at UBC in Computer Science with Prof. David Poole

    Postdoc at Oregon State University

UW ECE ML Lab: https://uwaterloo.ca/scholar/mcrowley/lab

    Waterloo Institute for Complexity and Innovation (WICI)

    Research Fellow at ElementAI

    Pattern Analysis and Machine Intelligence (PAMI)

http://waterloo.ai

List of faculty

    Research projects (co-op/internships)

    List of spinoff companies from UWaterloo (a good place for project ideas)


  • What is AI?

What do you think of when you hear...

    Artificial Intelligence? Machine Learning?

  • What is AI? Landscape of Big Data/AI/ML

    Data, Big Data, Machine Learning, AI, etc.

    [Figure: landscape diagram relating Data, Big Data Tools, Data Analysis, Machine Learning, and Artificial Intelligence. A data table (points x1..xn with features f1..fk, e.g. x1 = (.5, 104.2, China)) feeds Data Analysis, which yields reports, statistics, charts, and trends for Human Decision Making, and Machine Learning, which yields classifications, patterns, predictions, and probabilities that drive Automated Decision Making via policies, decision rules, and summaries.]

  • What is AI? Landscape of Big Data/AI/ML

    Data, Big Data, Machine Learning, AI, etc.

    [Figure: the same landscape diagram, annotated with AI/ML subfields: Deep Learning (CNN, RNN, LSTM), Reinforcement Learning (DQN, A3C), Multi-agent Systems, Game Theory, Inductive Logic Programming (ILP), Probabilistic Programming, Constraint Programming, SAT, SMP, Natural Language Processing, Robotics, and Vision.]

  • What is AI? Landscape of Big Data/AI/ML

    Major Types/Areas of AI

    Artificial Intelligence: some algorithm to enable computers to perform actions we define as requiring intelligence. This is a moving target.

    Search-Based Heuristic Optimization (A*)

    Evolutionary Computation (genetic algorithms)

    Logic Programming (inductive logic programming, fuzzy logic)

    Probabilistic Reasoning Under Uncertainty (Bayesian networks)

    Computer Vision

    Natural Language Processing

    Robotics

    Machine Learning

  • What is AI? Landscape of Big Data/AI/ML

Types of Machine Learning

    Machine Learning: "Detect patterns in data, use the uncovered patterns to predict future data or other outcomes of interest." - Kevin Murphy, Google Research

  • What is AI? Landscape of Big Data/AI/ML

    Deep Learning

Deep Learning: methods which perform machine learning through the use of multilayer neural networks of some kind. Deep learning can be applied in any of the three main types of ML:

    Supervised Learning: very common, enormous improvement in recent years

    Unsupervised Learning: just beginning, lots of potential

    Reinforcement Learning: recently (the past 3 years) this has exploded, especially for video games

  • What is AI? Landscape of Big Data/AI/ML

    Increasing Complexity of Supervised ML Methods

1. Mean, mode, max, min - basic statistics and patterns

    2. Prediction/regression - least squares, ridge regression

    3. Linear classification - use distances and separation of data points (logistic regression, SVM, KNN)

    4. Kernel-based classification - define a mapping from the original data to a new space, allowing nonlinear divisions to be found

    5. Decision trees - learn rules that divide the data arbitrarily (C4.5, Random Forests, AdaBoost)

    6. Neural networks - learn a function using neurons

    7. Deep neural networks - same, but deep :)

    8. Recurrent neural networks - add links to past timesteps, learning with memory of the past

    9. Convolutional neural networks - add convolutional filters, good for images

    10. Inception ResNets, Long Short-Term Memory networks, Voxception networks, ... oh, it keeps going...

  • What is AI? Classification

    One Example of ML: Classification


  • What is AI? Classification

    Clustering vs. Classification

Clustering

    Unsupervised

    Uses unlabeled data

    Organizes patterns with respect to an optimization criterion

    Requires a definition of similarity

    Hard to evaluate

    Examples: K-means, Fuzzy C-means, Hierarchical Clustering, DBSCAN

    Classification

    Supervised

    Uses labeled data

    Requires a training phase

    Domain sensitive

    Easy to evaluate (you know the correct answer)

    Examples: Naive Bayes, KNN, SVM, Decision Trees, Random Forests
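To make the contrast concrete, here is a minimal scikit-learn sketch (the dataset is synthetic and all parameter choices are illustrative) that clusters the same data without labels, then classifies it with labels:

```python
# Clustering vs. classification on the same synthetic data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# Clustering: unsupervised, uses only X; evaluating the result is the hard part.
clusters = KMeans(n_clusters=3, random_state=0).fit_predict(X)

# Classification: supervised, needs labels y and a training phase;
# evaluation is easy because the held-out correct answers are known.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
print("KNN test accuracy:", knn.score(X_te, y_te))
```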

  • What is AI? Classification

    Classification Performance Depends on the Algorithm

    A good example of these choices is Support Vector Machines (SVMs).

    Popular until the rise of deep learning in the past few years

    Core idea: find a dividing hyperplane

    Many variations: the decision boundary can be linear, polynomial, Gaussian (RBF), or high-dimensional

    So what is the right approach? Experimentation!

  • What is AI? Classification

    Classification Performance Depends on the Algorithm

So choose carefully... See http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
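As a sketch of that experimentation loop (synthetic data; the classifier choices are illustrative, not a recommendation), cross-validated accuracy makes the comparison concrete:

```python
# Compare a few classifiers on one dataset with 5-fold cross-validation.
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)

classifiers = {
    "logistic regression": LogisticRegression(),
    "linear SVM": SVC(kernel="linear"),
    "RBF SVM": SVC(kernel="rbf"),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```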

  • Neural Networks Building Upon Classic Machine Learning

    Outline

    Introduction

    What is AI?

Neural Networks
        Building Upon Classic Machine Learning
        History Of Neural Networks
        Improving Performance

    Convolutional Neural Networks

    Do you need AI/ML?


  • Neural Networks Building Upon Classic Machine Learning

    Linear Regression vs. Logistic Regression

    A simple type of Generalized Linear Model

Linear regression learns a function to predict a continuous output variable from continuous or discrete input variables:

    $Y = b_0 + \sum_i b_i X_i + \epsilon$

    Logistic regression predicts the probability of an outcome: the appropriate class for an input vector, or the odds of one outcome being more likely than another.

  • Neural Networks Building Upon Classic Machine Learning

    Logistic Regression as a Graphical Model

$o(x) = \sigma(w^T x) = \sigma\left(w_0 + \sum_i w_i x_i\right) = \dfrac{1}{1 + \exp\left(-\left(w_0 + \sum_i w_i x_i\right)\right)}$

  • Neural Networks Building Upon Classic Machine Learning

    Logistic Regression Used as a Classifier

    Logistic Regression can be used as a simple linear classifier.

Compare the probabilities of each class, P(Y = 0|X) and P(Y = 1|X). Treat the halfway point on the sigmoid as the decision boundary:

    If $P(Y = 1|X) > 0.5$, classify X in class 1; the decision boundary is $w_0 + \sum_i w_i x_i = 0$.

  • Neural Networks Building Upon Classic Machine Learning

    Training Logistic Regression Model via Gradient Descent

We can't easily perform Maximum Likelihood Estimation in closed form.

    The negative log-likelihood of the logistic model is NLL and its gradient is g:

    $NLL(w) = \sum_{i=1}^{N} \log\left(1 + \exp\left(-\tilde{y}_i \left(w_0 + \sum_j w_j x_{ij}\right)\right)\right)$, with labels $\tilde{y}_i \in \{-1, +1\}$

    $g = \nabla_w NLL(w) = \sum_i \left(\sigma(w^T x_i) - y_i\right) x_i$, with labels $y_i \in \{0, 1\}$

    Then we can update the parameters iteratively:

    $\theta_{k+1} = \theta_k - \eta_k g_k$

    where $\eta_k$ is the learning rate or step size.
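A minimal NumPy sketch of this update rule (the synthetic data and all names here are illustrative, not from the slides):

```python
# Batch gradient descent on the logistic negative log-likelihood.
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 2
X = rng.normal(size=(N, d))
w_true = np.array([2.0, -3.0])
y = ((X @ w_true) > 0).astype(float)         # labels in {0, 1}

X1 = np.hstack([np.ones((N, 1)), X])         # prepend a column of ones for the bias w0
w = np.zeros(d + 1)
eta = 0.1                                    # learning rate (step size)

for k in range(500):
    p = 1 / (1 + np.exp(-(X1 @ w)))          # sigma(w^T x_i) for every example
    g = X1.T @ (p - y)                       # gradient: sum_i (sigma(w^T x_i) - y_i) x_i
    w -= eta / N * g                         # iterative update: w <- w - eta * g
print("learned weights:", w)
```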

  • Neural Networks History Of Neural Networks

Neural networks learn $f: X \to Y$, where $f$ can be a non-linear function:

    $X$: (vector of) continuous and/or discrete variables

    $Y$: (vector of) continuous and/or discrete variables

    Feedforward neural networks represent $f$ by a network of non-linear (logistic/sigmoid/ReLU) units:

    Input layer, $X$

    Output layer, $Y$

    Hidden layer, $H$

    Nonlinear unit: sigmoid/ReLU/ELU

  • Neural Networks History Of Neural Networks

    Basic Three Layer Neural Network

    Input Layer

Vector data: each input unit collects one feature/dimension of the data and passes it on to the (first) hidden layer.

    Hidden Layer

    Each hidden unit computes a weighted sum of all the units from the input layer (or any previous layer) and passes it through a nonlinear activation function.

    Output Layer

    Each output unit computes a weighted sum of all the hidden units and passes it through a (possibly nonlinear) threshold function.
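As a sketch, one forward pass through such a three-layer network looks like this in NumPy (the layer sizes and random weights are illustrative):

```python
# One forward pass: input layer -> hidden layer -> output layer.
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=4)                          # input layer: one feature per unit

W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)   # input -> hidden weights and biases
W2, b2 = rng.normal(size=(3, 5)), np.zeros(3)   # hidden -> output weights and biases

h = sigmoid(W1 @ x + b1)   # hidden units: weighted sum + nonlinear activation
o = sigmoid(W2 @ h + b2)   # output units: weighted sum + (possibly nonlinear) threshold
print(o)
```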

  • Neural Networks History Of Neural Networks

    Properties of Neural Networks

Universality: given a large enough layer of hidden units (or multiple layers), a NN can represent any function.

    Representation Learning: classic statistical machine learning is about learning functions that map input data to output. Neural networks, and especially deep learning, are more about learning a representation in order to perform classification or some other task.

  • Neural Networks History Of Neural Networks

    Hidden Layer: Adding Nonlinearity

Each hidden unit emits an output that is a nonlinear activation function of its net activation:

    $y_j = f(net_j)$

    This is essential to a neural network's power: if it is linear, the whole model collapses to linear regression.

    The output is thus thresholded through this nonlinear activation function.

  • Neural Networks History Of Neural Networks

    Activation Functions

tanh was another common function.

    sigmoid is now discouraged except for the final layer, to obtain probabilities; it can over-saturate easily.

    ReLU is the new standard activation function to use.
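For reference, the activation functions named here as NumPy one-liners (a quick sketch, not a library API):

```python
import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))               # saturates for large |z|
tanh    = np.tanh                                      # zero-centred, also saturates
relu    = lambda z: np.maximum(0, z)                   # max(0, z), the current default
elu     = lambda z, a=1.0: np.where(z > 0, z, a * (np.exp(z) - 1))

z = np.linspace(-3, 3, 7)
for name, f in [("sigmoid", sigmoid), ("tanh", tanh), ("ReLU", relu), ("ELU", elu)]:
    print(name, np.round(f(z), 2))
```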

  • Neural Networks History Of Neural Networks

    Rectified Linear Activation

[Figure 6.3 from Goodfellow et al. 2016: the rectified linear activation function, g(z) = max{0, z}. It is the default recommendation for most feedforward networks: applying it to the output of a linear transformation yields a nonlinear transformation, yet it remains piecewise linear with two pieces, so it preserves many of the properties that make linear models easy to optimize with gradient-based methods and generalize well. Much as a Turing machine's memory needs only to store 0 or 1 states, we can build a universal function approximator from rectified linear functions.]

    (Goodfellow 2016)

    Rectified Linear Units (ReLU) have become standard: $\max(0, net_j)$

    Strong signals are always easy to distinguish

    Most values are zero, and the derivative is mostly zero

    They do not saturate as easily as sigmoid

    New: Exponential Linear Units (ELU) - evidence that they perform better than ReLU in some situations.

  • Neural Networks History Of Neural Networks

Gradient Descent

    $E$: error function (Mean Squared Error, cross-entropy loss, etc.)

    For neural networks, $E[w]$ is no longer convex in $w$.

    Gradient: $\nabla E[w] = \left[\frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_d}\right]$

    Training update rule: $\Delta w_i = -\eta \frac{\partial E}{\partial w_i}$, where $\eta$ is the training rate.

    Note: for regression and others, this objective is convex. In ANNs it is not, so we must solve iteratively.

    (Slides from Tom Mitchell ML Course, CMU, 2010)

  • Neural Networks History Of Neural Networks

    Incremental Gradient Descent

Let the error function for example $l$ be: $E_l[w] = \frac{1}{2}(y^l - o^l)^2$

    Do until satisfied, for each training example $l$ in $D$:

    1. Compute the gradient $\nabla E_l[w]$

    2. Update the weights: $w = w - \eta \nabla E_l[w]$

    Note: can also use batch gradient descent on many points at once.

  • Neural Networks History Of Neural Networks

    Backpropagation Algorithm

We need an iterative algorithm for computing the gradient efficiently. For each training example:

    1. Forward propagation: input the training example to the network and compute the outputs

    2. Compute the output units' errors:

    $\delta_k^l = o_k^l (1 - o_k^l)(y_k^l - o_k^l)$

    3. Compute the hidden units' errors:

    $\delta_h^l = o_h^l (1 - o_h^l) \sum_k w_{h,k} \delta_k^l$

    4. Update the network weights:

    $w_{i,j} = w_{i,j} + \Delta w_{i,j}^l = w_{i,j} + \eta\, \delta_j^l o_i^l$
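A minimal NumPy sketch of one such step for a single example, following formulas 2-4 above (the network sizes and data are illustrative):

```python
# One backpropagation step for one training example, sigmoid units throughout.
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=4)                 # one training example
y = np.array([1.0, 0.0])               # its target output
W1 = rng.normal(size=(3, 4)) * 0.1     # input -> hidden weights
W2 = rng.normal(size=(2, 3)) * 0.1     # hidden -> output weights
eta = 0.5                              # learning rate

# 1. Forward propagation
o_h = sigmoid(W1 @ x)                  # hidden outputs
o_k = sigmoid(W2 @ o_h)                # network outputs

# 2. Output unit errors: delta_k = o_k (1 - o_k)(y_k - o_k)
delta_k = o_k * (1 - o_k) * (y - o_k)

# 3. Hidden unit errors: delta_h = o_h (1 - o_h) sum_k w_hk delta_k
delta_h = o_h * (1 - o_h) * (W2.T @ delta_k)

# 4. Weight updates: w_ij <- w_ij + eta * delta_j * o_i
W2 += eta * np.outer(delta_k, o_h)
W1 += eta * np.outer(delta_h, x)
```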

  • Neural Networks History Of Neural Networks

    A Short History

40s Early work in NNs goes back to the 1940s, with a simple model of the neuron by McCulloch and Pitts as a summing and thresholding device.

    1958 Rosenblatt introduced the Perceptron, a two-layer network (one input layer and one output node with a bias in addition to the input features).

    1969 Marvin Minsky: Perceptrons are just linear, AI goes logical, beginning of the AI Winter

    1980s Neural network resurgence: backpropagation (updating weights by gradient descent)

    1990s SVMs! Kernels can do anything! (no, they can't)

  • Neural Networks History Of Neural Networks

    A Short History

1993 LeNet 1 for digit recognition

    2003 Deep Learning (convolutional nets, Dropout/RBMs, Deep Belief Networks)

    1986, 2006 Restricted Boltzmann Machines

    2006 Neural networks outperform RBF SVMs on the MNIST handwriting dataset (Hinton et al.)

    2012 AlexNet wins the ImageNet challenge, beating the competition with an error rate of 16% vs. 26% for the next best. ImageNet contains 15 million annotated images in over 22,000 categories. The ZFNet paper (2013) extends this and has a good description of the network structure.

    2012-present Google Cat YouTube experiment, speech recognition, self-driving cars, computer defeats regional Go champion, ...

    2014 GoogLeNet added many layers and introduced inception modules (allowing parallel computation rather than serially connected CNN layers)

  • Neural Networks History Of Neural Networks

    A Short History

2014 Generative Adversarial Networks (GANs) introduced.

    2015 Microsoft's algorithm beats human performance on the ImageNet challenge.

    2016 AlphaGo defeats Lee Sedol, one of the world's best Go players, using Deep Reinforcement Learning.

    2016 DeepMind introduces the A3C Deep RL algorithm, which can learn to play Atari games from images alone, with no instructions.

  • Neural Networks History Of Neural Networks

    Outline

    Introduction

    What is AI?

Neural Networks
        Building Upon Classic Machine Learning
        History Of Neural Networks
        Improving Performance

    Convolutional Neural Networks

    Do you need AI/ML?


  • Neural Networks History Of Neural Networks

    Problems with ANNs

Overfitting

    Very inefficient for images, time series, and large numbers of inputs/outputs

    Slow to train

    Hard to interpret the resulting model

  • Neural Networks Improving Performance

    Heuristics for Improving Backpropagation

There are a number of heuristics for training neural networks that are useful in practice (maybe we'll learn more today):

    Fewer hidden nodes: just enough complexity to work, not so much that it overfits.

    Train multiple networks with different sizes and search for the best design.

    Validation set: train on the training set until error on the validation set starts to rise, then evaluate on the test set.

    Try different activation functions: tanh, ReLU, ELU, ...?

    Dropout (Hinton 2014): randomly ignore certain units during training (don't update them via gradient descent); leads to hidden units that specialize.

    Modify the learning rate over time (cooling schedule)

  • Neural Networks Improving Performance

    Dropout

Dropout (Hinton 2014): randomly ignore certain units during training (don't update them via gradient descent); leads to hidden units that specialize.

    With probability p, don't include a weight in the gradient updates.

    Reduces overfitting by encouraging robustness of the weights in the network.
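A minimal NumPy sketch of the mask (this shows the common "inverted dropout" scaling, which is an assumption beyond the slide, not necessarily Hinton's original formulation):

```python
# Dropout at training time: with probability p, a hidden unit is ignored.
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=8)          # hidden-layer activations for one example
p = 0.5                         # dropout probability

mask = (rng.random(h.shape) >= p).astype(float)
h_train = h * mask / (1 - p)    # scale survivors so the expected activation is unchanged
# At test time all units are used: h_test = h (no mask, no scaling).
```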

  • Neural Networks Improving Performance

    Large, Shallow Models Overfit More

[Figure 6.7 from Goodfellow et al. 2016: test accuracy (percent) vs. number of parameters for networks of depth 3 (convolutional), depth 3 (fully connected), and depth 11 (convolutional). Deeper models tend to perform better, and not merely because they are larger: increasing the number of parameters without increasing depth is far less effective at increasing test set performance. In this experiment (Goodfellow et al. 2014d), shallow models overfit at around 20 million parameters while deep ones benefit from over 60 million. This suggests a deep model expresses a useful preference over the space of functions: the function should consist of many simpler functions composed together, either as a representation composed of simpler representations (e.g., corners defined in terms of edges) or as a program with sequentially dependent steps (e.g., first locate a set of objects, then segment them from each other, then recognize them).]

    (Goodfellow 2016)

  • Convolutional Neural Networks

    Outline

    Introduction

    What is AI?

    Neural Networks

Convolutional Neural Networks
        Motivation
        Other Types of Deep Neural Networks

    Do you need AI/ML?


  • Convolutional Neural Networks Motivation

    Convolutional Network Structure

Input data: an image (e.g., 256x256 pixels x 3 RGB channels)

    Output: a categorical label

  • Convolutional Neural Networks Motivation

    Example Applications of CNNs

(Karpathy Blog, Oct 25, 2015 - http://karpathy.github.io/2015/10/25/selfie/)

  • Convolutional Neural Networks Motivation

Parameter Sharing

    [Figure 9.5 from Goodfellow et al. 2016: black arrows indicate the connections that use a particular parameter in two different models. (Top) The uses of the central element of a 3-element kernel in a convolutional model: due to parameter sharing, this single parameter is used at all input locations. (Bottom) The use of the central element of the weight matrix in a fully connected model: with no parameter sharing, the parameter is used only once.]

    Convolution shares the same parameters across all spatial locations; traditional matrix multiplication does not share any parameters. Sharing does not affect the runtime of forward propagation (still O(k x n)), but it reduces the storage requirements of the model to k parameters, where k is usually several orders of magnitude smaller than m. Since m and n are usually roughly the same size, k is practically insignificant compared to m x n, so convolution is dramatically more efficient than dense matrix multiplication in memory requirements and statistical efficiency. Parameter sharing also gives the convolution layer equivariance to translation: a function f(x) is equivariant to g if f(g(x)) = g(f(x)), and if g is any function that translates (shifts) the input, convolution is equivariant to g.

    (Goodfellow 2016)

  • Convolutional Neural Networks Motivation

2D Convolution

    [Figure 9.1 from Goodfellow et al. 2016: an example of 2-D convolution without kernel flipping. A 3x4 input with entries a..l is convolved with a 2x2 kernel (w, x; y, z), restricting the output to positions where the kernel lies entirely within the image ("valid" convolution in some contexts). The result is a 2x3 output whose upper-left element, aw + bx + ey + fz, is formed by applying the kernel to the corresponding upper-left region of the input.]

    (Goodfellow 2016)

  • Convolutional Neural Networks Motivation

A Simple Example: Edge Detection by Convolution

    Edge detection by convolution with a kernel that, for every pixel, subtracts the value of the neighbouring pixel on the left.

    [Figure 9.6 from Goodfellow et al. 2016: efficiency of edge detection. The output image is formed by taking each pixel in the input image and subtracting the value of its neighbouring pixel on the left, showing the strength of all vertically oriented edges, which can be a useful operation for object detection. Both images are 280 pixels tall; the input is 320 pixels wide and the output 319. The transformation is a convolution kernel containing two elements and requires 319 x 280 x 3 = 267,960 floating point operations (two multiplications and one addition per output pixel). Describing the same transformation with a matrix multiplication would take 320 x 280 x 319 x 280, or over eight billion, entries, making convolution four billion times more efficient for representing the transformation; the straightforward matrix multiplication performs over sixteen billion floating point operations, making convolution roughly 60,000 times more efficient computationally. Even storing only the nonzero entries, the matrix would still need 2 x 319 x 280 = 178,640 entries. Convolution is an extremely efficient way of describing transformations that apply the same linear transformation of a small, local region across the entire input. (Photo credit: Paula Goodfellow)]

    (Goodfellow 2016)
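A minimal NumPy sketch of the same edge detector (the toy image sizes mirror the figure; note that np.convolve flips its kernel, so passing [1, -1] implements the [-1, 1] left-neighbour filter):

```python
# Subtract each pixel's left neighbour: a two-element convolution per row.
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((280, 320))               # toy 280 x 320 grayscale image

edges = image[:, 1:] - image[:, :-1]         # output is 280 x 319

# Equivalent per-row convolution; the same kernel is shared at every location.
row0 = np.convolve(image[0], [1, -1], mode="valid")
assert np.allclose(row0, edges[0])
```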

  • Convolutional Neural Networks Motivation

Other CNN Modifications

    Pooling: nearby pixels tend to represent the same thing/class/object, so pool the responses from nearby nodes (e.g., mean, median, min, max)

    Strides: the step size between adjacent filter applications, which controls how much the filters overlap

    Zero padding: padding the border of the input with zeros so filters can cover edge pixels; controls the size of the network and deals with edge effects

    Connectivity: alternate local connectivity options, partial connectivity
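As a sketch of the pooling idea, 2x2 max pooling with stride 2 in NumPy (the feature-map size is illustrative):

```python
# Max pooling: keep the strongest response in each 2x2 neighbourhood.
import numpy as np

rng = np.random.default_rng(0)
fmap = rng.random((6, 6))                            # toy feature map

# Reshape into 2x2 blocks, then take the max within each block.
pooled = fmap.reshape(3, 2, 3, 2).max(axis=(1, 3))   # 6x6 -> 3x3
print(pooled.shape)
```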

  • Convolutional Neural Networks Other Types of Deep Neural Networks

    Other Types of Deep Neural Networks

    RBM: Restricted Boltzmann Machines - an older, undirected deep model

    RNN: Recurrent Neural Networks - allow links from outputs back to inputs over time; good for time-series learning

    LSTM: Long Short-Term Memory networks - a more complex form of RNN; they integrate strategically remembered information from the past and formalize a process for forgetting information over time; useful if you need to learn patterns over time in your data features

    DeepRL: Deep Reinforcement Learning - CNNs plus a fully connected deep network for learning a representation of a policy; reinforcement learning for updating the policy through experience to make improved decisions; requires a value/reward function

    GAN: Generative Adversarial Networks - train two networks at once

  • Convolutional Neural Networks Other Types of Deep Neural Networks

Generative Adversarial Networks

    One network produces/hallucinates new answers (generative)

    A second network distinguishes between the real and the generated answers (adversary/critic)

    Blog with code: "GANs in 50 lines of code" (PyTorch) - an easy way to get started

  • Do you need AI/ML?

    Outline

    Introduction

    What is AI?

    Neural Networks

    Convolutional Neural Networks

Do you need AI/ML?
        Defining Your Questions
        Designing Your AI/ML System
        Languages and Libraries
        Deep Learning Frameworks
        Compute Resources

  • Do you need AI/ML? Defining Your Questions

    Defining Your Questions

    Is it a decision to be made?

    Is there a pattern to detect?

    Do you have data?

What kinds of questions do you have about the data?

    Yes/No questions - Did X happen? Are A and B correlated?

    Timing - When did X happen?

    Anomaly detection - Is X strange/abnormal/unexpected?

    Classification - What kind of Y is X?

    Prediction - We've seen lots of (X, Y); now we want to know (X, ?)

    Do you have labels? Can you give the right answer for some portion of the data?

    Collecting labels: automatic? Manual? Crowd-sourced? (e.g., Amazon Mechanical Turk)

    Yes: Supervised Learning - lots of options

    No: Unsupervised Learning - some options (getting better all the time)

  • Do you need AI/ML? Defining Your Questions

    Answers and Constraints

    What kind of answer do you need? (increasing difficulty)

    Find patterns which are present in the data and view them

    Most likely explanation for a pattern

    Probability of (fact about X,A,B...) being true

    A policy for actions to take in the future to maximize benefit

    Guarantees that X will (or will not) happen (very hard)

    How big is your data?

    Is it static?

    MB, GB, TB?

    Is it streaming?

    KB/sec, MB/sec

    How many data points/rows/events will there be?


  • Do you need AI/ML? Designing Your AI/ML System

    How to Design your AI/ML Question

    Define your task:

Prediction, Clustering, Classification, Anomaly Detection?

    Define objectives, error metrics, performance standards

    Collect Data:

    Set up the data stream (storage, input flow, parallelization, Hadoop)

    Preprocessing:

    Noise/Outlier Filtering

    Completing missing data (histograms, interpolation)

    Normalization (scaling data)

  • Do you need AI/ML? Designing Your AI/ML System

    How to Design your AI/ML Question

Dimensionality Reduction / Feature Selection:

    Choose features to use/extract from data

    PCA/LDA/LLE/GDA

    Choose Algorithm:

    Consider goals, questions

    Tractability

    Experimental Design:

    train/validate/test data sets

    cross-validation

    Run it!:

    Deployment

    Maintenance

    Updating to new advances

  • Do you need AI/ML? Languages and Libraries

    Language Choices

Any language can be used for implementing/using AI/ML algorithms, but some make it much easier:

    C++: you can do it, but you may need to implement many things yourself

    Java: many libraries for ML (Weka is a good open-source one; Deeplearning4j)

    Scala: leaner, functional language that compiles to JVM bytecode; good for prototyping; can reuse Java libraries (Deeplearning4j)

    R: focused on statistical methods; more and more machine learning libraries are implemented for it

    Matlab: good for all the calculations; if you have the right libraries it's great (not cheap or very portable beyond school)

    Python: most commonly used right now for deep learning; we're going to need another slide ...

  • Do you need AI/ML? Languages and Libraries

    Python

numpy - numerical libraries; implementations of matrix and linear algebra data structures; graphing tools

    pandas - table data structure and statistical analysis tools (implements many useful features from R)

    scipy - includes all of the above and more; a full installation of scientific libraries that basically turns Python into R + Matlab

    scikit-learn - many standard machine learning algorithms implemented as easy-to-use Python APIs

    jupyter notebooks - powerful web-based interfaces to Python for data analysis and machine learning
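A minimal sketch of how these pieces fit together (the CSV file and column names are hypothetical, purely for illustration):

```python
# pandas for the data table, scikit-learn for the model.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("measurements.csv")        # hypothetical data file
X = df.drop(columns=["label"])              # feature columns
y = df["label"]                             # target column

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=100).fit(X_tr, y_tr)
print("test accuracy:", model.score(X_te, y_te))
```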

  • Do you need AI/ML? Deep Learning Frameworks

    Deep Learning Frameworks

Caffe - older; easy to set up mockups, but can be harder to install

    Theano - from the University of Montreal; great theoretical setup, very flexible; Python only

    TensorFlow - made by Google; scales to many GPUs and servers with lots of optimization, but requires planning the whole system beforehand; most languages

    PyTorch - easier to mock things up and try different designs; not as optimized for large-scale performance as TensorFlow

    MXNet - an Apache project (backed by Amazon); supports most languages and OSs

    Deeplearning4j - Java-focused framework

    Keras - an open interface for creating models in multiple frameworks (TensorFlow, Theano, MXNet)
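As a sketch of the Keras style (2017-era API with the TensorFlow backend assumed; the layer sizes and input shape are illustrative):

```python
# A tiny binary classifier: the same model definition runs on any Keras backend.
from keras.models import Sequential
from keras.layers import Dense

model = Sequential([
    Dense(64, activation="relu", input_shape=(20,)),  # hidden layer
    Dense(1, activation="sigmoid"),                   # binary output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(X_train, y_train, epochs=10, batch_size=32)  # given your own data
```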

  • Do you need AI/ML? Compute Resources

    Cloud Services

There are several powerful, free services you can access via a student account, which you can request directly.

    AWS: Amazon Web Services - very large; has accessible APIs to connect to; many options for hardware to run on (but the best ones will cost extra)

    Azure: Microsoft - lots of visual tools for composing AI/ML components

    Google Cloud ML Engine: uses all the latest tools and TensorFlow models

    None of these provide GPU servers for free; that will cost extra. (It will still work, just be slower for deep learning.)

  • Do you need AI/ML? Compute Resources

    Summary

    Introduction

What is AI?
        Landscape of Big Data/AI/ML
        Classification

    Neural Networks
        Building Upon Classic Machine Learning
        History Of Neural Networks
        Improving Performance

    Convolutional Neural Networks
        Motivation
        Other Types of Deep Neural Networks

    Do you need AI/ML?
        Defining Your Questions
        Designing Your AI/ML System
        Languages and Libraries
        Deep Learning Frameworks
        Compute Resources

  • Do you need AI/ML? Compute Resources

    Useful Books

Books from three eras of machine learning:

    [Goodfellow, 2016] Goodfellow, Bengio, and Courville. Deep Learning, MIT Press, 2016. http://www.deeplearningbook.org/ (the website has a free copy of the book as PDFs)

    [Murphy, 2012] Kevin Murphy. Machine Learning: A Probabilistic Perspective, MIT Press, 2012.

    [Duda, 2001] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification (2nd ed.), John Wiley and Sons, 2001.

  • Do you need AI/ML? Compute Resources

    Useful Papers and Blogs

[lecun2015] Y. LeCun, Y. Bengio, and G. Hinton. "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, 2015. Great references at the back, with comments on seminal papers.

    [bengio2009] Y. Bengio. "Learning Deep Architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1, 2009. An earlier general reference on the fundamentals of deep learning.

    [krizhevsky2012] A. Krizhevsky, I. Sutskever, and G. E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks," Adv. Neural Inf. Process. Syst., pp. 1-9, 2012. The beginning of the current craze.

    [karpathy2015] Andrej Karpathy's blog - http://karpathy.github.io - easy-to-follow explanations with code.

    [deshpande2016] A. Deshpande. "The 9 Deep Learning Papers You Need To Know About," github.io blog post, 2016. https://adeshpande3.github.io/adeshpande3.github.io/The-9-Deep-Learning-Papers-You-Need-To-Know-About.html
