
Page 1

Practical deep learning

Markus Koskela

Mats Sjöberg

CSC – IT Center for Science Ltd, Espoo

February 13–14, 2019

Page 2

All original material (C) 2019 by CSC – IT Center for Science Ltd.

Licensed under a Creative Commons Attribution-ShareAlike 4.0 International License,

http://creativecommons.org/licenses/by-sa/4.0

All other material copyrighted by their respective authors.

Page 3

Agenda

Up-to-date agenda and lecture slides can be found at https://tinyurl.com/y83ctvug

Exercise materials are at GitHub: https://github.com/csc-training/intro-to-dl/

Wireless accounts for the CSC-guest network are behind the badges. Alternatively, use the eduroam network with your university accounts, or the LAN cables on the tables.

Accounts for the Taito-GPU cluster will be delivered separately.

Day 1: Notebooks

9:00-10:30 Lecture 1: Introduction to deep learning

10:30-10:45 Break

10:45-11:00 Exercise 1: Introduction to Notebooks, Keras fundamentals

Jupyter notebook: keras-test-setup.ipynb

11:00-11:30 Lecture 2: Multi-layer perceptron networks

11:30-12:00 Exercise 2: Classification with MLPs

Jupyter notebook: keras-mnist-mlp.ipynb

12:00-13:00 Lunch

13:00-14:00 Lecture 3: Image data, convolutional neural networks

14:00-14:30 Exercise 3: Image classification with CNNs

Jupyter notebook: keras-mnist-cnn.ipynb

14:30-14:45 Break

14:45-15:30 Lecture 4: Text data, embeddings, 1D CNN, recurrent neural networks, attention

15:30-16:00 Exercise 4: Text sentiment classification with CNNs, RNNs

Jupyter notebooks: keras-imdb-cnn.ipynb, keras-imdb-rnn.ipynb

Day 2: Taito-GPU

9:00-9:45 Lecture 5: Introduction to PyTorch

9:45-10:15 Lecture 6: GPUs, batch jobs, using Taito-GPU

10:15-10:30 Break

10:30-12:00 Exercise 5: Image classification: dogs vs. cats; traffic signs

12:00-13:00 Lunch

13:00-14:00 Exercise 6: Text categorization and labeling: 20 newsgroups; Ted talks

14:00-14:45 Lecture 7: Cloud, GPU utilization, multi-GPU

14:45-15:00 Break

15:00-16:00 Exercise 7: Using multiple GPUs

Page 4

Page 5

Practical deep learning

Lecture 1: Introduction to deep learning

About this course

• Introduction to deep learning
  – basics of ML assumed
  – mostly high-school math
  – much of theory, many details skipped

• 1st day: lectures + small-scale exercises using notebooks.csc.fi
• 2nd day: mid-scale experiments using GPUs at Taito-GPU
• Slides at: https://tinyurl.com/y83ctvug
• Other materials (and link to Gitter) at GitHub: https://github.com/csc-training/intro-to-dl/
• Focus on text and image classification, no fancy stuff
• Using Python, Keras, and PyTorch

Further resources

• This course is largely “inspired by”: “Deep Learning with Python” by François Chollet

• Recommended textbook: “Deep Learning” by Goodfellow, Bengio, and Courville

• Lots of further material available online, e.g.:
  http://cs231n.stanford.edu/
  http://course.fast.ai/
  https://developers.google.com/machine-learning/crash-course/
  www.nvidia.com/dlilabs
  http://introtodeeplearning.com/
  https://github.com/oxford-cs-deepnlp-2017/lectures

• Academic courses

What is artificial intelligence?

Artificial intelligence is the ability of a computer to perform tasks

commonly associated with intelligent beings.

What is machine learning?

Machine learning is the study of algorithms that learn from examples and experience, instead of relying on hard-coded rules, and can make predictions on new data.

What is deep learning?

Deep learning is a subfield of machine learning focusing on

learning data representations as successive layers of increasingly

meaningful representations.

Page 6

Image from https://blogs.nvidia.com/blog/2016/07/29/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/

(Figure: “Traditional” machine learning: input image → handcrafted features → learned classifier → “cat”. Deep, “end-to-end” learning: input image → learned low-level features → learned mid-level features → learned high-level features → learned classifier → “cat”.)

From: Wang & Raj, On the Origin of Deep Learning (2017)

Demotivational slide

“All of these AI systems we see, none of them is ‘real’ AI.” – Josh Tenenbaum

“Neural networks are … neither neural nor even networks.” – François Chollet, author of Keras

Main types of machine learning

• Supervised learning
• Unsupervised learning
• Self-supervised learning
• Reinforcement learning

Page 7

Main types of machine learning

• Supervised learning
• Unsupervised learning
• Self-supervised learning
• Reinforcement learning

Images: by Chire [CC BY-SA 3.0], from Wikimedia Commons; from https://arxiv.org/abs/1710.10196; animation from https://yanpanlau.github.io/2016/07/10/FlappyBird-Keras.html

Fundamentals of machine learning

Data

• Humans learn by observation and unsupervised learning
  – model of the world / common sense reasoning
• Machine learning needs lots of (labeled) data to compensate

• Tensors: generalization of matrices to n dimensions (or rank, order, degree)
  – 1D tensor: vector
  – 2D tensor: matrix
  – 3D, 4D, 5D tensors
  – numpy.ndarray(shape, dtype)

• Training – validation – test split (+ adversarial test)

• Minibatches
  – small sets of input data used at a time
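As a concrete illustration of tensors and the training – validation – test split, here is a minimal numpy sketch (the array shapes and the 80/10/10 split are illustrative assumptions, not taken from the course material):

import numpy as np

# Tensors of increasing rank as numpy.ndarray(shape, dtype)
vector = np.zeros((100,), dtype=np.float32)           # 1D tensor: vector
matrix = np.zeros((28, 28), dtype=np.float32)         # 2D tensor: matrix
images = np.zeros((64, 28, 28, 1), dtype=np.float32)  # 4D tensor: a minibatch of 64 grayscale images

# Training - validation - test split (80/10/10, illustrative)
n = 1000
x = np.random.rand(n, 28, 28, 1)
y = np.random.randint(0, 10, size=n)

idx = np.random.permutation(n)
x_train, y_train = x[idx[:800]], y[idx[:800]]
x_val, y_val = x[idx[800:900]], y[idx[800:900]]
x_test, y_test = x[idx[900:]], y[idx[900:]]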

Page 8

Model – learning/training – inference

• parameters 𝜃 and hyperparameters
• http://playground.tensorflow.org/

Optimization

• Mathematical optimization: “the selection of a best element (with regard to some criterion) from some set of available alternatives” (Wikipedia)

• Main types: finite-step, iterative, heuristic

• Learning as an optimization problem
  – cost function: loss + regularization

By Rebecca Wilson (originally posted to Flickr as Vicariously) [CC BY 2.0], via Wikimedia Commons

Optimization

Image from: Li et al. “Visualizing the Loss Landscape of Neural Nets”, arXiv:1712.09913

Gradient descent

• Derivative and minima/maxima of functions

• Gradient: the derivative of a multivariable function

• Gradient descent: repeatedly take a step in the direction of the negative gradient, θ ← θ − η ∇θ J(θ), where η is the learning rate

• (Mini-batch) stochastic gradient descent (and its variants)

Image from: https://towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3
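A minimal numpy sketch of the gradient-descent update on a toy least-squares problem (the data, loss, learning rate, and number of steps are illustrative assumptions):

import numpy as np

# Toy data: y = 3*x + noise
x = np.random.rand(100)
y = 3.0 * x + 0.1 * np.random.randn(100)

theta = 0.0   # parameter to learn
eta = 0.1     # learning rate

for step in range(200):
    # mean squared error loss J(theta) = mean((theta*x - y)^2) and its gradient
    grad = 2.0 * np.mean((theta * x - y) * x)
    theta = theta - eta * grad   # gradient-descent update

print(theta)   # close to 3.0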

Over- and underfitting, generalization, regularization

• Models with lots of parameters can easily overfit to the training data

• Generalization: the quality of an ML model is measured on new, unseen samples

• Regularization: any method* to prevent overfitting
  – simplicity, sparsity, dropout, early stopping
  – *) other than adding more data

By Chabacano [GFDL or CC BY-SA 4.0], from Wikimedia Commons

Deep learning

Page 9

Anatomy of a deep neural network

• Layers
• Input data and targets
• Loss function
• Optimizer

Layers

• Data processing modules
• Many different kinds exist
  – densely connected
  – convolutional
  – recurrent
  – pooling, flattening, merging, normalization, etc.
• Input: one or more tensors; output: one or more tensors
• Usually have a state, encoded as weights
  – learned, initially random
• When combined, form a network or a model

Input data and targets

• The network maps the input data X to predictions Y′

• During training, the predictions Y′ are compared to true targets Y using the loss function


Loss function

• The quantity to be minimized (optimized) during training
  – the only thing the network cares about
  – there might also be other metrics you care about

• Common tasks have “standard” loss functions:
  – mean squared error for regression
  – binary cross-entropy for two-class classification
  – categorical cross-entropy for multi-class classification
  – etc.

• https://lossfunctions.tumblr.com/

Optimizer

• How to update the weights based on the loss function

• Learning rate

• Stochastic gradient descent, momentum, and their variants
  – RMSProp is usually a good first choice
  – more info: http://ruder.io/optimizing-gradient-descent/

Animation from: https://imgur.com/s25RsOr

Anatomy of a deep neural network
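These four pieces map directly onto a Keras model definition; a minimal sketch (the layer sizes and the 100-dimensional input are illustrative assumptions):

from keras.models import Sequential
from keras.layers import Dense

# Layers
model = Sequential()
model.add(Dense(64, activation='relu', input_dim=100))
model.add(Dense(10, activation='softmax'))

# Loss function and optimizer
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Input data X and targets Y are supplied at training time:
# model.fit(x_train, y_train, batch_size=32, epochs=10)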

Page 10

Deep learning frameworks

• Actually tools for defining static or dynamic general-purpose computational graphs

• Automatic differentiation

• Seamless CPU / GPU usage
  – multi-GPU, distributed

• Python/numpy or R interfaces
  – instead of C, C++, or CUDA

• Open source

(Figure: example computational graph with inputs x, y, and 5, combined with arithmetic operation nodes)

Deep learning frameworks

• Keras is a high-level neural networks API
  – we will use TensorFlow as the compute backend
  – https://keras.io/

• PyTorch is:
  – a GPU-based tensor library
  – an efficient library for dynamic neural networks
  – https://pytorch.org/

(Diagram: software stack with front ends (Keras, TF Estimator, torch.nn, Gluon, Lasagne) on top of frameworks (TensorFlow, Theano, CNTK, PyTorch, MXNet, Caffe), which run on CUDA/cuDNN (GPUs) and MKL/MKL-DNN (CPUs))

Page 11

Practical deep learning

Lecture 2: Multi-layer perceptron networks

Neuron as a linear classifier

By User:ZackWeinberg, based on PNG version by User:Cyc [CC BY-SA 3.0], via Wikimedia Commons

A non-linear classifier?

Activation function

• A smooth (differentiable) nonlinear function that is applied after the inner product with the weights

• Common functions: e.g. sigmoid, tanh, ReLU

Neural network

• An (artificial) neural network is a collection of neurons

• Usually organized in layers
  – input layer
  – one or more hidden layers (sizes and activation functions are hyperparameters)
  – output layer (size typically equals the number of classes in classification; the activation function should be compatible with the training labels)

By Glosser.ca [CC BY-SA 3.0], via Wikimedia Commons

Page 12

Backpropagation

• Neural networks are trained with gradient descent, starting from a random weight initialization

• Backpropagation is the algorithm for computing the gradients for a neural network

• Based on the chain rule of calculus: the gradient of the loss with respect to each weight is obtained by multiplying the local derivatives of the layers backwards from the output

Multilayer perceptron (MLP) / Dense network

• Classic feedforward neural network

• Densely connected: all inputs from the previous layer connected

• In Keras:
  keras.layers.Dense(units, activation=None)
  or:
  keras.layers.Dense(units)
  keras.layers.Activation(activation)

Dropout

• randomly sets a fraction rate of the input units to 0 at each update during training

• helps to prevent overfitting (regularization)

• In Keras: keras.layers.Dropout(rate)

Image from Srivastava et al (2014), JMLR 15: 1929-1958
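A minimal Keras sketch combining Dense and Dropout layers (the 784-dimensional input, layer sizes, and dropout rate are illustrative assumptions in the spirit of the MNIST MLP exercise):

from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(512, activation='relu', input_dim=784))  # hidden layer
model.add(Dropout(0.2))                                  # randomly drop 20% of the units during training
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(10, activation='softmax'))               # output layer: 10 classes

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])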

Page 13

Practical deep learning

Lecture 3: Images and convolutional neural networks

Computer vision

Computer vision = giving computers the ability to understand visual information

Examples:
○ A robot that can move around obstacles by analysing the input of its camera(s)
○ A computer system finding images of cats among millions of images on the Internet

From picture to pixels

An image has to be digitized for computer processing: it is turned into millions of “pixel” elements, each a set of numbers quantifying the color of that element.

0.49411765 0.49411765 0.4745098  0.49019608 0.4745098
0.49411765 0.49411765 0.5058824  0.49411765 0.49803922
0.49803922 0.49411765 0.4862745  0.47058824 0.49411765
0.5019608  0.49803922 0.49803922 0.49019608 0.50980395
0.50980395 0.5058824  0.52156866 0.50980395 0.5058824

Picture source: https://pixabay.com/en/kitty-cat-kid-cat-domestic-cat-2948404/

From pixels to … understanding?

There’s a cat among some flowers in the grass.

● This is easy for humans
● But for AI it’s actually one of the harder problems!
● How do you transform that grid of numbers into understanding… or even something useful?

Image understanding

• Humans are so good at vision that it’s not even considered intelligence

Convolutional neural networks

Deep learning for Computer vision

Page 14

Convolutional neural network (CNN, ConvNet)

● Dense or fully-connected: each neuron connected to all neurons in the previous layer
● CNN: only connected to a small “local” set of neurons
● Radically reduces the number of network connections

(Figure: dense layer vs. convolutional layer)

Convolution for image data

● Image represented as a 2D grid of values
● Each output neuron connected to a small 2D area in the image
● Output value = weighted sum of inputs
● Idea: nearby pixels are related ⇒ we can learn local relationships of pixels

(Figure: 3✕3 image area, 3✕3 weights (conv. kernel), output neuron)
Image source: https://mlnotebook.github.io/post/CNN1/

Convolution for image data

● We repeat for each output neuron
● Weights stay the same (shared weights)
● Border effect: without padding the output area is smaller
● Outputs form a “feature map”

(Figure: image input, 3✕3 weights (conv. kernel), feature map)
Image source: https://mlnotebook.github.io/post/CNN1/

A real example

Image from: http://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/fergus_dl_tutorial_final.pptx

Side note: color images

● Example: 256 ✕ 256 color image with 3 color channels (red, green, and blue)
  ⇒ a single image is a 3D tensor: 256 ✕ 256 ✕ 3
● Example: a 5 ✕ 5 convolution is actually also a 3D tensor: 5 ✕ 5 ✕ 3
● Slides over width and height, but covers the full color depth

Convolution for image data

● We can repeat for different sets of weights (kernels)
● Each learns a different “feature”
● Typically: edges, corners, etc.
● Each outputs a feature map

(Figure: image 256✕256✕3 → K kernels, each 5✕5(✕3) → K feature maps, each 252✕252✕1)

Page 15

Convolution for image data

● We stack the feature maps into a single tensor
● Depth of the output tensor = number of kernels K
● This tensor is the output of the entire convolutional layer

(Figure: image 256✕256✕3 → K kernels, each 5✕5(✕3) → output tensor 252✕252✕K)

Convolution in layers: intuition

● We can then add another convolutional layer
● This operates on the previous layer’s output tensor (feature maps)
● Features layered from simple to more complex

(Figure: learned low-level features → learned mid-level features → learned high-level features → learned classifier → “cat”)
Image from a lecture by Yann LeCun, original from Zeiler & Fergus (2013)

Image datasets

• Color image mini-batches are 4D tensors: width ✕ height ✕ color channels ✕ samples

• Plenty of big datasets for training exist, e.g., ImageNet with 1.2 million images in 1000 classes

• Data augmentation for small datasets: generate more training data by transforming existing data

• E.g., shifting, rotation, cropping, scaling, adding noise, etc.
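In Keras, such augmentation can be applied on the fly with ImageDataGenerator; a minimal sketch (the particular transformation ranges are illustrative assumptions):

from keras.preprocessing.image import ImageDataGenerator

# Generate randomly transformed variants of the training images
datagen = ImageDataGenerator(
    rotation_range=20,        # random rotations up to 20 degrees
    width_shift_range=0.1,    # random horizontal shifts
    height_shift_range=0.1,   # random vertical shifts
    zoom_range=0.1,           # random zooming (scaling)
    horizontal_flip=True)     # random horizontal flips

# Train on batches of augmented images, e.g.:
# model.fit_generator(datagen.flow(x_train, y_train, batch_size=32), epochs=10)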

Convolutional layers

• Input: tensor of size N × W_i × H_i × C_i

• Hyperparameters:
  – K: number of filters
  – w, h: kernel size
  – padding: how to handle image borders
  – activation function

• Output: tensor of size N × W_o × H_o × K

• In Keras: keras.layers.Conv2D(filters, kernel_size, padding, activation)
  (there are also Conv1D and Conv3D)

Pooling layers

• Used to reduce the spatial resolution
  – independently on each channel
  – reduce complexity and number of parameters

• MAX operator most common
  – sometimes also AVERAGE

• In Keras:
  keras.layers.MaxPooling2D(pool_size)
  keras.layers.AveragePooling2D(pool_size)

Image from http://cs231n.github.io/convolutional-networks/

Page 16

Other layers

• Flatten
  – flattens the input into a vector (typically before dense layers)

• Dropout
  – similar as with dense layers

• In Keras:
  keras.layers.Flatten()
  keras.layers.Dropout(rate)

Typical architecture

1. Input layer = image pixels
2. Convolution
3. ReLU
4. Pooling (repeat steps 2–4 one or more times)
5. One or more fully connected layers (+ ReLU)
6. Final fully connected layer to get to the number of classes we want
7. Softmax to get a probability distribution over classes
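A minimal Keras sketch of this typical architecture for 28✕28✕1 images (the filter counts and layer sizes are illustrative assumptions, roughly in the spirit of the keras-mnist-cnn.ipynb exercise):

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
# convolution + ReLU + pooling (repeated)
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
# fully connected layers
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(10, activation='softmax'))   # 10 classes, softmax output

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])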

AlexNet

VGG

Inception / GoogLeNet

ResNet

DenseNet

Large-scale CNNs with pre-trained weights

• For many applications, an existing CNN can be re-used instead of training a new model from scratch: extract features from a suitable layer, or fine-tune the top layers with new data

• Keras contains several models trained with ImageNet:
  – Xception, VGG16, VGG19, ResNet50, InceptionV3, InceptionResNetV2, MobileNet, DenseNet, NASNet

(Figure: lower layers provide extracted features; top layers are re-initialized and trained)
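A minimal sketch of re-using a pre-trained Keras model as a feature extractor (the choice of VGG16, the 224✕224 input size, and the file name cat.jpg are illustrative assumptions):

from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing import image
import numpy as np

# VGG16 trained on ImageNet, without the final classification layers
base_model = VGG16(weights='imagenet', include_top=False, pooling='avg')

# Extract a feature vector for one image
img = image.load_img('cat.jpg', target_size=(224, 224))
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
features = base_model.predict(x)   # shape (1, 512)

# These features can then be fed to a new, small classifier trained on your own data,
# or the top layers can be fine-tuned together with the new data.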

Some selected applications

• Object detection: https://pjreddie.com/darknet/yolo/

• Semantic segmentation: https://www.youtube.com/watch?v=qWl9idsCuLQ

• Self-driving cars: https://www.youtube.com/watch?v=mCj_C1NOVxw

• Human pose estimation: https://www.youtube.com/watch?v=pW6nZXeWlGM

• Video recognition: https://valossa.com/

• Digital pathology: https://www.aiforia.com/

Page 17

Practical deep learning

Lecture 4: Text, embeddings, 1D CNN, recurrent neural networks, attention

Representations for text

Sequence data

By Mogrifier5 [CC BY-SA 3.0], from Wikimedia Commons

By Der Lange 11/6/2005, http://commons.wikimedia.org/w/index.php?title=File:Spike-waves.png&action=edit&section=2

Text data

• sequence of words (or characters)
• main representations:
  – one-hot encoding
  – word embedding

raw text → (preprocessing) → cleaned text → (tokenization) → tokens → (vectorization) → one-hot encoding or word embedding

One-hot encoding and bag-of-words

• dimensionality equals the number of distinct tokens in the dictionary
  – 1000’s or 10000’s
• tokens are independent of each other
• bag-of-words loses the ordering of tokens
  – lots of important applications: IR etc.
  – n-grams

Example: cleaned text → tokens → one-hot encoding (via a dictionary) → bag of words

The cat is in the moon.
[“the”, “cat”, “is”, “in”, “the”, “moon”]
{“a”: 1, “aardvark”: 2, “aardwolf”: 3, …}
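A minimal sketch of tokenization and bag-of-words vectorization with the Keras Tokenizer (the 10000-word vocabulary and the example sentences are illustrative assumptions):

from keras.preprocessing.text import Tokenizer

texts = ['The cat is in the moon.', 'The dog is in the grass.']

# build a dictionary of the 10000 most frequent tokens
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(texts)

sequences = tokenizer.texts_to_sequences(texts)       # tokens as integer indices
bow = tokenizer.texts_to_matrix(texts, mode='count')  # bag-of-words vectors

print(sequences[0])   # e.g. [1, 4, 2, 3, 1, 5]
print(bow.shape)      # (2, 10000)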

Word embeddings

• dense vector representations
  – dimensionality typically much lower than in one-hot ⇒ bag-of-words not needed
  – learned based on the context of words

• semantics
  – similar words have similar vectors
  – directions in the vector space map to semantic relationships

• context-free and contextual embeddings

• either learn from data or use a pre-trained embedding

Page 18

(Context-free) word embeddings

Images from: https://www.tensorflow.org/tutorials/word2vec and http://wiki.fast.ai/index.php/Lesson_5_Notes

Standalone word embedding algorithms

• unsupervised learning, no annotation needed

• popular context-free algorithms include:
  – word2vec (CBoW and skip-gram)
  – GloVe
  – fastText

• recently proposed contextual algorithms include:
  – ELMo
  – BERT

• learn a task-based embedding or use a pre-trained one?
  – pre-trained embeddings encode general semantic relationships
  – need to handle OOV (out-of-vocabulary) words
  – task-based embeddings may sometimes be better if there is enough data

Word sequence embedding

• usually a fixed-size matrix or sequence

• in Keras:
  keras.layers.Embedding(input_dim, output_dim, input_length, trainable, weights)

Example: cleaned text → tokens → padding / truncation → learned embedding

The cat is in the moon.
[“the”, “cat”, “is”, “in”, “the”, “moon”]
[“the”, “cat”, “is”, “in”, “the”, “moon”, ∅, ∅, ∅, ∅]
⇒ sequence embedding: a 10 × N matrix, or a sequence of length 10
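A minimal Keras sketch of padding token sequences and mapping them through an Embedding layer (the vocabulary size, sequence length of 10, embedding dimension, and token indices are illustrative assumptions):

from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding

# token indices for "The cat is in the moon." (illustrative values)
sequences = [[1, 12, 7, 9, 1, 830]]

# pad / truncate all sequences to length 10
x = pad_sequences(sequences, maxlen=10, padding='post')

model = Sequential()
# maps each of the 10 token indices to a 50-dimensional dense vector
model.add(Embedding(input_dim=10000, output_dim=50, input_length=10))

print(model.predict(x).shape)   # (1, 10, 50): a 10 x 50 embedding matrix per sample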

Deep learning for sequences

• first layer is usually an embedding

• then there are three main approaches (that can also be combined):
  – 1D convolutional layers
  – recurrent layers
  – attention

• last layers are often dense

(1) CNNs for sequences

• a fixed-length embedded sequence is a matrix
  – can be considered as an image ⇒ CNNs can be applied

• 1D convolution
  – as we want to process the full embedding each time

• simple and cheap approach for simple tasks

• in Keras:
  keras.layers.Conv1D(filters, kernel_size, padding, activation)
  keras.layers.MaxPooling1D(pool_size)
  keras.layers.GlobalMaxPooling1D()
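A minimal 1D-CNN text classifier in Keras, roughly in the spirit of the keras-imdb-cnn.ipynb exercise (the vocabulary size, sequence length, and layer sizes are illustrative assumptions):

from keras.models import Sequential
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=50, input_length=500))
model.add(Conv1D(128, 5, activation='relu'))   # 128 filters over 5-token windows
model.add(GlobalMaxPooling1D())                # keep the strongest response of each filter
model.add(Dense(1, activation='sigmoid'))      # binary sentiment output

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])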

Page 19

1D convolution

(Figure: a 1D convolution sliding over the embedded token sequence)

(2) Recurrent neural networks

• MLPs and CNNs expect fixed-sized input, not sequences
• RNNs have memory and recurrent connections, i.e. loops
• the last output contains a representation of the whole sequence
• learning by backpropagation through time
  – vanishing or exploding gradients!

By François Deloche [CC BY-SA 4.0], from Wikimedia Commons

Recurrent neural networks

Image from: http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Long short-term memory (LSTM) network

• specialized architecture to solve the vanishing gradient problem
  – additional “conveyor belt” dataflow to carry information across timesteps
  – “forget”, “input”, and “output” gates

By François Deloche [CC BY-SA 4.0], from Wikimedia Commons

LSTM layer

• simple RNNs do not usually work in practice ⇒ use LSTM or its variants (e.g. GRU)

• can also be used bidirectionally

• cuDNN kernels may be >20 times faster on GPUs

• in Keras:
  keras.layers.LSTM(units, return_sequences)
  keras.layers.CuDNNLSTM(units, return_sequences)
  keras.layers.GRU(units, return_sequences)
  keras.layers.CuDNNGRU(units, return_sequences)
  keras.layers.Bidirectional(layer, merge_mode)
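A minimal LSTM classifier in Keras, roughly in the spirit of the keras-imdb-rnn.ipynb exercise (the vocabulary size and layer sizes are illustrative assumptions):

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=50, input_length=500))
model.add(LSTM(128))                        # the last output summarizes the whole sequence
model.add(Dense(1, activation='sigmoid'))   # binary sentiment output

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])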

Language models and text generation

• RNNs can be trained to predict the next word, and then used to generate novel text (or music, etc.)

Image from: https://github.com/oxford-cs-deepnlp-2017/lectures/blob/master/Lecture%204%20-%20Language%20Modelling%20and%20RNNs%20Part%202.pdf

Page 20

Encoder–decoder (seq2seq) networks

Image from: https://devblogs.nvidia.com/introduction-neural-machine-translation-gpus-part-2/

“You can’t cram the meaning of a whole %&!$ing sentence into a single $&!*ing vector!”

-- Ray Mooney

(3) Attention

Problem: the final encoder output vector is a bottleneck

Solution: attention

• allows the model to focus on the relevant part of the input sequence

• all encoder output vectors are passed to the decoder and are weighted using a learned alignment

Image from: https://arxiv.org/abs/1409.0473

Attention is all you need

• Self-attention: relating different positions of a sequence in forming its representation

Transformer: image from https://arxiv.org/abs/1706.03762
BERT: image from https://arxiv.org/abs/1810.04805

Some applications

• text classification and annotation
• author identification
• chatbots
• reading comprehension / QA
• image & video captioning
• speech recognition
• handwritten text recognition

Page 21

Practical deep learning

Lecture 5: Introduction to PyTorch

Software frameworks for deep learning

Deep learning frameworks: arXiv mentions

Software frameworks for deep learning

• TensorFlow most popular, but not so easy to use and debug
  – Keras is an easy-to-use neural network “front end” for TensorFlow

• PyTorch is a Python version of Torch (Lua-based)
  – Getting a lot of traction recently, especially in research

Keras versus PyTorch

We’ll discuss two main differences between Keras and PyTorch:

• Static versus dynamic computational graphs

• Sequential versus functional style

… although these days both frameworks support all modes and styles


Computational graphs

• Any mathematical computation can be expressed as a computational graph

• Neural networks are just a (huge) number of simple computations

• With the graph it is easy to automatically calculate the gradients backwards for each node (backpropagation!)

• Both Keras and PyTorch work in this way

https://en.wikipedia.org/wiki/Automatic_differentiation

(Figure: computational graph with inputs x, y, and 5, combined with arithmetic operation nodes)

Page 22

Keras: static computational graph

• In Keras (and TF) the graph is static
  – First define a fixed graph, with inputs being undefined or abstract (variables)
  – Then “execute” the graph with specific inputs

• This can be cumbersome and hard to debug

• In theory fast, as the graph can be optimized during compilation

PyTorch: dynamic computational graph

• In PyTorch the graph is defined dynamically
  – You define concrete tensors, e.g., x = torch.tensor(42.)
  – Then just write the calculations, e.g., z = x*y + 5*x + 5
  – The computational graph is generated “on the fly” in the background

• Easy to debug, feels more like normal Python coding
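A minimal PyTorch sketch of the dynamic graph and automatic differentiation for the expression above (the concrete values are illustrative):

import torch

x = torch.tensor(42., requires_grad=True)
y = torch.tensor(2.)

# the graph for z = x*y + 5*x + 5 is built on the fly as the expression is evaluated
z = x * y + 5 * x + 5

z.backward()    # automatic differentiation through the recorded graph
print(x.grad)   # dz/dx = y + 5 = 7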

Keras: sequential style

• Keras models typically defined in a sequential style

• Each layer is added in sequence to a list

Example: 2-layer MLP:

model = Sequential()

model.add(Dense(units=64, activation='relu', input_dim=100))

model.add(Dense(units=10, activation='softmax'))

model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

model.fit(x_train, y_train)


Keras: functional style

• Keras also supports a functional style

• Each step is written as a function of some input (or output from a previous step)

Example: 2-layer MLP:

inputs = Input(shape=(100,))

x = Dense(64, activation='relu')(inputs)

predictions = Dense(10, activation='softmax')(x)

model = Model(inputs=inputs, outputs=predictions)

model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

model.fit(x_train, y_train)

(The Input has no value; it is just an “open slot” for a future value.)

PyTorch: functional with subclassing

The network is defined as a Python class:

import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(100, 64)
        self.fc2 = nn.Linear(64, 10)

    def forward(self, inputs):
        x = F.relu(self.fc1(inputs))
        predictions = self.fc2(x)
        return predictions

net = Net()
optimizer = optim.RMSprop(net.parameters())
criterion = nn.CrossEntropyLoss()

We have to handle the training loop, backpropagation, and weight updates manually:

for i in range(num_epochs):
    for x_train, y_train in batch_loader:
        optimizer.zero_grad()               # reset gradients
        outputs = net(x_train)              # forward pass
        loss = criterion(outputs, y_train)
        loss.backward()                     # backpropagation
        optimizer.step()                    # weight update

Sequential versus functional

These days Keras and PyTorch support all styles!
… but some are more supported than others

          Sequential       Functional                              Func. with classes
Keras     Yes, canonical   Supported                               Supported, but limited
PyTorch   Supported        In theory, but does not make sense...   Yes, canonical

Page 23

Common neural network modules in PyTorch

torch.nn.Linear(in_features, out_features, bias=True)

torch.nn.Dropout(p=0.5, ...)

torch.nn.Conv2d(in_channels, out_channels, kernel_size, ...)

torch.nn.Embedding(num_embeddings, embedding_dim, ...)

torch.nn.LSTM(input_size, hidden_size, num_layers=1, dropout=0,

bidirectional=False, ...)

torch.nn.GRU(input_size, hidden_size, num_layers=1, dropout=0,

bidirectional=False, ...)


Keras or PyTorch?

We’ll provide examples on how to do things with PyTorch; it’s up to you whether you wish to learn PyTorch or stick with Keras.

• PyTorch allows more control and customization, and easier experimentation with new architectures

• Keras is easier if you just want to apply deep learning and not do research

Useful PyTorch links:
https://pytorch.org/tutorials/
https://pytorch.org/docs/stable/index.html

Page 24

Practical deep learning

Lecture 6: GPUs, batch jobs, using Taito-GPU

GPU computing

• CPUs are optimized for latency, whereas GPUs are optimized for throughput

• CSC’s GPU nodes with P100’s:

                  #cores     max clock speed   memory
  2 x Xeon CPUs   2 x 14     3.30 GHz          512 GB
  4 x P100 GPUs   4 x 3584   1.48 GHz          4 x 16 GB

CSC’s solutions

• Research administration
• Computing and software
• Data management and analytics for research
• Support and training for research
• Solutions for managing and organizing education
• Solutions for learners and teachers
• Solutions for educational and teaching cooperation
• Hosting services tailored to customers’ needs
• Identity and authorisation
• Management and use of data

ICT platforms, the Funet network, and data center functions are the base for our solutions.

Page 25

CSC’s computing resources

• Supercomputer (Sisu)

• Supercluster (Taito)

• Cloud services (cPouta, ePouta)

• Accelerated computing (GPUs, Pouta and Taito-GPU)

• Grid (FGCI)

• International resources

– Extremely large computing (PRACE)

– Nordic resources (NEIC)

Taito-GPU

The P100 nodes consist of 20 Dell PowerEdge C4130 servers with:

• 2x Xeon E5-2680 v4 CPUs with 14 cores each, running at 2.4 GHz
• 512 GB of DDR4 memory
• 4x P100 GPUs, connected in pairs to each CPU
• 2x 800 GB of SATA SSD scratch space

The K80 nodes consist of 12 Dell PowerEdge C4130 servers with:

• 2x Xeon E5-2680 v3 CPUs with 12 cores each, running at 2.5 GHz
• 256 GB of DDR4 memory
• 2x K80 GPU cards, each with 2 GPUs, for a total of 4 GPUs per node; these are all connected to the first CPU
• 850 GB of HDD scratch space

DL2021 – new data management and computing infrastructure

Phase 1 computing cluster (700+ nodes), summer 2019:

● New Intel Cascade Lake CPU architecture supporting VNNI instructions for AI inference workloads

● Includes 80 “AI-specific” nodes with 320 GPUs
  - 4 NVIDIA V100 (32 GB) GPUs / node, NVLink
  - 3.2 TB local NVMe disk
  - Extremely fast network (InfiniBand 200 Gbps)

Taito compute nodes are used via a queuing system

Do not use the login node for heavy computation!

Batch jobs

Steps for running a batch job:

1. Write a batch job script

2. Make sure you have all the input files where the program can find them

3. Submit your job (sbatch batch_job_file.sh)

4. Wait (or check progress: tail slurm-jobid.out)

5. Look at the results, e.g., standard output in slurm-jobid.out

You have to specify the necessary resources:

– resources need to be sufficient for the job
– requested resources consume BUs and affect time spent in the queue
⇒ realistic resource requests give the best results

Page 26

Example batch job script on Taito-GPU

#!/bin/bash

#SBATCH -p gpu --gres=gpu:p100:1

#SBATCH -t 1:00:00 --mem=8G -c 4

srun python my_python_program.py

Relevant sbatch options

-J, --job-name name of job

-c, --cpus-per-task number of processors per task

-p partition specify partition (gpu, gputest, gpulong)

--gres=gpu:type:number request number of GPUs of type (k80, p100)

-t, --time time limit in DD-HH:MM:SS

--mem the real (host) memory required per node

-o, --output file for script’s standard output

-e, --error file for script’s standard error

Managing batch jobs

sbatch batch_job_file.sh submit a job

sbatch --options batch_job_file.sh

scancel jobid delete a job

squeue -l show all jobs in all queues (partitions)

squeue -l -p partition show all jobs in partition

squeue -l -u username show all jobs for a single user

squeue -l -j jobid show status of a single job

sinfo check all available queues

seff jobid show CPU, mem and GPU utilization

Directories and storage areas:

• $HOME * – Initialization scripts, source codes, small data files. Not for running programs or research data. Default quota: 50 GB. Storage time: data will be deleted 90 days after closing the account. Backup: yes.

• $USERAPPL – Users’ own application software. Default quota: 50 GB. Storage time: data will be deleted 90 days after closing the account. Backup: yes.

• $WRKDIR * – Temporary data storage. Default quota: 5 TB. Storage time: 90 days. Backup: no.

• $TMPDIR – Temporary users’ files. Storage time: 2 days. Backup: no.

• project – Common storage for project members. A project can consist of one or more user accounts. Quota: on request. Storage time: data will be deleted 90 days after closing the project. Backup: no.

• HPC Archive * – Long term storage. Default quota: 2.5 TB. Storage time: permanent. Backup: two copies maintained.

• IDA – Long term storage. Quota: on request. Storage time: permanent. Part of the Open Science and Research services.

• Pouta Object Storage – Storage and sharing. Default quota: 1 TB. Storage time: permanent.

See https://research.csc.fi/csc-guide-directories-and-data-storage-at-csc for more information.

Module system

• Different software packages have different, possibly conflicting, requirements

• Most commonly used module commands:
  – module help                     Show available options
  – module load modulename          Load the given environment module
    module load modulename/version
  – module list                     List the loaded modules
  – module avail                    List modules that are available to load
  – module spider                   List all existing modules
  – module spider name              Search the list of existing modules
  – module swap module1 module2     Replace a module, including compatible versions of other loaded modules
  – module unload modulename        Unload the given environment module
  – module purge                    Unload all modules

Mlpython: collections of GPU-optimized ML frameworks

• E.g., python-env/2.7.10-ml or python-env/3.6.3-ml

• GPU-optimized versions of ML frameworks, including:
  - TensorFlow
  - Keras
  - PyTorch

• Usage, e.g.:
  module purge
  module load python-env/3.6.3-ml

• See https://research.csc.fi/-/mlpython for more information

Page 27

TensorBoard

• Tool to visualize TF graphs, plot quantitative metrics, and show additional data like images

• Operates by reading TF event files, which contain summary data that can be generated while running TF (or Keras, PyTorch, etc.)

• Instructions in the exercises if you want to try

TensorBoard at Taito-GPU

Page 28

Practical deep learning

Lecture 7: Cloud environment, GPU utilization, using multiple GPUs

Data analytics in the cloud

• Cloud environments allow flexible data analytics
  – Pouta (OpenStack) allows you to run and manage your own VMs
    ▪ GPU nodes and IO-intensive nodes
    ▪ ePouta for sensitive data
  – Rahti (OpenShift) allows you to run and manage your own containers (under development, in limited beta)

• Pouta Object Storage for shared data storage

• Good for: web applications, big data frameworks, installing custom software, building computing infrastructure

GPU utilization

• a GPU cannot be shared among users
  – running multiple parallel processes is possible (in theory) but cumbersome
  ⇒ GPU jobs should be optimized to utilize the GPU as efficiently as possible

• standard solution: increase the mini-batch size

• monitor your GPU usage:
  seff jobid
  ssh gxxx nvidia-smi [dmon]

Using multiple GPUs for model training

• Model and data parallelism

• Single-node multi-GPU and distributed training

• All the main frameworks offer some level of support
  – TensorFlow and PyTorch are good choices for distributed training
  – high-level APIs such as Keras may not be optimal
  – external tools: Horovod, Gloo

(Figures: model parallelism vs. data parallelism)

Page 29

Model parallelism in AlexNet

Model parallelism in Google’s NMT
Image from https://arxiv.org/pdf/1609.08144.pdf

MPI allreduce

• In data parallelism, we need to gather all gradients and send the mean of the gradients back to all GPUs

Images from: https://cwiki.apache.org/confluence/display/MXNET/Single+machine+All+Reduce+Topology-aware+Communication

Horovod and ring-allreduce

• Horovod is a Python framework for distributed deep learning
  – supports TensorFlow, Keras, and PyTorch

• uses NVIDIA’s NCCL 2, which provides a highly optimized version of ring-allreduce

• uses MPI, which launches all tasks and transparently sets up the distributed infrastructure for communication between tasks
  – readily compatible with Slurm!

Image from: https://eng.uber.com/horovod/
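A minimal sketch of what data-parallel Keras training with Horovod roughly looks like (the model, data, and hyperparameters are illustrative assumptions; GPU pinning and learning-rate scaling are omitted):

import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense
import horovod.keras as hvd

hvd.init()   # one Horovod process per GPU, launched via MPI / Slurm

model = Sequential()
model.add(Dense(64, activation='relu', input_dim=100))
model.add(Dense(10, activation='softmax'))

# wrap the optimizer so that gradients are averaged across workers with ring-allreduce
opt = hvd.DistributedOptimizer(keras.optimizers.RMSprop())
model.compile(optimizer=opt, loss='categorical_crossentropy')

# make sure all workers start from the same initial weights
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

x_train = np.random.rand(1000, 100)
y_train = keras.utils.to_categorical(np.random.randint(10, size=1000), 10)
model.fit(x_train, y_train, batch_size=32, epochs=1, callbacks=callbacks)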

GPU topology

CSC’s P100 nodes:
(Figure: two CPUs, each with a PCIe switch connecting a pair of GPUs: GPU0/GPU1 and GPU2/GPU3)

NVIDIA DGX-1:
(Figure: DGX-1 GPU topology)

Page 30

GPU topology

NVIDIA DGX-2:
(Figure: DGX-2 GPU topology)

Using multiple GPUs

1. Request multiple GPUs (and CPUs) with sbatch:
   --gres=gpu:type:number    request number of GPUs of type (k80, p100)
   -c, --cpus-per-task       number of processors per task

2. Modify your code to utilize multiple GPUs
   – if you use some existing code, there might already be an option for this
   – a single process may not be able to feed the GPUs fast enough => use multiple CPU cores for data processing
   – IO easily becomes the bottleneck (especially with spinning disks, network filesystems, lots of small files)

Using multiple CPUs for ETL (extract, transform, load)

• In Keras: multiprocessing, workers

  hist = model.fit_generator(generator, ..., workers=N,
                             use_multiprocessing=True/False)

• In PyTorch: workers (multiple processes)

  train_loader = torch.utils.data.DataLoader(..., num_workers=N)

Using multiple GPUs in Keras

• Keras/TF supports single-node multi-GPU data parallelism with keras.utils.multi_gpu_model(model, gpus):

  with tf.device('/cpu:0'):
      _model = Sequential(...)
      _model.add(...)
      _model.add(...)

  model = multi_gpu_model(_model, gpus=2)
  model.compile(...)

• Notes:
  – batch_size is split among the GPUs (each gets batch_size/gpus of data)
  – to save a multi-GPU model, use .save() with the template model

Using multiple GPUs in PyTorch

• PyTorch supports single-node multi-GPU data parallelism by wrapping your model with torch.nn.DataParallel():

  model = MyModel(...)
  if torch.cuda.device_count() > 1:
      model = nn.DataParallel(model)
  model.to(device)

• Notes:
  – batch_size is split among the GPUs (each gets batch_size/gpus of data)