Hot Topics in Machine Learning (or how to win a Kaggle competition)
TRANSCRIPT
Benedikt Wilbertz, Trendiction S.A., Luxembourg
June 17, 2016
Benedikt Wilbertz (Trendiction) Hot Topics in Machine Learning June 17, 2016 1 / 41
Introduction
Setting for Supervised Learning
$\mathcal{Y}$: prediction space (labels for classification, $\mathbb{R}^K$ for regression)
$Y$: $\mathcal{Y}$-valued random variable to predict
$\mathcal{X}$: typically $\mathcal{X} = \mathbb{R}^d$, the space of predictors (aka features)
$X$: $\mathcal{X}$-valued random variable modeling the distribution of the predictors
$\mathcal{N}$: $\mathcal{N} = (\mathcal{Y} \times \mathcal{X})^N$, the space containing all training samples of size $N$
$N$: random variable representing a training sample of size $N$, independent of $X$ and $Y$
All random variables $X, Y, N$ are defined on a joint probability space $(\Omega, \mathcal{S}, P)$. Let $f_N$ be a model trained on a realization of the random variable $N$ (i.e. a random sample from $\mathcal{Y} \times \mathcal{X}$ of size $N$). The optimal model in the least-squares sense is then given by:
Supervised Learning Problem

$$\mathbb{E}\,(Y - f_N(X))^2 \to \min_{f_N \in \mathrm{models}(N)}$$
Introduction
Error decomposition
In order to assess the performance of a prediction $f_N(X)$ which was trained on a random sample of $N$ observations from $\mathcal{Y} \times \mathcal{X}$, we fix a predictor value $x := X(\omega)$ and derive for the mean squared error:

$$\begin{aligned}
\mathrm{MSE}(x) &:= \mathbb{E}\big([Y - f_N(X)]^2 \,\big|\, X = x\big) = \ldots \\
&= \underbrace{\mathbb{E}\big([Y - \mathbb{E}(Y|X=x)]^2 \,\big|\, X = x\big)}_{\sigma^2(Y|X=x)\ \text{(irreducible error)}}
+ \underbrace{\big[\mathbb{E}(Y|X=x) - \mathbb{E}(f_N(X)|X=x)\big]^2}_{\text{(model bias)}^2} \\
&\quad + \underbrace{\mathbb{E}\big([f_N(X) - \mathbb{E}(f_N(X)|X=x)]^2 \,\big|\, X = x\big)}_{\sigma^2(f_N(X)|X=x)\ \text{(model variance)}}
\end{aligned}$$
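This decomposition can be checked numerically. The following is a minimal Monte Carlo sketch (toy data and model, not from the talk): $Y = \sin(X) + \text{noise}$ is fitted by a deliberately underpowered linear model, and the bias and variance terms are estimated over many training realizations of $N$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: Y = sin(X) + noise, model = degree-1 polynomial fit.
def sample_model(n=30):
    X = rng.uniform(0, 3, n)
    Y = np.sin(X) + rng.normal(0, 0.3, n)
    return np.polyfit(X, Y, deg=1)          # train f_N on one realization of N

x = 1.5                                      # fixed predictor value x := X(omega)
preds = np.array([np.polyval(sample_model(), x) for _ in range(2000)])

noise_var = 0.3 ** 2                         # irreducible error sigma^2(Y | X=x)
bias2 = (np.sin(x) - preds.mean()) ** 2      # squared model bias
var = preds.var()                            # model variance
mse = noise_var + bias2 + var                # MSE(x) per the decomposition above
```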
Deep Learning
Neural Networks and Deep Learning
Convolutional Networks
Very popular in the late 80s and 90s:
- 7 layers
- 60k parameters
Training
Stochastic Gradient Descent
Deep Learning
Neural Networks and Deep Learning
Renaissance of neural networks in 2012
Krizhevsky et al. ’12: ImageNet Classification with Deep Convolutional Neural Networks
trained on 1.2 million labeled images
highly optimized GPU code
achieved new state-of-the-art results for classification over 1000 object classes.
Deep Learning
Neural Networks and Deep Learning
GoogLeNet (Szegedy et al. ’14)
- 27 layers deep
- 1.5 GFLOP per forward pass
- 7M parameters
Deep Learning
Neural Networks and Deep Learning
Pushing deep learning to the limits
He et al. ’16: residual networks with 1k layers and 10M parameters
But what makes the difference from the 90s (apart from sample/parameter size)?
- Data augmentation and bootstrapping
- Dropout layers
- ReLU activation
- Fast GPUs
Improvements on the training:
- Regularization
- Nesterov/AdaGrad/AdaDelta/Adam variants of SGD
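One of the listed SGD variants, Adam, can be sketched in a few lines. The hyperparameter defaults below are the commonly used ones; the quadratic objective is just an illustrative choice.

```python
import numpy as np

# Minimal Adam update sketch (one of the SGD variants listed above).
def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad                 # 1st-moment (momentum) estimate
    v = b2 * v + (1 - b2) * grad ** 2            # 2nd-moment estimate
    m_hat = m / (1 - b1 ** t)                    # bias correction
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Usage: minimize f(w) = w^2 (gradient 2w), starting from w = 1.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.05)
```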
Deep Learning
Neural Networks and Deep Learning
Software packages
Hardware
NVIDIA GTX 1080: 8.8 TFLOPS for 800 EUR
CUDA 8.0 and Pascal architecture: half-precision arithmetic (FP16) will double the computing power
Google’s TPUs (Tensor Processing Units): custom ASICs accelerating certain operations in TensorFlow
Deep Learning
Neural Networks and Deep Learning
Applications
- Processing of 40M images/day (600 img/s) from social media for logo/brand recognition (only 0.5% contain a logo)
Gradient Boosting
Trees and Boosting
History
Random Forest (Breiman ’97)
Gradient Tree Boosting (Friedman ’99)
Gradient Tree Boosting + Regularization (XGBoost)
Basic idea of tree ensembles
Model:

$$\hat{y} = \sum_{k=1}^{K} f_k(x), \qquad f_k \in \mathcal{F}$$

Tree: $f_k(x) = w_{q(x)}$, with leaf weights $w \in \mathbb{R}^T$ and structure map $q: \mathbb{R}^d \to \{1, 2, \ldots, T\}$
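A minimal sketch of this representation (toy stumps with invented thresholds): each tree is a structure map $q$ plus a leaf-weight vector $w$, and the ensemble simply sums the $K$ leaf lookups.

```python
# A regression tree as (q, w): q maps x to a leaf index, w holds leaf weights.
# Toy example: stumps splitting on feature 0; thresholds/weights are made up.
def make_stump(threshold, w_left, w_right):
    q = lambda x: 0 if x[0] < threshold else 1   # leaf index in {0, 1}
    w = [w_left, w_right]
    return lambda x: w[q(x)]                     # f_k(x) = w_{q(x)}

trees = [make_stump(0.5, -1.0, 1.0), make_stump(1.5, 0.2, -0.2)]
predict = lambda x: sum(f(x) for f in trees)     # y_hat = sum over k of f_k(x)
```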
Gradient Boosting
Trees and Boosting
Optimizing the tree structure
Objective:

$$\min \; \sum_{i=1}^{n} l(\hat{y}_i, y_i) + \sum_{k=1}^{K} \Omega(f_k)$$

with regularization $\Omega(f_k) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$ and a general loss function $l$.

Problem: Tree construction is a batch process, so we cannot apply an online method like SGD.
Gradient Boosting
Trees and Boosting
Additive Training (Boosting)
Start from a constant prediction, add a new function each time:

$$\begin{aligned}
\hat{y}_i^{(0)} &= 0 \\
\hat{y}_i^{(1)} &= f_1(x_i) = \hat{y}_i^{(0)} + f_1(x_i) \\
\hat{y}_i^{(2)} &= f_1(x_i) + f_2(x_i) = \hat{y}_i^{(1)} + f_2(x_i) \\
&\;\;\vdots \\
\hat{y}_i^{(t)} &= \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i)
\end{aligned}$$
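A toy gradient-boosting loop for squared loss illustrates the additive scheme: each round fits a small base learner (here a hypothetical constant-leaf stump) to the current residual $y - \hat{y}^{(t-1)}$ and adds a shrunken copy of it.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 4, 200)
y = np.sin(X) + rng.normal(0, 0.1, 200)

def fit_stump(X, r):
    # Best single split on X, predicting the mean residual in each half.
    best = None
    for s in np.quantile(X, np.linspace(0.1, 0.9, 17)):
        left, right = r[X < s].mean(), r[X >= s].mean()
        err = ((np.where(X < s, left, right) - r) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, s, left, right)
    _, s, lo, hi = best
    return lambda Z: np.where(Z < s, lo, hi)

pred = np.zeros_like(y)                 # y_hat^(0) = 0
for t in range(100):
    f_t = fit_stump(X, y - pred)        # fit the residual of y_hat^(t-1)
    pred += 0.1 * f_t(X)                # y_hat^(t) = y_hat^(t-1) + shrunken f_t
```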
Gradient Boosting
Trees and Boosting
The prediction at round $t$ is $\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$.

Using a 2nd-order Taylor expansion of $l$, this can be applied to any smooth loss function, such as Euclidean loss (regression), softmax loss (classification), NDCG (ranking problems), etc.

Using the gradient $g_i$ and Hessian $h_i$ of $l$, an optimal tree is grown (stopping when the regularized gain becomes negative), which optimizes in iteration $t$:

$$\begin{aligned}
\min \; & \sum_{i=1}^{n} l\big(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t) \\
\approx\; & \sum_{i=1}^{n} \Big[ l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t) + \text{const}
\end{aligned}$$

(Explicit solution for the leaf weights $w$; splits are constructed in a greedy way.)
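For squared loss (so $g_i = \hat{y}_i - y_i$ and $h_i = 1$), the explicit leaf-weight solution $w^* = -G/(H+\lambda)$ and the greedy split criterion can be sketched as follows; the $\lambda$ and $\gamma$ values and the data are illustrative.

```python
import numpy as np

lam, gamma = 1.0, 0.0  # illustrative regularization constants

def leaf_weight(g, h):
    # Explicit solution: w* = -G / (H + lambda), with G, H leaf sums
    return -g.sum() / (h.sum() + lam)

def split_gain(g, h, mask):
    # Gain of a candidate split: 1/2 [score(L) + score(R) - score(parent)] - gamma
    score = lambda g, h: g.sum() ** 2 / (h.sum() + lam)
    return 0.5 * (score(g[mask], h[mask]) + score(g[~mask], h[~mask])
                  - score(g, h)) - gamma

# Squared loss at prediction 0: g_i = -y_i, h_i = 1 (toy targets)
y = np.array([1.0, 1.2, -1.0, -0.8])
g, h = -y, np.ones_like(y)
mask = y > 0                                  # candidate split
```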
Gradient Boosting
Trees and Boosting
Software packages
Ensembles
Model ensembles
Stacking
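The stacking diagram itself is not reproduced in the transcript; the scheme can be sketched as follows (toy data, two hypothetical polynomial base learners): level-1 models produce out-of-fold predictions, and a level-2 model learns to blend them.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(0, 4, 300)
y = np.sin(X) + rng.normal(0, 0.1, 300)

# Two hypothetical base learners: a linear fit and a cubic fit.
def fit_poly(deg):
    return lambda Xtr, ytr: (lambda Z: np.polyval(np.polyfit(Xtr, ytr, deg), Z))

base = [fit_poly(1), fit_poly(3)]

# Out-of-fold predictions: each base model predicts points it was NOT trained on.
folds = np.arange(len(X)) % 5
Z = np.zeros((len(X), len(base)))
for k in range(5):
    tr, te = folds != k, folds == k
    for j, fit in enumerate(base):
        Z[te, j] = fit(X[tr], y[tr])(X[te])

# Level-2 model: least-squares blend of the out-of-fold predictions.
w, *_ = np.linalg.lstsq(Z, y, rcond=None)
blend = Z @ w
```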
Kaggle
Kaggle
Kaggle The Competition
The Task
Kaggle The Competition
Task / Rules
Timeframe: Oct 2015 – Feb 2016
Multiple-choice questions with 4 answers (no-information rate: 25%)
Trainingset: 2500 questions
Validationset: 8192 questions
Public Leaderboard based on 12.5%
Mandatory model submission one week before end
Final testset (12000 new questions) released one week before end
Private Leaderboard based only on these new questions
External data explicitly allowed
800 teams participated in stage I
170 in stage II
Kaggle The Competition
Challenges
What do you need to solve this problem?
External Data (Wikipedia, CK12, etc.)
NLP knowledge
Search infrastructure (Elasticsearch/Lucene)
Feature engineering and machine learning
Kaggle IBM Watson
Invitation for IBM Watson
Kaggle IBM Watson
Decline to Participate
Competition Questions
Examples
Question: A scientist claims he has found a cure for a skin disease. After publishing the results, the experiment was found to be biased. Why did publishing the results allow bias to be recognized within the experiment?
a) It allowed others to replicate the experiment.
b) It helped the scientist gain notoriety within his field.
c) It allowed the cure to be manufactured by the best company.
d) It helped other researchers find out more about the skin disease.
Our Answer: a) (509.6, 461.9, 427.0, 495.6)
Correct Answer: a)
Competition Questions
Examples
Question: What is the primary function of skin cells?
a) to deliver messages to the brain
b) to generate movement of muscles
c) to provide a physical barrier to the body
d) to produce carbohydrates for energy
Our Answer: c) (253.0, 261.9, 302.3, 277.0)
Correct Answer: c)
Competition Questions
Examples
Question: Which of the following would be most useful for calculating the density of a rock sample?
a) microscope and balance
b) graduated cylinder and balance
c) microscope and graduated cylinder
d) beaker and graduated cylinder
Our Answer: b) (267.5, 276.4, 271.3, 275.8)
Correct Answer: b)
Competition Our Approach
Information Retrieval IR
Idea: For each answer a)-d), create pairs of question + answer and score these 4 pairs in a search engine. The pair with the highest score wins.
Example
Put (This is a question) AND (this is an answer) into Google and rank by the number of hits.
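A toy sketch of this idea, assuming a three-sentence stand-in corpus instead of a real search engine: each question+answer pair is scored by counting "documents" that contain words from both the question and the answer, and the highest-scoring answer wins.

```python
# Toy corpus standing in for a real search index (sentences are made up).
corpus = [
    "publishing results allows others to replicate the experiment",
    "a replicated experiment can reveal bias",
    "skin disease is studied by researchers",
]

def score(question, answer):
    # Count corpus sentences containing a question word AND an answer word.
    q, a = set(question.lower().split()), set(answer.lower().split())
    return sum(1 for s in corpus
               if q & set(s.split()) and a & set(s.split()))

question = "why did publishing the results allow bias to be recognized"
answers = ["it allowed others to replicate the experiment",
           "it helped the scientist gain notoriety"]
best = max(answers, key=lambda a: score(question, a))
```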
Competition Our Approach
TF/IDF
Term frequency-inverse document frequency (tf-idf) is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.

Definition

$$\mathrm{TFIDF}(t, d, D) := f_{t,d} \cdot \mathrm{IDF}(t, D),$$

where $f_{t,d}$ is the frequency of term $t$ in document $d$ and

$$\mathrm{IDF}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}$$

with $N$ being the total number of documents in the corpus.
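A direct transcription of this definition, over an invented three-document toy corpus:

```python
import math

def tfidf(t, d, D):
    f_td = d.count(t)                                  # term frequency f_{t,d}
    n_t = sum(1 for doc in D if t in doc)              # documents containing t
    idf = math.log(len(D) / n_t) if n_t else 0.0       # IDF(t, D)
    return f_td * idf

# Toy corpus: each document is a list of tokens.
D = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
```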
Competition Our Approach
BM25
Okapi BM25 (BM stands for Best Matching) is a ranking function used by search engines to rank matching documents according to their relevance to a given search query. It is based on the probabilistic retrieval framework developed in the 1970s and 1980s by Stephen E. Robertson, Karen Spärck Jones, and others.

Definition

$$\mathrm{BM25}(t, d, D) := \mathrm{IDF}(t, D) \cdot \frac{f_{t,d} \cdot (k_1 + 1)}{f_{t,d} + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)},$$

where

$$\mathrm{IDF}(t, D) = \log \frac{N - n(t) + 0.5}{n(t) + 0.5},$$

$|d|$ is the length of document $d$, avgdl the average document length, and typically $k_1 \in [1.2, 2.0]$ and $b = 0.75$.
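The same in code, again over an invented toy corpus; note that this IDF variant goes negative for terms that occur in most documents, which the example below exercises.

```python
import math

k1, b = 1.2, 0.75  # typical parameter values

def bm25(t, d, D):
    N = len(D)
    n_t = sum(1 for doc in D if t in doc)               # n(t)
    idf = math.log((N - n_t + 0.5) / (n_t + 0.5))
    f_td = d.count(t)
    avgdl = sum(len(doc) for doc in D) / N
    return idf * f_td * (k1 + 1) / (f_td + k1 * (1 - b + b * len(d) / avgdl))

# Toy corpus: each document is a list of tokens.
D = [["graduated", "cylinder", "and", "balance"],
     ["microscope", "and", "balance"],
     ["density", "equals", "mass", "over", "volume"]]
```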
Competition Our Approach
Word embeddings
Word embeddings (word2vec, GloVe) are shallow, two-layer neural networks that are trained to reconstruct the linguistic contexts of words: the network is shown a word and must guess which words occurred in adjacent positions in an input text.

They build up a mapping $f_{\mathrm{emb}}: W \to \mathbb{R}^d$, where $d$ typically has size 100 or 300.

One important feature of this mapping is that semantically close words are mapped to similar locations of the $d$-dimensional vector space. This even allows doing some kind of arithmetic on words, e.g.

$$f_{\mathrm{emb}}(\text{Berlin}) - f_{\mathrm{emb}}(\text{Germany}) + f_{\mathrm{emb}}(\text{Italy}) \approx f_{\mathrm{emb}}(\text{Rome})$$
Problem
How to score question + answer? Sum/Average/Weighted by IDF?
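One simple option for the scoring question above: average the word vectors of the question and of each answer, then compare by cosine similarity. The 3-d "embeddings" below are invented stand-ins for real word2vec/GloVe vectors.

```python
import numpy as np

# Toy 3-d "embeddings" (real ones would come from word2vec/GloVe).
emb = {"skin": [1, 0, 0], "cells": [0.9, 0.1, 0], "barrier": [0.8, 0.2, 0],
       "muscles": [0, 1, 0], "brain": [0, 0, 1]}
emb = {w: np.array(v, float) for w, v in emb.items()}

def avg_vec(words):
    vs = [emb[w] for w in words if w in emb]      # skip out-of-vocabulary words
    return np.mean(vs, axis=0) if vs else np.zeros(3)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

q = avg_vec("function of skin cells".split())
answers = {"barrier": "physical barrier", "muscles": "movement of muscles"}
best = max(answers, key=lambda a: cosine(q, avg_vec(answers[a].split())))
```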
Competition Our Approach
PMI
Pointwise mutual information (PMI) is a measure of association used in information theory and statistics.
Definition
$$\mathrm{pmi}(x; y) := \log \frac{p(x, y)}{p(x)\,p(y)} = \log \frac{p(x|y)}{p(x)} = \log \frac{p(y|x)}{p(y)}.$$
Choosing $p(x, y)$ as the probability of the co-occurrence of words $x$ and $y$, we can use this measure to compare each single word in the question to all the words of the answers. The average (or median) of all these scores is then taken as the overall score of a question-answer pair.
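A sketch of this, estimating the probabilities from sentence-level co-occurrence counts in an invented toy corpus:

```python
import math
from collections import Counter
from itertools import combinations

# Toy corpus: PMI is estimated from sentence-level co-occurrence counts.
sentences = [["cylinder", "balance", "density"],
             ["cylinder", "density", "volume"],
             ["balance", "mass"],
             ["brain", "neuron"]]

uni, pair = Counter(), Counter()
for s in sentences:
    words = set(s)
    uni.update(words)
    pair.update(frozenset(p) for p in combinations(sorted(words), 2))

n = len(sentences)

def pmi(x, y):
    pxy = pair[frozenset((x, y))] / n               # p(x, y)
    if pxy == 0:
        return float("-inf")                        # never co-occurred
    return math.log(pxy / ((uni[x] / n) * (uni[y] / n)))
```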
Competition Our Approach
Feature Hashing
Embedding
Use a hashing algorithm of fixed length (say 4096) in order to encode words/sentences as fixed-length vectors.
Learning
(Motivated by T. Mikolov’s negative sampling in word2vec)
Using Quizlet’s flashcards, we generated an extended dataset:
N positive term-definition pairs
3N negative term-definition pairs (i.e. a term paired with a random definition)
Train a binary classifier using XGBoost with max depth 10 and 1000s of rounds.
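The hashing trick behind the embedding step can be sketched like this, using Python's built-in `hash` as a stand-in for a proper hash function; the signed update is a common variant to reduce collision bias.

```python
import numpy as np

DIM = 4096  # fixed hash length, as on the slide

def hash_vector(text):
    # Hashing trick: index each token by hash(token) mod DIM.
    v = np.zeros(DIM)
    for tok in text.lower().split():
        h = hash(tok)                      # stand-in for a proper hash function
        v[h % DIM] += 1 if h >= 0 else -1  # signed update reduces collision bias
    return v

# A term-definition pair becomes one fixed-length feature vector.
term, definition = "mitochondria", "organelle that produces energy"
pair_vec = np.concatenate([hash_vector(term), hash_vector(definition)])
```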
Competition Our Approach
Final Learning
Competition Our Approach
14 days to go. . .
Competition Last minute changes
Large Scale XGBoost
Pushing XGBoost to the limits. . .
50M quizlet cards
3 + 1 negative sampling yields 200M observations
Feature hashing produces a sparse matrix with 2,147,863,398 entries (just above R’s long-vector limit of 2^31 − 1)
Result from XGBoost:
long vectors not supported yet: ../../src/include/Rinlinedfuns.h:137
Running with 150M samples was fine, but needed a fast machine to finish before the competition deadline. . .
Competition Last minute changes
Last Minute Learning
Competition Last minute changes
Public Leaderboard / Model submission deadline
Competition Competitors
Cardal
Competition Competitors
Cardal’s approach
Data sources
Wikipedia, CK12, Quizlet, StudyStack, Saylor, Openstax, UtahOER, misc sources from AI2/Aristo
Processing
Hand-written parsers for all the sources (regex!!)
Uses 4 different stemmers
28 sets of features
Lucene Search/Scoring plus homebrewed search/score
Learning
Gradient boosting
Ensemble of 6 models, each uses its own feature mix
Lots of hand-tuned parameters
Competition Results
Private Leaderboard
Competition Aftermath
Private Leaderboard
Competition Summary
Summary
THANK YOU!