KAGGLE COMPETITIONS
DMITRY LARKO
MY EXPERIENCE
FOR MACHINE LEARNING AT BERKELEY
Hi, here is my short bio.
About 10 years of working experience in the DB/Data Warehouse field.
4 years ago I learned about Kaggle from my dad, who competes on Kaggle as
well (and does it better than me).
My first competition was the Amazon.com - Employee Access Challenge; I placed 10th
out of 1687, learned tons of new ML techniques/algorithms in a month,
and got addicted to Kaggle.
So far I have participated in 25 competitions, finished 2nd twice, and I am among Kaggle's top-
100 Data Scientists.
Currently I'm working as a Lead Data Scientist.
Motivation. Why participate?
• To win – the best possible motivation for a
competition;
• To learn – Kaggle is a good place to learn, and the best
place to learn on Kaggle is the forums of past competitions;
• Looks good on a resume – well… only if you're constantly
winning
How to start?
1. Learn Python
2. Take a MOOC on machine learning (Coursera, Udacity, Harvard)
3. Participate in Kaggle "Getting Started" and "Playground" competitions
4. Visit finished Kaggle competitions and go through the winners' solutions posted on the
competition forums
Kaggle’s Toolset (1 of 2)
• Scikit-Learn (scikit-learn.org). Simple and efficient tools for data
mining and data analysis. Lots of tutorials, and many ML algorithms have
scikit-learn implementations.
• XGBoost (github.com/dmlc/xgboost). An optimized general-purpose
gradient boosting library. The library is parallelized (OpenMP). It
implements machine learning algorithms under the gradient boosting
framework, including generalized linear models and gradient boosted
decision trees (GBDT).
• Theory behind XGBoost: https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf
• Tutorial:
https://www.kaggle.com/tqchen/otto-group-product-classification-challenge/understanding-xgboost-model-on-otto-data
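The gradient-boosted-trees workflow these tools support can be sketched with scikit-learn's GradientBoostingClassifier (XGBoost's XGBClassifier exposes the same fit/predict interface); the synthetic dataset and parameters below are made up for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a competition dataset
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Gradient boosted trees; xgboost.XGBClassifier takes the same role
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
model.fit(X_tr, y_tr)
print(accuracy_score(y_te, model.predict(X_te)))
```

Swapping in `xgboost.XGBClassifier` is usually a one-line change thanks to the shared scikit-learn estimator API.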
Kaggle’s Toolset (2 of 2)
• H2O (h2o.ai). Fast Scalable Machine Learning API. Has state-of-the-art
models like Random Forest and Gradient Boosting Trees. Allows you to
work with really big datasets on a Hadoop cluster. It also works on Spark!
Check out Sparkling Water: http://h2o.ai/product/sparkling-water/
• Neural Nets/Deep Learning. Most Python libraries are built on top of
Theano (http://deeplearning.net/software/theano/):
• Keras (https://github.com/fchollet/keras)
• Lasagne (https://github.com/Lasagne/Lasagne)
• Or TensorFlow:
• Skflow (https://github.com/tensorflow/skflow)
• Keras (https://github.com/fchollet/keras)
Kaggle’s Toolset - Advanced
• Vowpal Wabbit (GitHub) – a fast, state-of-the-art online learner. A
great tutorial on how to use VW for NLP is available here:
https://github.com/hal3/vwnlp
• LibFM (http://libfm.org/) – Factorization machines (FM) are a generic
approach that can mimic most factorization models through feature
engineering. Works great for sparse, wide datasets; has a few
competitors:
• FastFM - http://ibayer.github.io/fastFM/index.html
• pyFM - https://github.com/coreylynch/pyFM
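An FM scores an example as a bias plus a linear term plus pairwise interactions factorized through latent vectors. A minimal NumPy sketch of the prediction formula (all weights below are random placeholders, not a trained model):

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Factorization machine prediction:
    y = w0 + <w, x> + sum_{i<j} <V_i, V_j> x_i x_j,
    computed with the O(k*n) reformulation instead of explicit pairs."""
    linear = w0 + w @ x
    # 0.5 * sum_f [ (sum_i V_if x_i)^2 - sum_i V_if^2 x_i^2 ]
    interactions = 0.5 * np.sum((x @ V) ** 2 - (x ** 2) @ (V ** 2))
    return linear + interactions

# Tiny made-up example: 4 features, 2 latent factors
rng = np.random.RandomState(0)
x = np.array([1.0, 0.0, 1.0, 0.0])  # sparse, one-hot-like input
w0, w = 0.1, rng.randn(4)
V = rng.randn(4, 2)
print(fm_predict(x, w0, w, V))
```

The latent vectors are what let FMs estimate interaction weights even for feature pairs never seen together, which is why they shine on sparse, wide data.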
• Regularized Greedy Forest (RGF) – a tree ensemble learning method that can
be better than XGBoost, but you need to know how to cook it.
• Four leaf clover - +100 to Luck, gives outstanding model performance
and state-of-the-art “golden feature” search technique… just kidding.
Which past competitions to check?
All of them, of course. But if I had to choose, I'd pick these four:
• Greek Media Monitoring Multilabel Classification (WISE 2014). Why?
• An interesting problem with a lot of classes.
• Otto Group Product Classification Challenge. Why?
• The biggest Kaggle competition so far.
• A great dataset to learn and polish your ensembling techniques.
• Caterpillar Tube Pricing. Why?
• A good example of a business problem to solve.
• Needs some work with the data before you can build good models.
• Has a "data leakage" you can find and exploit.
• Driver Telematics Analysis. Why?
• An interesting problem to solve.
• Gives you experience working with GPS data.
Common steps
• Know your data:
• Visualize it;
• Train a linear regression and check the weights;
• Build a decision tree (I prefer to use R's ctree for that);
• Cluster the data and look at what clusters you get out;
• Or train a simple classifier and see what mistakes it makes
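The "check the weights" idea above can be sketched with scikit-learn; the dataset choice (iris) and the standardization step are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Standardize first so the weight magnitudes are comparable across features
data = load_iris()
X = StandardScaler().fit_transform(data.data)
model = LogisticRegression(max_iter=1000).fit(X, data.target)

# A large |weight| means the linear model leans heavily on that feature
for name, w in zip(data.feature_names, np.abs(model.coef_).mean(axis=0)):
    print(f"{name}: {w:.3f}")
```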
• Know your metric.
• Different metrics can give you a clue about how to approach the problem; for
example, for logloss you need to add a calibration step for random forests
• Know your tools.
• Knowing the advantages and disadvantages of the ML algorithms you use
can tell you what to do to solve a problem
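The random forest calibration step mentioned for logloss can be sketched with scikit-learn's CalibratedClassifierCV; the synthetic data and the sigmoid (Platt) calibration choice are assumptions for illustration:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Raw forest probabilities are often too conservative for logloss
rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)

# Wrap the forest in a calibrator fitted on cross-validated predictions
cal = CalibratedClassifierCV(RandomForestClassifier(n_estimators=100, random_state=1),
                             method="sigmoid", cv=3).fit(X_tr, y_tr)

print("raw logloss:       ", log_loss(y_te, rf.predict_proba(X_te)))
print("calibrated logloss:", log_loss(y_te, cal.predict_proba(X_te)))
```

Comparing the two logloss values on held-out data tells you whether calibration actually paid off for your dataset.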
Common steps cont’d
• Getting more from data.
• Domain knowledge
• Feature extraction and feature engineering pipelines. Some ideas:
• Get the most important features and build 2nd-level interactions
• Add new features such as cluster IDs or leaf IDs
• Binarize numerical features to reduce noise
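The cluster-ID and leaf-ID feature ideas can be sketched with scikit-learn (KMeans for cluster IDs, the forest's `apply` method for leaf IDs); the synthetic data is a stand-in:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Cluster the rows and append the cluster ID as a new categorical feature
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
X_aug = np.hstack([X, km.labels_.reshape(-1, 1)])

# Leaf IDs: which leaf each tree routes the row into (one column per tree)
rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
leaf_ids = rf.apply(X)
X_aug2 = np.hstack([X_aug, leaf_ids])
print(X.shape, "->", X_aug2.shape)
```

Both new columns are categorical, so for linear models you would typically one-hot encode them rather than feed the raw IDs.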
• Ensembling.
• A great Kaggle Ensembling Guide by MLWave
Teaming strategy
• Usually we agree to team up at a pretty early phase but keep working
independently to avoid any bias in our solutions
• After teaming up:
• Code sharing can help you learn new things, but it is a waste of
time from a competition perspective;
• Share data – especially some engineered features;
• Combine your models' outcomes using k-fold stacking and build
a second-level meta-learner (stacking);
• Continue to iteratively add new features and build new models
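The k-fold stacking step can be sketched with scikit-learn's cross_val_predict, which produces the out-of-fold predictions the meta-learner trains on; the base models and data below are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict

X, y = make_classification(n_samples=400, random_state=0)

# Level 1: each teammate's model contributes out-of-fold predictions,
# so the meta-learner never sees a prediction made on training data
base_models = [RandomForestClassifier(n_estimators=50, random_state=0),
               LogisticRegression(max_iter=1000)]
cv = KFold(n_splits=5, shuffle=True, random_state=0)
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=cv, method="predict_proba")[:, 1]
    for m in base_models])

# Level 2: a simple meta-learner trained on the stacked predictions
meta = LogisticRegression().fit(meta_features, y)
print(meta_features.shape)
```

At test time each base model is refitted on all training data, its test predictions are stacked the same way, and the meta-learner produces the final submission.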
• Teaming up right before competition end:
• “Black-box” ensembling – linear and not-so-linear blending using the
LB as a validation set.
• Linear: f(x) = alpha*F1(x) + (1-alpha)*F2(x)
• Non linear: f(x) = F1(x)^alpha * F2(x)^(1-alpha)
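The two blending formulas above, as a small NumPy sketch (the probability vectors are made-up stand-ins for two teammates' model outputs):

```python
import numpy as np

def blend_linear(p1, p2, alpha):
    """Linear blend: alpha*F1(x) + (1-alpha)*F2(x)."""
    return alpha * p1 + (1 - alpha) * p2

def blend_geometric(p1, p2, alpha):
    """Non-linear (geometric) blend: F1(x)^alpha * F2(x)^(1-alpha)."""
    return p1 ** alpha * p2 ** (1 - alpha)

# Hypothetical predicted probabilities from two models on 4 test rows
p1 = np.array([0.9, 0.2, 0.6, 0.7])
p2 = np.array([0.8, 0.4, 0.5, 0.9])
print(blend_linear(p1, p2, 0.5))
print(blend_geometric(p1, p2, 0.5))
```

With only the public LB as validation, alpha is typically swept over a coarse grid and the best-scoring value kept, which is exactly why this is "black-box": no shared data or code is needed.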
Q&A
Thank you!