KAGGLE COMPETITIONS
DMITRY LARKO
MY EXPERIENCE
FOR MACHINE LEARNING AT BERKELEY
Hi, here is my short bio.
About 10 years of working experience in the DB/Data Warehouse field.
4 years ago I learned about Kaggle from my dad, who competes on Kaggle as
well (and does it better than me).
My first competition was the Amazon.com - Employee Access Challenge; I placed 10th
out of 1687, learned tons of new ML techniques/algorithms in a month,
and got addicted to Kaggle.
So far I have participated in 25 competitions, finished 2nd twice, and I am among Kaggle's top-
100 Data Scientists.
Currently I'm working as a Lead Data Scientist.
Motivation. Why participate?
• To win – the best possible motivation for a
competition;
• To learn – Kaggle is a good place to learn, and the best
place to learn on Kaggle is the forums of past competitions;
• Looks good on a resume – well… only if you're constantly
winning
How to start?
1. Learn Python
2. Take a MOOC on machine learning (Coursera, Udacity, Harvard)
3. Participate in Kaggle "Getting Started" and "Playground" competitions
4. Visit finished Kaggle competitions and go through the winners' solutions posted on the
competition forums
Kaggle’s Toolset (1 of 2)
• Scikit-Learn (scikit-learn.org). Simple and efficient tools for data
mining and data analysis. Lots of tutorials, and many ML algorithms have
scikit-learn implementations.
• XGBoost (github.com/dmlc/xgboost). An optimized general-purpose
gradient boosting library. The library is parallelized (OpenMP). It
implements machine learning algorithms under the gradient boosting
framework, including generalized linear models and gradient boosted
decision trees (GBDT).
• Theory behind XGBoost: https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf
• Tutorial:
https://www.kaggle.com/tqchen/otto-group-product-classification-challenge/understanding-xgboost-model-on-otto-data
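The gradient-boosted-trees workflow these tools support can be sketched with scikit-learn's GradientBoostingClassifier (XGBoost's XGBClassifier exposes the same fit/predict interface); the synthetic dataset and parameters below are made up for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a competition dataset
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Gradient boosted trees; xgboost.XGBClassifier takes the same role
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
model.fit(X_tr, y_tr)
print(accuracy_score(y_te, model.predict(X_te)))
```

Swapping in `xgboost.XGBClassifier` is usually a one-line change thanks to the shared scikit-learn estimator API.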
Kaggle’s Toolset (2 of 2)
• H2O (h2o.ai). Fast Scalable Machine Learning API. Has state-of-the-art
models like Random Forest and Gradient Boosting Trees. Allows you to
work with really big datasets on a Hadoop cluster. It also works on Spark!
Check out Sparkling Water: http://h2o.ai/product/sparkling-water/
• Neural Nets/Deep Learning. Most Python libraries are built on top of
Theano (http://deeplearning.net/software/theano/):
• Keras (https://github.com/fchollet/keras)
• Lasagne (https://github.com/Lasagne/Lasagne)
• Or TensorFlow:
• Skflow (https://github.com/tensorflow/skflow)
• Keras (https://github.com/fchollet/keras)
Kaggle’s Toolset - Advanced
• Vowpal Wabbit (GitHub) – a fast, state-of-the-art online learner. A
great tutorial on how to use VW for NLP is available here:
https://github.com/hal3/vwnlp
• LibFM (http://libfm.org/) – Factorization machines (FM) are a generic
approach that can mimic most factorization models through feature
engineering. Works great for sparse, wide datasets; has a few
competitors:
• FastFM - http://ibayer.github.io/fastFM/index.html
• pyFM - https://github.com/coreylynch/pyFM
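An FM scores an example as a bias plus a linear term plus pairwise interactions factorized through latent vectors. A minimal NumPy sketch of the prediction formula (all weights below are random placeholders, not a trained model):

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Factorization machine prediction:
    y = w0 + <w, x> + sum_{i<j} <V_i, V_j> x_i x_j,
    computed with the O(k*n) reformulation instead of explicit pairs."""
    linear = w0 + w @ x
    # 0.5 * sum_f [ (sum_i V_if x_i)^2 - sum_i V_if^2 x_i^2 ]
    interactions = 0.5 * np.sum((x @ V) ** 2 - (x ** 2) @ (V ** 2))
    return linear + interactions

# Tiny made-up example: 4 features, 2 latent factors
rng = np.random.RandomState(0)
x = np.array([1.0, 0.0, 1.0, 0.0])  # sparse, one-hot-like input
w0, w = 0.1, rng.randn(4)
V = rng.randn(4, 2)
print(fm_predict(x, w0, w, V))
```

The latent vectors are what let FMs estimate interaction weights even for feature pairs never seen together, which is why they shine on sparse, wide data.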
• Regularized Greedy Forest (RGF) – a tree ensemble learning method that can
be better than XGBoost, but you need to know how to cook it.
• Four leaf clover - +100 to Luck, gives outstanding model performance
and state-of-the-art “golden feature” search technique… just kidding.
Which past competitions to check?
All of them, of course. But if I had to choose, I'd pick these four:
• Greek Media Monitoring Multilabel Classification (WISE 2014). Why?
• An interesting problem with a lot of classes.
• Otto Group Product Classification Challenge. Why?
• The biggest Kaggle competition so far.
• A great dataset to learn and polish your ensembling techniques.
• Caterpillar Tube Pricing. Why?
• A good example of a business problem to solve.
• Needs some work with the data before you can build good models.
• Has a "data leakage" you can find and exploit.
• Driver Telematics Analysis. Why?
• An interesting problem to solve.
• Gives you experience working with GPS data.
Common steps
• Know your data:
• Visualize it;
• Train a linear regression and check the weights;
• Build a decision tree (I prefer to use R's ctree for that);
• Cluster the data and look at what clusters you get out;
• Or train a simple classifier and see what mistakes it makes
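The "check the weights" idea above can be sketched with scikit-learn; the dataset choice (iris) and the standardization step are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Standardize first so the weight magnitudes are comparable across features
data = load_iris()
X = StandardScaler().fit_transform(data.data)
model = LogisticRegression(max_iter=1000).fit(X, data.target)

# A large |weight| means the linear model leans heavily on that feature
for name, w in zip(data.feature_names, np.abs(model.coef_).mean(axis=0)):
    print(f"{name}: {w:.3f}")
```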
• Know your metric.
• Different metrics can give you a clue about how to approach the problem; for
example, for logloss you need to add a calibration step for random forests
• Know your tools.
• Knowing the advantages and disadvantages of the ML algorithms you use
can tell you what to do to solve a problem
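The random forest calibration step mentioned for logloss can be sketched with scikit-learn's CalibratedClassifierCV; the synthetic data and the sigmoid (Platt) calibration choice are assumptions for illustration:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Raw forest probabilities are often too conservative for logloss
rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)

# Wrap the forest in a calibrator fitted on cross-validated predictions
cal = CalibratedClassifierCV(RandomForestClassifier(n_estimators=100, random_state=1),
                             method="sigmoid", cv=3).fit(X_tr, y_tr)

print("raw logloss:       ", log_loss(y_te, rf.predict_proba(X_te)))
print("calibrated logloss:", log_loss(y_te, cal.predict_proba(X_te)))
```

Comparing the two logloss values on held-out data tells you whether calibration actually paid off for your dataset.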
Common steps cont’d
• Getting more from data.
• Domain knowledge
• Feature extraction and feature engineering pipelines. Some ideas:
• Get the most important features and build 2nd-level interactions
• Add new features such as cluster IDs or leaf IDs
• Binarize numerical features to reduce noise
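The cluster-ID and leaf-ID feature ideas can be sketched with scikit-learn (KMeans for cluster IDs, the forest's `apply` method for leaf IDs); the synthetic data is a stand-in:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Cluster the rows and append the cluster ID as a new categorical feature
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
X_aug = np.hstack([X, km.labels_.reshape(-1, 1)])

# Leaf IDs: which leaf each tree routes the row into (one column per tree)
rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
leaf_ids = rf.apply(X)
X_aug2 = np.hstack([X_aug, leaf_ids])
print(X.shape, "->", X_aug2.shape)
```

Both new columns are categorical, so for linear models you would typically one-hot encode them rather than feed the raw IDs.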
• Ensembling.
• A great Kaggle Ensembling Guide by MLWave
Teaming strategy
• Usually we agree to team up at a pretty early phase but keep working
independently to avoid any bias in our solutions
• After teaming up:
• Code sharing can help you learn new things, but it is a waste of
time from a competition perspective;
• Share data – especially some engineered features;
• Combine your models' outcomes using k-fold stacking and build
a second-level meta-learner (stacking);
• Continue to iteratively add new features and build new models
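The k-fold stacking step can be sketched with scikit-learn's cross_val_predict, which produces the out-of-fold predictions the meta-learner trains on; the base models and data below are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict

X, y = make_classification(n_samples=400, random_state=0)

# Level 1: each teammate's model contributes out-of-fold predictions,
# so the meta-learner never sees a prediction made on training data
base_models = [RandomForestClassifier(n_estimators=50, random_state=0),
               LogisticRegression(max_iter=1000)]
cv = KFold(n_splits=5, shuffle=True, random_state=0)
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=cv, method="predict_proba")[:, 1]
    for m in base_models])

# Level 2: a simple meta-learner trained on the stacked predictions
meta = LogisticRegression().fit(meta_features, y)
print(meta_features.shape)
```

At test time each base model is refitted on all training data, its test predictions are stacked the same way, and the meta-learner produces the final submission.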
• Teaming up right before competition end:
• “Black-box” ensembling – linear and not-so-linear blending using the
LB as a validation set.
• Linear: f(x) = alpha*F1(x) + (1-alpha)*F2(x)
• Non linear: f(x) = F1(x)^alpha * F2(x)^(1-alpha)
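The two blending formulas above, as a small NumPy sketch (the probability vectors are made-up stand-ins for two teammates' model outputs):

```python
import numpy as np

def blend_linear(p1, p2, alpha):
    """Linear blend: alpha*F1(x) + (1-alpha)*F2(x)."""
    return alpha * p1 + (1 - alpha) * p2

def blend_geometric(p1, p2, alpha):
    """Non-linear (geometric) blend: F1(x)^alpha * F2(x)^(1-alpha)."""
    return p1 ** alpha * p2 ** (1 - alpha)

# Hypothetical predicted probabilities from two models on 4 test rows
p1 = np.array([0.9, 0.2, 0.6, 0.7])
p2 = np.array([0.8, 0.4, 0.5, 0.9])
print(blend_linear(p1, p2, 0.5))
print(blend_geometric(p1, p2, 0.5))
```

With only the public LB as validation, alpha is typically swept over a coarse grid and the best-scoring value kept, which is exactly why this is "black-box": no shared data or code is needed.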
Q&A
Thank you!