Towards increasing predictability of machine-learning research
Report at CERN, 16 September 2013.
System for displaying ads
on Yandex’s search result pages
and partners’ websites
Ad Targeting Group
Automation of Machine Learning Research
Research with profit
Introduction
R&D best practices
— Modularity
— Computational Measurability
— Transparency and Sharing
— Automation
R&D best practices
— Modularity: units, reuse, abstractions
— Computational Measurability: MDD
— Transparency and Sharing: collaboration
— Automation: …
Happy life principles
— Kindness
— Wholeheartedness
— Love
— Discipline
— Self-development
This is a list of global things, not local (everyday) rules
Automation –
is the use of machines, control systems
and information technologies to reduce
the need for human work to optimize productivity
in the production of goods and services.
Automation –
is the use of information technologies
to optimize productivity and to increase
predictability in research, development
and other projects.
KPI stands for Key Performance Indicators:
— Money, Clicks on Ads
— Comparison with rivals (# of segments we are better)
— Number of Nobel Prizes
— Users & Government Loyalty
— Loglikelihood of prediction
Where does automation stop?
… where real research starts
Where does automation stop?
Intuition
Research
Creativity
Science
Tools
Complex Maths
Automated pipelines
PDEs
Metrics
Validators
SVM
PCA
1. Imagine how simple and agile research
work could be.
2. Believe it is possible, automate as much as you can,
and find the place for research.
Recipe
Task: Ad click probability prediction (binary classification problem)
KPI: Profit, Clicks, Conversions, Loglikelihood
Yandex LLC
Story of automation
Diagram: Classifier (matrixnet), filters, reducers, metrics, simulators, GnuPlot, MapReduce, STORAGE
clipart from http://www.stoneys.ch
clipart from http://clipartov.net
Story of automation
Diagram: Classifier (TMVA, …), filters, reducers, metrics, simulators, GnuPlot, MapReduce, STORAGE, ML Infrastructure, Report, Idea
Pipeline (no automation)
— Prepare raw data set for ML
— Apply filters (cuts) and mappers
— Calculate features
— Assign weights
— Split into train and test sets
— Train the classifier on the training set
— Look at the learning curve and check for overfitting
— Apply the resulting classifier model to the test set
— Calculate metrics and compare with current best
Story of automation
Pipeline (no automation)
— Prepare raw data set for ML
— Apply filters (cuts) and mappers (add new filter)
— Calculate features (add new feature)
— Assign weights (new idea for weighting)
— Split into train and test sets
— Train the classifier on the training set (new training options)
— Look at the learning curve and check for overfitting
— Apply the resulting classifier model to the test set
— Calculate metrics and compare with current best
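The manual steps listed above can be sketched end to end in plain Python. This is only an illustration under stated assumptions, not the pipeline from the talk: the record fields (`age`, `clicked`), the synthetic data, the age cut and the per-bucket click-rate "classifier" are hypothetical stand-ins (the talk itself uses matrixnet).

```python
import math
import random

def make_raw_data(n, seed=0):
    """Synthetic stand-in for the raw ad log (hypothetical fields)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.randint(5, 70)
        p_click = 0.05 + 0.3 * (age > 30)  # in this toy data, users over 30 click more
        rows.append({"age": age, "clicked": int(rng.random() < p_click)})
    return rows

def log_loss(labels, probs, eps=1e-12):
    """Mean negative log-likelihood of binary labels (lower is better)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)

def run_pipeline(rows, seed=1):
    # 1. Apply filters (cuts).
    rows = [r for r in rows if r["age"] > 10]
    # 2. Calculate features (stored on the record dicts themselves).
    for r in rows:
        r["age_bucket"] = r["age"] // 10
    # 3. Assign weights (uniform here; a weighting idea would go in this step).
    for r in rows:
        r["weight"] = 1.0
    # 4. Split into train and test sets.
    random.Random(seed).shuffle(rows)
    cut = int(0.7 * len(rows))
    train, test = rows[:cut], rows[cut:]
    # 5. "Train": weighted click rate per bucket, with add-one smoothing.
    clicks, shows = {}, {}
    for r in train:
        b = r["age_bucket"]
        clicks[b] = clicks.get(b, 0.0) + r["weight"] * r["clicked"]
        shows[b] = shows.get(b, 0.0) + r["weight"]

    def predict(r):
        b = r["age_bucket"]
        return (clicks.get(b, 0.0) + 1.0) / (shows.get(b, 0.0) + 2.0)

    # 6. Apply the model to both sets; a large learn/test gap hints at overfitting.
    report = {}
    for name, part in (("learn", train), ("test", test)):
        report[name] = log_loss([r["clicked"] for r in part],
                                [predict(r) for r in part])
    return report

report = run_pipeline(make_raw_data(2000))
print(report)
```

Every new idea (a filter, a feature, a weighting scheme) means editing one of the numbered steps by hand and rerunning the rest, which is the repetition the talk sets out to remove.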
Story of automation
— Create and commit YAML file
— Read the report
Story of automation
Engine: "matrixnet"  # options: VW, TMVA (TODO!)
Mappers: |
  [
    Join('PLACE FOR NEW FEATURES'),
    Grep('r.Age > 10 and PLACE FOR GREP IDEA'),
    Mapper('r.Weight = PLACE FOR WEIGHT IDEA'),
    yabs.matrixnet.factor.DefaultFactors(),
  ]
MailTo: [email protected]: 'PLACE FOR NEW OPTIONS'
Tables: 'EFHFactors:last_14_days'
Pipeline (with automation)
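One way to read the `Mappers` section of the config: each entry is a small transformation over a stream of records, and the infrastructure chains them in order. The sketch below illustrates that reading only. The names `Grep` and `Mapper` come from the slide, but these implementations, the `Shows` field and the `run_mappers` helper are hypothetical, and `eval`/`exec` are used purely for brevity (acceptable only for trusted config files).

```python
from types import SimpleNamespace

class Grep:
    """Keep only the records for which the expression over `r` is true."""
    def __init__(self, expr):
        self.code = compile(expr, "<grep>", "eval")

    def __call__(self, records):
        for r in records:
            if eval(self.code, {}, {"r": r}):
                yield r

class Mapper:
    """Run a statement such as 'r.Weight = ...' on every record."""
    def __init__(self, stmt):
        self.code = compile(stmt, "<mapper>", "exec")

    def __call__(self, records):
        for r in records:
            exec(self.code, {}, {"r": r})
            yield r

def run_mappers(mappers, records):
    """Chain the mappers lazily, in the order they are listed."""
    for m in mappers:
        records = m(records)
    return list(records)

# Hypothetical stand-in for the parsed 'Mappers' section of the YAML file.
mappers = [
    Grep("r.Age > 10 and r.Shows > 0"),
    Mapper("r.Weight = 1.0 / r.Shows"),
]
records = [
    SimpleNamespace(Age=8, Shows=4),
    SimpleNamespace(Age=25, Shows=4),
    SimpleNamespace(Age=40, Shows=0),
]
result = run_mappers(mappers, records)
print(result)
```

With this shape, trying a new filter or weighting idea is a one-line change in the config rather than a change to pipeline code.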
Story of automation
Diagram: Classifier (TMVA, …), filters, reducers, metrics, simulators, GnuPlot, MapReduce, STORAGE, ML Infrastructure, Report, YAML-file
Story of automation
metric   | learn   | test    | test cur.
---------------------------------------
ll_p     | 0.38171 | 0.36074 | 0.14527
ll_r     | 0.38966 | 0.37151 | 0.33247
f1_p     | 0.44869 | 0.44430 | 0.43266
fom_p    | 0.91526 | 0.90580 | 0.88528
kl_p     | 0.31143 | 0.29581 | 0.13186
log_loss | 0.39965 | 0.40354 | 0.44178
mcc_p    | 0.30788 | 0.30159 | 0.28512
q10_p    | 2.6632  | 2.5994  | 2.5261
q2_p     | 1.6315  | 1.6212  | 1.5886
q_p      | 1.6244  | 1.6089  | 1.5777
Report
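Some rows of the report correspond to standard binary-classification metrics. As a rough sketch, log loss, F1 and the Matthews correlation coefficient can be computed as below. The talk does not define its metric names, so mapping these functions to `log_loss`, `f1_p` and `mcc_p` is an assumption, as is the 0.5 decision threshold; the labels and probabilities are made-up toy values.

```python
import math

def log_loss(labels, probs, eps=1e-12):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)

def confusion(labels, probs, threshold=0.5):
    """Counts of true/false positives/negatives at a given threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(labels, probs):
        predicted = p >= threshold
        if predicted and y:
            tp += 1
        elif predicted:
            fp += 1
        elif y:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def f1_score(labels, probs, threshold=0.5):
    tp, fp, tn, fn = confusion(labels, probs, threshold)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcc(labels, probs, threshold=0.5):
    tp, fp, tn, fn = confusion(labels, probs, threshold)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy data, for illustration only.
labels = [1, 0, 1, 0, 0, 1]
probs = [0.9, 0.2, 0.4, 0.6, 0.1, 0.8]
for name, fn_ in (("log_loss", log_loss), ("f1", f1_score), ("mcc", mcc)):
    print(f"{name:8s} | {fn_(labels, probs):.5f}")
```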
Story of automation
Diagram: ML Infrastructure, Classifier (TMVA, …), filters, reducers, metrics, simulators, GnuPlot, MapReduce, STORAGE
Production
Report (Money, Clicks)
Experiment (1%)
Deploy new model
YAML-file
Report (llp)
Report (Money, Clicks)
Challenges (scientific)
— Multi-armed bandit problem
• Banner is black box with estimated CTR
• Historical data is used for prediction
— Default model bias
• Training set is generated by default model
— Move from KPIs to metrics and cost functions
• Business Strategy (approx) metrics
— Balancing between different cost functions
• Clicks, Money, Conversions, CPA
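The first challenge can be illustrated with a small simulation. The sketch below uses epsilon-greedy, a standard and very simple bandit strategy chosen here only for illustration; the talk does not say which approach Yandex uses. Each banner is treated as a black box whose CTR is estimated from the clicks observed so far, and the "true" CTRs are hypothetical numbers.

```python
import random

def epsilon_greedy(true_ctrs, rounds, epsilon=0.1, seed=0):
    """Simulate banner selection; returns per-banner (shows, clicks)."""
    rng = random.Random(seed)
    n = len(true_ctrs)
    shows = [0] * n
    clicks = [0] * n
    for _ in range(rounds):
        if 0 in shows or rng.random() < epsilon:
            # Explore: show a random banner to keep every estimate fresh.
            banner = rng.randrange(n)
        else:
            # Exploit: show the banner with the best CTR estimated from history.
            banner = max(range(n), key=lambda i: clicks[i] / shows[i])
        shows[banner] += 1
        clicks[banner] += int(rng.random() < true_ctrs[banner])
    return shows, clicks

true_ctrs = [0.02, 0.05, 0.20]  # hypothetical, unknown to the algorithm
shows, clicks = epsilon_greedy(true_ctrs, rounds=20000)
for i, (s, c) in enumerate(zip(shows, clicks)):
    print(f"banner {i}: shows={s} clicks={c} estimated CTR={c / s:.3f}")
```

The same simulation hints at the default model bias: the history mostly contains banners the current policy chose to show, so estimates for the rarely shown banners rest on far fewer observations.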
Challenge (automation): Graphical Pipelines Framework
Simulation data
Experimental data
map
train
Cut by threshold
Show mass distribution
Filter background
Estimate mixture parameters
classify
map
Run
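A pipeline of boxes connected by arrows is a directed acyclic graph, and "Run" amounts to executing the boxes in dependency order. A minimal sketch using the standard-library `graphlib` module (Python 3.9+) follows. The wiring between boxes is not recoverable from the slide, so the node names, the edges and the toy midpoint-threshold "training" below are all assumptions made for illustration.

```python
from graphlib import TopologicalSorter  # Python 3.9+

def run_graph(nodes, edges):
    """nodes: name -> function; edges: name -> list of upstream node names.

    Each node is called with the results of its upstream nodes, in order.
    """
    results = {}
    for name in TopologicalSorter(edges).static_order():
        inputs = [results[dep] for dep in edges.get(name, [])]
        results[name] = nodes[name](*inputs)
    return results

def train(simulation):
    """Toy 'training': a threshold halfway between the two class means."""
    signal = [x for x, is_signal in simulation if is_signal]
    background = [x for x, is_signal in simulation if not is_signal]
    return (sum(signal) / len(signal) + sum(background) / len(background)) / 2

# Hypothetical boxes: labelled simulation data, unlabelled experimental data.
nodes = {
    "simulation": lambda: [(0.2, 0), (0.9, 1), (0.7, 1), (0.1, 0)],
    "experiment": lambda: [0.15, 0.8, 0.95, 0.3],
    "train": train,
    "classify": lambda threshold, data: [x >= threshold for x in data],
    "cut": lambda data, flags: [x for x, keep in zip(data, flags) if keep],
}
# Hypothetical arrows: each key lists the boxes feeding into it.
edges = {
    "train": ["simulation"],
    "classify": ["train", "experiment"],
    "cut": ["experiment", "classify"],
}
results = run_graph(nodes, edges)
print(results["cut"])
```

A graphical framework would let the user draw `nodes` and `edges` instead of typing them; the execution model can stay the same.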
Automation for me is:
— Tools (in TMVA)
What is Automation?
Normalization
Rectangular Cuts
SVM
Boosted Trees
Gaussianisation
PCA
PDE
Decorrelation
Genetic Algorithms
Automation for me is:
— Tools
• Macro language (high level language)
for expressing ideas
What is Automation?
Simulation data
Experimental data
map
train
Filter by threshold
Show mass distribution
Filter background
Estimate mixture parameters
classify
map
Automation for me is:
— Tools
• Macro language (high level language)
for expressing ideas
— Infrastructure
• Connecting with arrows
• Whole pipeline coverage
What is Automation?
Automation for me is:
— Tools
• Macro language (high level language)
for expressing ideas
— Infrastructure
• Connecting with arrows
• Whole pipeline coverage
— Specialization
• Collaboration and delegation
What is Automation?
Automation for me is:
—…
— Specialization
• Collaboration and delegation
What is Automation?
classifier
train set
model
parameters
Parameters
What is Automation?
Comp. Complexity
Model
Proper
Defective
Cost Function
Learning rate
Tree depth
Regularization
Features Types
Number of trees
Automation for me is:
— Tools
• Macro language (high level language) for
expressing ideas
— Infrastructure
• Connecting with arrows
• Whole pipeline coverage
— Specialization
• Collaboration and delegation
What is Automation?
(1) Copy and paste data
— Add new boxes to automated pipeline
— Automate transport between all boxes
— Do not use strange software
Everyday rules: anti-patterns
(2) Execute data pipeline steps manually in a cycle.
— Define new command for this pipeline
— Use standard formats for data streams
— Define needed ‘mappers’ and ‘reducers’ for data
stream and use them
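As a sketch of what "mappers" and "reducers" over a standard stream format might look like, the example below parses tab-separated lines and aggregates shows and clicks per banner. The two-column format (`banner_id<TAB>clicked`) and the function names are hypothetical; in a real MapReduce job the framework, not `sorted`, would group the keys.

```python
import io
from itertools import groupby
from operator import itemgetter

def map_clicks(lines):
    """Mapper: parse 'banner_id<TAB>clicked' into (key, (shows, clicks))."""
    for line in lines:
        banner_id, clicked = line.rstrip("\n").split("\t")
        yield banner_id, (1, int(clicked))

def reduce_ctr(pairs):
    """Reducer: sum shows and clicks per banner and compute the CTR."""
    ordered = sorted(pairs, key=itemgetter(0))
    for banner_id, group in groupby(ordered, key=itemgetter(0)):
        shows = clicks = 0
        for _, (s, c) in group:
            shows += s
            clicks += c
        yield banner_id, shows, clicks, clicks / shows

# In practice the stream would be a file or sys.stdin; StringIO keeps this self-contained.
stream = io.StringIO("b1\t0\nb2\t1\nb1\t1\nb2\t1\nb1\t0\n")
rows = list(reduce_ctr(map_clicks(stream)))
for row in rows:
    print("\t".join(str(x) for x in row))
```

Because both functions read and write plain record streams, they can be chained into one command and rerun without any manual step in between.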
Everyday rules: anti-patterns
(3) Your code is >3 times longer than its natural-language
description
— Start working on new tools (macro languages, DSL)
Everyday rules: anti-patterns
(4) It takes >1 man-hour to recalculate the final graph of
your research
— Automate the whole pipeline
Everyday rules: anti-patterns
(5) You write a line of code that has no chance of being
executed >10,000 times
Everyday rules: anti-patterns
Code (>10000 times)
    from numpy import argsort, corrcoef, linalg, mean, transpose

    def pca(data, reduce_dims=0, corr=True, normalise=False, subtract_mean=True):
        data_mean = None
        if subtract_mean:
            # Centre each column on zero
            data_mean = mean(data, axis=0)
            data -= data_mean
        transposed = transpose(data)
        cov_matrix = corrcoef(transposed)
        # Compute eigenvalues and sort into descending order
        eigen_vals, eigen_vecs = linalg.eig(cov_matrix)
        indices = argsort(eigen_vals)
        indices = indices[::-1]
        eigen_vecs = eigen_vecs[:, indices]
        eigen_vals = eigen_vals[indices]
Interactive Data Analysis (once)
    data = filter(data, "RegionID = 213")
    data1, data2 = split_random(data)
    data2ext = decorrelate(data1, data2, fields=["age", "income", …])
    report = check_features(data2ext)
    show_report(report)
(5) You write a line of code that has no chance of being
executed >10,000 times
Choose one action at a time, (A) or (B):
A. Interactive data analysis using high-level tools
B. Coding: extending/improving the tools library or infrastructure. Delegate it?
There are no other options.
Everyday rules: anti-patterns
(6) Your colleagues think that you are doing something
useless
— Stop doing questionable things
Everyday rules: anti-patterns
(7) You have a dream, and it hasn’t come true yet
— Tell Yandex about your dream
Everyday rules: anti-patterns