From Data to Deployment: Full Stack Data Science


TRANSCRIPT

From Data to Deployment:

Full Stack Data Science

Ben Link, Data Scientist

Indeed is the #1 external source of hire

64% of US job searchers search on Indeed each month

80.2M unique US visitors per month

16M jobs

50+ countries

28 languages

200M unique visitors

[Chart: Unique Visitors (millions), 2009 to 2015, y-axis 0 to 200]

We help people get jobs.

Data Science @ Indeed

Applicant Quality

[Diagram: the Application Model takes a Job / Employer and a Resume / Job Seeker and predicts: Good Fit?]

What does a data scientist do at Indeed?

[Workflow diagram, built up over several slides as one continuous cycle:]

Hypothesis Formulation → Gather Data → Explore Data → Label Data → Analyze Labels → Generate Features → Analyze Features → Prototype Models → Evaluate Model → Model Review → Choose Final Parameters → Label Hold-out Data → Deploy Model → A/B Test Model → Monitor Model → Repeat

Full-stack data scientists:

1. Prevent handoff mistakes
2. Can contribute on any team
3. Have the big picture in mind

1. Prevent handoff mistakes

[Architecture diagrams, evolving across slides: the first model lives in IPython, with raw data pulled from a DB, feature extraction, and model building in one notebook. Moving into the web infrastructure, feature extraction and the model run against the same DB; then a data service supplies JSON data backed by NoSQL; then a new service hosts the model; finally feature extraction is reimplemented in Java. Every re-implementation along the way is a chance for a handoff mistake.]

2. Contribute on any team

Drive logging of data

Drive product decisions using external data

Get the first data science solution into production quickly

Iterate on existing solutions

Recognize deployment costs during feature / model development

3. Think big

Focus on the right problem

Stay aware of the big picture

Practical Data Science

Job Description Classifiers

Predicting (min) years of experience from a job description

Simple features for first models:

{ 'regex:5+': 1, 'tfidf:expert': 1.75, 'tfidf:advanced': 0.93, 'tfidfBigram:5 years': 2.25 }
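As a sketch of how such a dict might be assembled (illustrative Python, not Indeed's code; the tf-idf weights are assumed to come precomputed from a fitted vectorizer):

import re

def extract_simple_features(text, tfidf_weights):
    """Build a sparse feature dict for one job description."""
    features = {}
    # Binary regex feature: does the text contain "5+" (as in "5+ years")?
    if re.search(r"5\+", text):
        features["regex:5+"] = 1
    # tfidf_weights maps terms/bigrams to scores from a fitted vectorizer.
    for term, weight in tfidf_weights.items():
        prefix = "tfidfBigram" if " " in term else "tfidf"
        features[prefix + ":" + term] = weight
    return features

# extract_simple_features("Expert wanted, 5+ years experience",
#                         {"expert": 1.75, "5 years": 2.25})
# -> {'regex:5+': 1, 'tfidf:expert': 1.75, 'tfidfBigram:5 years': 2.25}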

Label data before, during, and after you build a model

Extract features in one place

Reuse your model building code

Release softly and log everything

Validate and review every model

Monitor after deploying

Retrain when needed

Label data before, during, and after you build a model

The best way to understand your problem is to label your own data

The fastest way to get labels for your data is to label your own data

The easiest way to know your labels are consistent is to label your own data

Labeling encourages feature development

Labeling creates a human performance benchmark

Labeling throughout gives you indications of shifting data

Is the job part time, full time, or both?

Sometimes you don't need much data

You only need to do better than a simple heuristic

[Learning curve: score (0.84 to 1.00) vs. training samples (0 to 7000), showing training score and cross-validation score]
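A quick way to reproduce this check is scikit-learn's learning_curve (a minimal sketch; X and y stand for your labeled feature matrix and labels):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Score the model at increasing training-set sizes with 5-fold CV.
sizes, train_scores, cv_scores = learning_curve(
    RandomForestClassifier(n_estimators=100),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 8),
    cv=5,
)
# If the CV curve has flattened and already beats the heuristic,
# collecting more labels won't buy much.
print(sizes, cv_scores.mean(axis=1))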

Now train others to label, or use experts

Check their consistency

Can build next generation model quickly

Always flag weird data
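One simple consistency check (my example, not from the slides) is inter-annotator agreement, e.g. Cohen's kappa on a shared sample:

from sklearn.metrics import cohen_kappa_score

# Two labelers annotate the same jobs; kappa near 1.0 means consistent
# labels, near 0.0 means agreement no better than chance.
labeler_a = ["full_time", "part_time", "both", "full_time"]
labeler_b = ["full_time", "part_time", "full_time", "full_time"]
print(cohen_kappa_score(labeler_a, labeler_b))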

Extract features in one place

[Pipeline diagram: Feature Extraction → Features → Model Builder → Model → Model Predictor → Predictions]

Prevents feature inconsistency between train / serve time

Allows faster feature iteration

Encourages feature extraction reuse

Deploy feature extraction services


Job Description → Feature Extractor →

{ "tfidf:experience": 0.007,
  "bigramTfidf:5 years": 0.049,
  "bigramTfidf:experience in": 0.006,
  "tfidf:expert": 0.026,
  "averageWordLength": 5.506,
  "tfidf:2": 0.017,
  "tfidf:5": 0.029,
  "tfidf:years": 0.017,
  ... }
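A minimal sketch of "one place" in practice (names are illustrative, not Indeed's API): the model builder and the model predictor both call the same extractor and exchange its output as JSON.

import json

class JobDescriptionFeatureExtractor:
    """Single source of truth for features at train and serve time."""

    def __init__(self, tfidf_vectorizer):
        self.tfidf = tfidf_vectorizer  # a fitted sklearn TfidfVectorizer (assumed)

    def extract(self, job_description):
        words = job_description.split()
        features = {"averageWordLength":
                    sum(len(w) for w in words) / max(1, len(words))}
        row = self.tfidf.transform([job_description])
        names = self.tfidf.get_feature_names_out()
        for j in row.nonzero()[1]:
            features["tfidf:" + names[j]] = float(row[0, j])
        return features

# Serialized once, consumed by both sides:
# json.dumps(extractor.extract(jd)) -> {"tfidf:experience": 0.007, ...}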

Reuse your model building code

[Diagram: Features → Model Builder → Model]

The Model Builder handles:
● feature sampling
● feature scaling
● feature selection
● test/train splits
● cross validation
● generate plots
● email results
● export model

input_file=job_description_years_exp.gz
output_dir=output/job_description_years_exp_model_builds
model_name=JobExperience
model_version=1.2
model_type=RandomForestClassifier
model_params=[{'n_estimators':[100, 125, 150], 'max_depth':[3, 4, 5, 6]}]
downsampling_ratio=1.75
use_feature_selection=True
feature_selection_variance_retained=0.9
plot_learning_curve=True
mail_to=benjaminl@indeed.com
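Under the hood, a run like this plausibly maps onto scikit-learn primitives (a sketch under that assumption; X and y are the loaded features and labels):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# model_type / model_params from the properties file above
search = GridSearchCV(
    RandomForestClassifier(),
    param_grid={"n_estimators": [100, 125, 150], "max_depth": [3, 4, 5, 6]},
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
# ...then generate plots, email results, and export the chosen model.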

[ROC curve: True Positive Rate vs. False Positive Rate]

Feature Name         Feature Importance
experience           0.27
5 years              0.19
experience in        0.17
expert               0.16
averageWordLength    0.11
years                0.08
...                  ...

Class         Precision  Recall  F1-Score  Support
1.0           0.92       0.90    0.91      353
2.0           0.87       0.92    0.90      310
5.0           0.90       0.86    0.88      213
avg / total   0.90       0.90    0.90      876

Output your models into a standard format

Deploy quickly

[Diagram: the Model Predictor loads the Model, runs Feature Extraction, and returns Predictions]
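For a pure-Python stack, "a standard format" could be as simple as a joblib artifact named by model and version (the talk doesn't name a specific format; this is one option, and the filename is hypothetical):

import joblib
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier().fit(X_train, y_train)  # training data assumed
joblib.dump(model, "JobExperience-1.2.joblib")          # versioned artifact
restored = joblib.load("JobExperience-1.2.joblib")      # what the predictor loads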

Putting it all together

[Pipeline diagram: Feature Extraction → Features → Model Builder → Model → Model Predictor → Predictions]

Release softly and log everything

[Screenshots: Proctor test configuration for "viewjobeval_en_US, JUDY-419: Proctor test for viewjob evaluation test", showing control and test1 groups, with test1 ramped to 50%]

Log everything

[Example logged prediction, one URL-encoded line: uid=1b0un002j1jfi8mp&type=judyQoaEvalFeatures&appdcname=aus&appinstance=judy&tk=1b0un002d1jfid0o&locale=en_US&f.jdTfidf%3A794=0.0793...&f.candidateResumeRead=0.0&f.jobApplicantDistance=25000.0&f.numMonthsExperience=134.0&... (every feature value used for the prediction is logged)]
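Because each entry is URL-encoded key=value pairs, a later build can recover the exact feature vector from any line (a sketch; log_line stands for one raw entry):

from urllib.parse import parse_qs

params = parse_qs(log_line)  # decodes e.g. f.jdTfidf%3A794 -> "f.jdTfidf:794"
features = {
    key[len("f."):]: float(values[0])
    for key, values in params.items()
    if key.startswith("f.")
}
# features["jdTfidf:794"], features["numMonthsExperience"], ... are now
# ready to be reused as training data or compared across time windows.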

Reuse logs for future models

Logs give us insight into changing data

Logs allow us to see what went wrong

Validate and review every model

Quantitative Validation

Training Set
class        precision  recall  f1-score  support
0.0          1.00       1.00    1.00      448
1.0          0.99       1.00    1.00      663
2.0          1.00       0.98    0.99      269
avg / total  1.00       1.00    1.00      1380

[ 2015-12-15 21:42:27,537 INFO ] [indeed.model_builder]

Test Set
class        precision  recall  f1-score  support
0.0          0.85       0.90    0.87      146
1.0          0.92       0.96    0.94      226
2.0          0.91       0.70    0.79      88

[ROC curve: True Positive Rate vs. False Positive Rate]
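These reports come straight out of standard tooling; a sketch with scikit-learn (model, X_test, and y_test assumed from the build above):

from sklearn.metrics import classification_report, roc_auc_score

print(classification_report(y_test, model.predict(X_test)))
# Multiclass ROC AUC, one-vs-rest:
print(roc_auc_score(y_test, model.predict_proba(X_test), multi_class="ovr"))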

Qualitative Validation

Review your Models

Another perspective

Transparency and Reproducibility

Awareness

1. Context
2. Data
3. Response variable
4. Features
5. Model selection and performance
6. Transparency and recommendations

Context

What should this model enable us to do (highlighting, filtering, sorting, etc.)?

What products / interfaces / workflows will initially use this model?

Data

What queries and filters were used?

From what time range did your data originate?

Did you sample your dataset?

Response variable

How was the response variable labeled or collected?

What do the model outputs (predictions) represent, and how should they be scaled or thresholded?

Features

How were your features generated?

Which features were most important?

Model selection and performance

Performance reports on train / test sets

Overall CV search strategy and scoring function

Other performance tests (e.g. newer hold-out sets, stress testing)

Expected model performance

Transparency and recommendations

Properties files for Model Builder

Link to branch of Model Builder code

Examples of Model Predictions

Possible directions for future improvements

A couple of sentences on why you think the model is ready for production

Monitor after deploying

Features and data are hard dependencies

You need a post-deploy plan

Use log data to check for feature changes

[Histogram: bucket counts for the feature tfidf:`excel`]

Test Name  ttest_ind  ks_2samp  mannwhit  levene    ranksums
p-value    3.79e-09   0.00021   8.41e-05  3.79e-09  0.00017
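A sketch of how such a battery might be run with SciPy (before and after are arrays of a feature's logged values from two time windows; the names are assumptions):

from scipy import stats

for name, test in [
    ("ttest_ind", stats.ttest_ind),
    ("ks_2samp", stats.ks_2samp),
    ("mannwhit", stats.mannwhitneyu),
    ("levene", stats.levene),
    ("ranksums", stats.ranksums),
]:
    # A tiny p-value suggests the feature's distribution has shifted.
    print(name, test(before, after).pvalue)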

Check prediction class distributions

Retrain when needed

Every model should be validated; retraining is time-expensive

Use feature monitoring to determine feature stability

Choose less sensitive features

Avoid counts

Full stack data scientists

Full stack data science organizations

More Indeed Engineering

Careers: indeed.jobs
Twitter: @IndeedEng
Engineering Blog & Talks: indeed.tech
Open Source: opensource.indeedeng.io

Questions?

Label data before, during, and after you build a model

Extract features in one place

Reuse your model building code

Release softly and log everything

Validate and review every model

Monitor after deploying

Retrain when needed
