could you be a data scientist? quantify data scientist profiles using machine learning and linkedin...

Post on 27-Jan-2015

DESCRIPTION

Short presentation about my final project at Zipfian Academy on quantifying data scientist profiles using LinkedIn data. The prototype web app is available at: bit.ly/cybads

TRANSCRIPT

Could You Be a Data Scientist?

Carlo Torniai, Ph.D. @carlotorniai

Goal

• Quantify the features of data scientist profiles
• Analyze aspiring data scientist profiles
• Provide useful feedback

Why is this relevant?

• A quantitative characterization of data scientist profiles can help close the loop between job seekers and recruiters

Image: http://www.getelastic.com/wp-content/uploads/puzzle1.jpg

Data Collection · Data Analysis · Feature Extraction · Model Testing · Data Product

• LinkedIn API:
  – General information
  – Past work history
  – Education

• Web scraping:
  – Skills

• 1500 profiles:
  – Data Scientists
  – Software Engineers
  – Business Analysts
  – Mathematicians
  – Statisticians

[Figure panels: Business Analysts, Data Scientists, Software Engineers, Statisticians, Mathematicians]

[Figure: Number of PhDs by topic and profile. Topics: Bioinformatics, Biology, Computer Science, Economics, Electronics, Astronomy, Math, Neuroscience, Other, Physics, Psychology, Stats, Engineering]

For this project I trained the following models on skills and education features:

• Random Forest: classify the profile
• Naïve Bayes: multi-class probabilities to assess the background components of a profile
• K-means: suggest similar and relevant profiles

| Model | Training set | Purpose |
|---|---|---|
| Random Forest | All 5 categories | Classify the profile |
| Naïve Bayes | 4 classic categories: SE, BA, MT, ST | Assess the background components of a profile with multi-class probabilities |
| K-means | All 5 categories | Identify similar profiles |
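To make the Naïve Bayes row more concrete, here is a minimal sketch of how multi-class probabilities over the four classic categories could be computed from a list of skills. It is a dependency-free illustration, not the project's code (the talk lists scikit-learn for the actual models). The training profiles, the skill names, the smoothing value `alpha`, and the helper names `train_naive_bayes` and `class_probabilities` are all hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data: (category, skills) pairs. Not the talk's dataset.
TRAIN = [
    ("SE", ["java", "git", "python"]),
    ("SE", ["java", "c++", "git"]),
    ("BA", ["excel", "sql", "requirements"]),
    ("BA", ["excel", "powerpoint", "sql"]),
    ("MT", ["proofs", "matlab", "latex"]),
    ("MT", ["latex", "proofs", "topology"]),
    ("ST", ["r", "regression", "sas"]),
    ("ST", ["r", "regression", "sql"]),
]


def train_naive_bayes(examples):
    """Count profiles per class and skill occurrences per class."""
    class_counts = Counter(label for label, _ in examples)
    skill_counts = defaultdict(Counter)
    vocab = set()
    for label, skills in examples:
        skill_counts[label].update(set(skills))
        vocab.update(skills)
    return class_counts, skill_counts, vocab


def class_probabilities(skills, model, alpha=1.0):
    """Return P(class | skills) using multinomial NB with Laplace smoothing."""
    class_counts, skill_counts, vocab = model
    total = sum(class_counts.values())
    log_scores = {}
    for label, n in class_counts.items():
        denom = sum(skill_counts[label].values()) + alpha * len(vocab)
        score = math.log(n / total)  # class prior
        for skill in skills:
            if skill in vocab:  # ignore skills never seen in training
                score += math.log((skill_counts[label][skill] + alpha) / denom)
        log_scores[label] = score
    # Normalize in log space for numerical stability.
    top = max(log_scores.values())
    exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(exp_scores.values())
    return {c: v / z for c, v in exp_scores.items()}


model = train_naive_bayes(TRAIN)
profile = ["python", "r", "sql", "regression"]  # hypothetical aspiring data scientist
probs = class_probabilities(profile, model)
for label, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{label}: {p:.2f}")
```

Training on only the four classic categories means the output reads as a background mix: a data scientist profile is described by how much it resembles each of SE, BA, MT, and ST. With this toy data the statistician component dominates.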

bit.ly/cybads

Naïve Bayes multi-class probabilities

Random Forest

K-means clustering
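As an illustration of how K-means clustering can be used to suggest similar profiles, the sketch below runs plain Lloyd's algorithm on binary skill vectors and returns the profiles that share a cluster with a query. This is an assumption about the general approach, not the project's implementation (the talk used scikit-learn). The skill vocabulary, the profile names, the choice of `k=2`, and the helpers `kmeans` and `similar_profiles` are hypothetical.

```python
import random

# Hypothetical skill vocabulary and profiles (1 = skill listed). Not the talk's data.
SKILLS = ["python", "r", "sql", "java", "excel"]
PROFILES = {
    "alice": [1, 1, 1, 0, 0],
    "bob":   [1, 1, 0, 0, 0],
    "carol": [0, 0, 1, 0, 1],
    "dave":  [0, 0, 1, 1, 1],
    "erin":  [1, 0, 0, 1, 0],
    "frank": [0, 0, 0, 1, 0],
}


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: assign points to the nearest centroid, then recompute means."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def similar_profiles(query, profiles, centroids):
    """Names of the stored profiles that fall in the same cluster as query."""
    target = nearest(query, centroids)
    return [name for name, vec in profiles.items() if nearest(vec, centroids) == target]


centroids = kmeans(list(PROFILES.values()), k=2)
query = [1, 1, 1, 0, 0]  # hypothetical new profile: python, r, sql
print(similar_profiles(query, PROFILES, centroids))
```

The cluster assignment depends on the random initialization, so the seed is fixed to keep runs repeatable. A real feature set would also include education, as the talk describes.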

Next Steps

• Get more data:
  – Other websites
  – Indeed
  – User input on the web app

• Extend the feature set and test more models:
  – Fine-grained parsing of education
  – Experiment with additional features (industry, years of experience)
  – Fuzzy C-means

• Add interactive data collection

• Personalized links for skills

• Explanation of similarity results

• Close the loop by analyzing job offers and suggesting matching profiles

Thank you!

Technologies

Web App: Flask, jQuery, Vega, MongoDB

Models (NMF, HC, RF, DT, NB, K-means): scikit-learn

Visualizations: Vincent, Vega, NetworkX, Gephi

Acknowledgements

yatish27: Ruby LinkedIn public profile web scraper

ozgut: LinkedIn API Python wrapper
