Sung Park PREDICT 422 Group Project Presentation

TEXT MINING DATA SCIENCE JOBS IN R. Sung Park, MSPA Candidate. August 20, 2015. Northwestern University, PREDICT 422-DL Section 55.

Upload: sung-park

Post on 11-Apr-2017


TRANSCRIPT

Page 1: SUNG PARK PREDICT 422 Group Project Presentation

TEXT MINING DATA SCIENCE JOBS IN R

Sung Park, MSPA Candidate
August 20, 2015

Northwestern University
PREDICT 422-DL Section 55


Page 2: SUNG PARK PREDICT 422 Group Project Presentation

SUMMARY
• Introduction
• Resources
• Data Source
• Data Extraction
• Data Preparation
• Supervised Learning


Page 3: SUNG PARK PREDICT 422 Group Project Presentation

INTRODUCTION
• Exploration of web scraping and text mining capabilities in R
• Unstructured data
• Kaggle.com job postings
• Classification using machine learning algorithms
• Data scientists vs. non-data scientists


Page 4: SUNG PARK PREDICT 422 Group Project Presentation

RESOURCES
• Text Analytics Tutorial in R
  • Timothy D’Auria, Boston Decision, LLC
  • https://www.youtube.com/watch?v=j1V2McKbkLo
• Web Scraping Tutorial in R
  • Sharon Machlis, Computerworld
  • https://www.youtube.com/watch?v=TPLMQnGw0Vk
• Data Science in R: A Case Study Approach to Computational Reasoning and Problem Solving
  • Deborah Nolan and Duncan Temple Lang
• Google and Stack Overflow


Page 5: SUNG PARK PREDICT 422 Group Project Presentation

DATA SOURCE
• Kaggle.com/jobs
• August 17, 2015
• 1,025 Job Postings
• Data Scientist
• Big Data Engineer
• Data Science Architect
• Data Analyst
• Marketing Analyst
• Statistician
• Data Science Director


Page 6: SUNG PARK PREDICT 422 Group Project Presentation

DATA EXTRACTION
• Extracted job links
  • XML Package
  • xpathSApply(doc, "//h3/a/@href[starts-with(., '/jobs')]")
• Extracted job posting text
  • rvest Package
  • html_text(html_nodes(htmlpage, "div.postcontent"))
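The two extraction calls above can be tried without network access. The sketch below is only an illustration: the inline HTML strings are hypothetical stand-ins for the Kaggle listing and posting pages (whose real markup is not shown on the slide), and only the XPath expression and the CSS selector come from the slide.

```r
library(XML)    # htmlParse(), xpathSApply()
library(rvest)  # read_html(), html_nodes(), html_text()

# Hypothetical listing page standing in for the jobs index.
listing_html <- '
<html><body>
  <h3><a href="/jobs/101/data-scientist">Data Scientist</a></h3>
  <h3><a href="/jobs/102/marketing-analyst">Marketing Analyst</a></h3>
  <h3><a href="/about">About</a></h3>
</body></html>'

# Parse the listing and keep only the hrefs that start with "/jobs".
doc <- htmlParse(listing_html, asText = TRUE)
job_links <- as.character(
  xpathSApply(doc, "//h3/a/@href[starts-with(., '/jobs')]")
)

# Hypothetical posting page; the slide's selector targets div.postcontent.
posting_html <- paste0(
  '<html><body><div class="postcontent">',
  'We need a data scientist fluent in R.',
  '</div></body></html>'
)
htmlpage <- read_html(posting_html)
posting_text <- html_text(html_nodes(htmlpage, "div.postcontent"))

print(job_links)
print(posting_text)
```

In a real run, each entry of `job_links` would presumably be appended to the site's base URL and fetched in a loop; that step is not described on the slide and is left out here.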


Page 7: SUNG PARK PREDICT 422 Group Project Presentation

DATA PREPARATION
• Cleaned the text data
• tm Package
• tm_map()
  • Remove punctuation
  • Remove white space
  • Lower-casing
  • Remove stopwords
    • “a”, “the”, “and”, “but”, etc.
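A minimal sketch of the cleaning steps listed above, using tm_map() from the tm package. The two example postings are made up, and the order of the transformations is an assumption: lower-casing runs before stopword removal because the built-in stopword list is lower-case, and whitespace is stripped last to collapse the gaps left by removed words.

```r
library(tm)

# Made-up postings standing in for the scraped job text.
postings <- c(
  "We are hiring a Data Scientist, with R and Python experience!",
  "The Marketing   Analyst will build reports; Excel is a must."
)

corpus <- VCorpus(VectorSource(postings))

corpus <- tm_map(corpus, removePunctuation)                  # drop , ; ! .
corpus <- tm_map(corpus, content_transformer(tolower))       # lower-casing
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # "a", "the", "and", ...
corpus <- tm_map(corpus, stripWhitespace)                    # collapse repeated spaces

# Pull the cleaned text back out as a plain character vector.
cleaned <- unname(vapply(corpus, as.character, character(1)))
print(cleaned)
```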


Page 8: SUNG PARK PREDICT 422 Group Project Presentation

DATA PREPARATION
• Created the term document matrix (TDM)
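As an illustration of what a term document matrix looks like, the sketch below builds one with tm's TermDocumentMatrix() from three made-up, already-cleaned postings. The documents are hypothetical, and how the presentation reduced its vocabulary to the reported number of terms is not stated, so no term filtering is shown.

```r
library(tm)

# Made-up, already-cleaned postings.
docs <- c(
  "data scientist machine learning python",
  "marketing analyst excel reports",
  "data scientist statistics machine learning"
)

corpus <- VCorpus(VectorSource(docs))

# Rows are terms, columns are documents, cells are term counts.
tdm <- TermDocumentMatrix(corpus)
m <- as.matrix(tdm)
print(m)

# Most classifiers expect one row per document, so transpose for modeling.
doc_term <- as.data.frame(t(m))
print(dim(doc_term))
```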


Page 9: SUNG PARK PREDICT 422 Group Project Presentation

DATA PREPARATION
• TDM consists of 959 job postings and 73 terms
  • 375 data scientists and 584 non-data scientists
• Split TDM into training set and test set
  • 864 job postings in training sample
  • 95 job postings in test sample
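The split sizes above can be reproduced with a simple random partition. This is a sketch under assumptions: the slide does not say how the split was drawn, so random sampling without replacement is assumed, and the feature matrix is filled with placeholder values rather than real term counts.

```r
set.seed(422)  # arbitrary seed, only for reproducibility of this sketch

n_docs  <- 959
n_train <- 864

# Class labels with the counts reported on the slide.
labels <- factor(c(rep("data_scientist", 375), rep("other", 584)))

# Placeholder features (5 fake binary terms), NOT the real 73-term matrix.
features <- matrix(rbinom(n_docs * 5, 1, 0.3), nrow = n_docs)

# Randomly choose the training rows; everything else is the test set.
train_idx <- sample(n_docs, n_train)

train_x <- features[train_idx, , drop = FALSE]
test_x  <- features[-train_idx, , drop = FALSE]
train_y <- labels[train_idx]
test_y  <- labels[-train_idx]

cat("training rows:", nrow(train_x), " test rows:", nrow(test_x), "\n")
```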


Page 10: SUNG PARK PREDICT 422 Group Project Presentation

SUPERVISED LEARNING
• K-Nearest Neighbor
• Find the K value with the highest classification accuracy
• K=8 shows the best result with 82.98% accuracy rate
• Confusion matrix shows the model correctly predicted 22 out of 35 data scientist job postings
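The search for the best K can be sketched with knn() from the class package. Everything about the data here is an assumption: the term counts are simulated, the term names are invented, and the range of K values tried is arbitrary, so the numbers it prints will not match the slide.

```r
library(class)  # knn()

set.seed(422)
n <- 300
is_ds <- rep(c(TRUE, FALSE), times = c(120, 180))

# Simulated term counts: data scientist postings mention the first two
# hypothetical terms more often and the third one less often.
x <- cbind(
  machine  = rpois(n, ifelse(is_ds, 3, 0.5)),
  learning = rpois(n, ifelse(is_ds, 2, 0.5)),
  excel    = rpois(n, ifelse(is_ds, 0.5, 2))
)
y <- factor(ifelse(is_ds, "data_scientist", "other"))

train_idx <- sample(n, 270)

# Test-set accuracy for each candidate K.
k_values <- 1:15
accuracy <- sapply(k_values, function(k) {
  pred <- knn(x[train_idx, ], x[-train_idx, ], cl = y[train_idx], k = k)
  mean(pred == y[-train_idx])
})
best_k <- k_values[which.max(accuracy)]

# Confusion matrix for the best K: rows are predictions, columns are truth.
best_pred <- knn(x[train_idx, ], x[-train_idx, ], cl = y[train_idx], k = best_k)
confusion <- table(predicted = best_pred, actual = y[-train_idx])

print(best_k)
print(confusion)
```

Picking K by test-set accuracy, as in this sketch, tends to give an optimistic accuracy estimate; cross-validation on the training set (for example with knn.cv() from the same package) is a common alternative.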


Page 11: SUNG PARK PREDICT 422 Group Project Presentation

SUPERVISED LEARNING
• Classification Decision Tree (Gini index)
• The classification accuracy rate is 96.8%
• Confusion matrix shows the model correctly predicted 30 out of 33 data scientist job postings
• Key terms for tree construction:
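A classification tree like the one summarized on this slide could be fit roughly as follows. This is a sketch under assumptions: the slide does not name the package, so rpart is assumed here, and the data are simulated with invented term names, so the accuracy and the terms it reports are not the presentation's results.

```r
library(rpart)

set.seed(422)
n <- 300
is_ds <- rep(c(TRUE, FALSE), times = c(120, 180))

# Simulated postings with three hypothetical term-count columns.
postings <- data.frame(
  label    = factor(ifelse(is_ds, "data_scientist", "other")),
  machine  = rpois(n, ifelse(is_ds, 3, 0.5)),
  learning = rpois(n, ifelse(is_ds, 2, 0.5)),
  excel    = rpois(n, ifelse(is_ds, 0.5, 2))
)
train_idx <- sample(n, 270)

# Classification tree with Gini-index splits.
fit <- rpart(label ~ ., data = postings[train_idx, ],
             method = "class", parms = list(split = "gini"))

# Evaluate on the held-out rows.
pred <- predict(fit, newdata = postings[-train_idx, ], type = "class")
confusion <- table(predicted = pred, actual = postings$label[-train_idx])
accuracy <- sum(diag(confusion)) / sum(confusion)

# Terms the tree relied on, ordered by rpart's variable importance.
key_terms <- names(fit$variable.importance)

print(confusion)
print(accuracy)
print(key_terms)
```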


Page 12: SUNG PARK PREDICT 422 Group Project Presentation

SUPERVISED LEARNING
• Bagging
• The classification accuracy rate is 96.8%
• Confusion matrix shows the same results as the classification tree
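To make the idea of bagging concrete, the sketch below hand-rolls it: fit one rpart tree per bootstrap resample of the training set, then take a majority vote across trees. This is an assumption-laden illustration, not the presentation's code: the slide does not say which package or how many trees were used, the data are simulated, and the 25 trees are an arbitrary choice.

```r
library(rpart)

set.seed(422)
n <- 300
is_ds <- rep(c(TRUE, FALSE), times = c(120, 180))

# Simulated postings with three hypothetical term-count columns.
postings <- data.frame(
  label    = factor(ifelse(is_ds, "data_scientist", "other")),
  machine  = rpois(n, ifelse(is_ds, 3, 0.5)),
  learning = rpois(n, ifelse(is_ds, 2, 0.5)),
  excel    = rpois(n, ifelse(is_ds, 0.5, 2))
)
train_idx <- sample(n, 270)
train <- postings[train_idx, ]
test  <- postings[-train_idx, ]

n_trees <- 25  # odd, so a two-class vote cannot tie

# One column of test-set predictions per bootstrap tree.
votes <- sapply(seq_len(n_trees), function(b) {
  boot <- train[sample(nrow(train), replace = TRUE), ]
  fit <- rpart(label ~ ., data = boot, method = "class")
  as.character(predict(fit, newdata = test, type = "class"))
})

# Majority vote across trees for each test posting.
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))
bagged_pred <- factor(bagged_pred, levels = levels(train$label))

confusion <- table(predicted = bagged_pred, actual = test$label)
accuracy <- mean(bagged_pred == test$label)

print(confusion)
print(accuracy)
```

When a single tree already separates the classes well, the bagged vote often agrees with it on most test cases, which is consistent with the identical confusion matrix reported on the slide.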


Page 13: SUNG PARK PREDICT 422 Group Project Presentation

QUESTIONS? COMMENTS?

Sung Park, MSPA Candidate
August 20, 2015

Northwestern University
PREDICT 422-DL Section 55
