Sung Park PREDICT 422 Group Project Presentation
TEXT MINING DATA SCIENCE JOBS IN R
Sung Park, MSPA Candidate August 20, 2015
Northwestern University PREDICT 422-DL Section 55
SUMMARY
• Introduction
• Resources
• Data Source
• Data Extraction
• Data Preparation
• Supervised Learning
INTRODUCTION
• Exploration of web scraping and text mining capabilities in R
• Unstructured data
  • Kaggle.com job postings
• Classification using machine learning algorithms
  • Data scientists vs. non-data scientists
RESOURCES
• Text Analytics Tutorial in R
  • Timothy D’Auria, Boston Decision, LLC
  • https://www.youtube.com/watch?v=j1V2McKbkLo
• Web Scraping Tutorial in R
  • Sharon Machlis, Computerworld
  • https://www.youtube.com/watch?v=TPLMQnGw0Vk
• Data Science in R: A Case Study Approach to Computational Reasoning and Problem Solving
  • Deborah Nolan and Duncan Temple Lang
• Google and Stack Overflow
DATA SOURCE
• Kaggle.com/jobs
• August 17, 2015
• 1,025 Job Postings
  • Data Scientist
  • Big Data Engineer
  • Data Science Architect
  • Data Analyst
  • Marketing Analyst
  • Statistician
  • Data Science Director
DATA EXTRACTION
• Extracted job links
  • XML Package
  • xpathSApply(doc, "//h3/a/@href[starts-with(., '/jobs')]")
• Extracted job posting text
  • rvest Package
  • html_text(html_nodes(htmlpage, "div.postcontent"))
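The two extraction calls above can be tried on a small inline page. This is only a sketch: the HTML strings are invented stand-ins, not Kaggle's actual markup, and it assumes the XML and rvest packages are installed.

```r
# Sketch only: both HTML strings are made-up stand-ins for a job listing
# page and a single posting page.
library(XML)    # htmlParse(), xpathSApply()
library(rvest)  # read_html(), html_nodes(), html_text()

listing_html <- '
<html><body>
  <h3><a href="/jobs/101/data-scientist">Data Scientist</a></h3>
  <h3><a href="/jobs/102/marketing-analyst">Marketing Analyst</a></h3>
  <h3><a href="/about">About</a></h3>
</body></html>'

posting_html <- '
<html><body>
  <div class="postcontent">We need a data scientist with R skills.</div>
  <div class="sidebar">Unrelated text</div>
</body></html>'

# Step 1: keep only the href attributes that start with "/jobs".
doc <- htmlParse(listing_html, asText = TRUE)
job_links <- unname(xpathSApply(doc, "//h3/a/@href[starts-with(., '/jobs')]"))

# Step 2: pull the text out of the posting body with a CSS selector.
htmlpage <- read_html(posting_html)
job_text <- html_text(html_nodes(htmlpage, "div.postcontent"))

print(job_links)
print(job_text)
```

The XPath predicate filters on the attribute value itself, so the "/about" link is dropped, and the CSS selector ignores the sidebar div.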
DATA PREPARATION
• Cleaned the text data
  • tm Package
  • tm_map()
    • Remove punctuation
    • Remove extra white space
    • Convert to lower case
    • Remove stopwords
      • “a”, “the”, “and”, “but”, etc.
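A minimal sketch of these cleaning steps with tm_map(), assuming the tm package is installed. The two posting strings are invented for illustration. The order used here is a choice, not necessarily the one used in the project: text is lower-cased first because the stopword list is lower case, and white space is stripped last because removing words leaves gaps behind.

```r
library(tm)

# Hypothetical posting text, invented for illustration.
postings <- c("We need a Data Scientist,   and the role uses R & Python!",
              "Marketing Analyst wanted: Excel, but no coding.")

corpus <- VCorpus(VectorSource(postings))

corpus <- tm_map(corpus, content_transformer(tolower))       # lower-casing
corpus <- tm_map(corpus, removePunctuation)                  # drop punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop "a", "the", ...
corpus <- tm_map(corpus, stripWhitespace)                    # collapse runs of spaces

# Pull the cleaned text back out as a plain character vector.
cleaned <- trimws(sapply(corpus, as.character))
print(cleaned)
```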
DATA PREPARATION
• Created the term document matrix (TDM)
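As a rough illustration of what a TDM looks like, the sketch below builds one with tm's TermDocumentMatrix() from three invented, already-cleaned documents. It assumes the tm package is installed and uses the function's default settings, which may differ from the settings used in the project.

```r
library(tm)

# Three made-up, already-cleaned documents.
docs <- c("data scientist machine learning",
          "marketing analyst data",
          "statistician data analysis")

corpus <- VCorpus(VectorSource(docs))

# Rows are terms, columns are documents, cells are term counts.
tdm <- TermDocumentMatrix(corpus)
m <- as.matrix(tdm)

print(dim(m))       # number of terms, number of documents
print(m["data", ])  # count of the term "data" in each document
```

Each column is one document, so transposing the matrix gives one row per posting, which is the layout most classifiers expect.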
DATA PREPARATION
• TDM consists of 959 job postings and 73 terms
  • 375 data scientists and 584 non-data scientists
• Split TDM into training set and test set
  • 864 job postings in training sample
  • 95 job postings in test sample
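One way such a split could be drawn in base R is sketched below. The matrix is filled with random placeholder counts, not the project's data; only the sizes (959 postings, 73 terms, 95 test rows, 375 and 584 labels) come from the slide. The seed and all variable names are arbitrary.

```r
set.seed(422)  # arbitrary seed so the split is reproducible

n_docs  <- 959
n_terms <- 73
n_test  <- 95

# Placeholder counts: one row per posting, one column per term.
x <- matrix(rpois(n_docs * n_terms, lambda = 0.5), nrow = n_docs)
y <- factor(c(rep("data_scientist", 375), rep("other", 584)))

# Draw the test rows at random; everything else is training data.
test_idx <- sample(n_docs, n_test)

train_x <- x[-test_idx, , drop = FALSE]
test_x  <- x[test_idx, , drop = FALSE]
train_y <- y[-test_idx]
test_y  <- y[test_idx]

print(c(train = nrow(train_x), test = nrow(test_x)))
```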
SUPERVISED LEARNING
• K-Nearest Neighbor
  • Find the K value with the highest classification accuracy
  • K = 8 gives the best result, with an 82.98% accuracy rate
  • Confusion matrix shows the model correctly predicted 22 out of 35 data scientist job postings
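The search for the best K can be sketched with knn() from the class package, which ships with standard R installations. The data here is simulated with two made-up term-count features, so the resulting K and accuracy will not match the figures on the slide; the range 1 to 15 is also an assumption.

```r
library(class)

set.seed(422)  # arbitrary seed

# Simulated data: two invented term-count features whose means differ by class.
n <- 300
y <- factor(rep(c("data_scientist", "other"), times = c(120, 180)))
x <- cbind(term_a = rpois(n, ifelse(y == "data_scientist", 3, 1)),
           term_b = rpois(n, ifelse(y == "data_scientist", 1, 2)))

test_idx <- sample(n, 30)
train_x <- x[-test_idx, ]
test_x  <- x[test_idx, ]
train_y <- y[-test_idx]
test_y  <- y[test_idx]

# Accuracy on the test set for each candidate K.
k_values <- 1:15
accuracy <- sapply(k_values, function(k) {
  pred <- knn(train_x, test_x, cl = train_y, k = k)
  mean(pred == test_y)
})

best_k <- k_values[which.max(accuracy)]

# Confusion matrix for the chosen K.
best_pred <- knn(train_x, test_x, cl = train_y, k = best_k)
conf_mat <- table(predicted = best_pred, actual = test_y)

print(best_k)
print(conf_mat)
```

Picking K on the same test set that is used to report accuracy tends to give an optimistic figure; cross-validation on the training set is a common alternative.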
SUPERVISED LEARNING
• Classification Decision Tree (Gini index)
  • The classification accuracy rate is 96.8%
  • Confusion matrix shows the model correctly predicted 30 out of 33 data scientist job postings
  • Key terms for tree construction:
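A classification tree with Gini splitting can be sketched with the rpart package, which ships with standard R installations; the slide does not say which tree package was used, so this is an assumption. The term columns (machine, marketing, team) and the data are invented, and the terms that the sketch reports are not the project's key terms.

```r
library(rpart)

set.seed(422)  # arbitrary seed

# Simulated postings with three invented term-count columns.
n <- 300
label <- factor(rep(c("data_scientist", "other"), times = c(120, 180)))
jobs <- data.frame(
  label     = label,
  machine   = rpois(n, ifelse(label == "data_scientist", 2, 0.2)),
  marketing = rpois(n, ifelse(label == "data_scientist", 0.2, 1.5)),
  team      = rpois(n, 1)  # uninformative term
)

test_idx <- sample(n, 30)
train <- jobs[-test_idx, ]
test  <- jobs[test_idx, ]

# method = "class" fits a classification tree; split = "gini" uses the Gini index.
fit <- rpart(label ~ ., data = train, method = "class",
             parms = list(split = "gini"))

pred <- predict(fit, newdata = test, type = "class")
accuracy <- mean(pred == test$label)
conf_mat <- table(predicted = pred, actual = test$label)

# Terms the fitted tree actually split on.
split_terms <- setdiff(unique(as.character(fit$frame$var)), "<leaf>")

print(accuracy)
print(conf_mat)
print(split_terms)
```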
SUPERVISED LEARNING
• Bagging
  • The classification accuracy rate is 96.8%
  • Confusion matrix shows the same results as the classification tree
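Bagging fits many trees on bootstrap resamples of the training set and combines them by majority vote. The slide does not name the implementation used, so the sketch below is a hand-rolled version built on rpart, with simulated data, invented term columns, and an arbitrary choice of 25 trees.

```r
library(rpart)

set.seed(422)  # arbitrary seed

# Simulated postings with three invented term-count columns.
n <- 300
label <- factor(rep(c("data_scientist", "other"), times = c(120, 180)))
jobs <- data.frame(
  label     = label,
  machine   = rpois(n, ifelse(label == "data_scientist", 2, 0.2)),
  marketing = rpois(n, ifelse(label == "data_scientist", 0.2, 1.5)),
  team      = rpois(n, 1)
)

test_idx <- sample(n, 30)
train <- jobs[-test_idx, ]
test  <- jobs[test_idx, ]

n_trees <- 25  # arbitrary number of bootstrap trees

# One column of test-set predictions per bootstrap tree.
votes <- sapply(seq_len(n_trees), function(b) {
  boot_idx <- sample(nrow(train), replace = TRUE)
  tree <- rpart(label ~ ., data = train[boot_idx, ], method = "class")
  as.character(predict(tree, newdata = test, type = "class"))
})

# Majority vote across trees for each test posting.
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))

accuracy <- mean(bagged_pred == as.character(test$label))
conf_mat <- table(predicted = bagged_pred, actual = test$label)

print(accuracy)
print(conf_mat)
```

When a single tree already separates the classes well, the bagged vote often agrees with it on nearly every test case, which is consistent with the identical results reported above.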
QUESTIONS? COMMENTS?