how to crack down big data?
How to crack down BIG DATA?
Required Skills for Data Scientist
Hello! I AM DAVID HUANG
I am here because I want to find more lovers of data science. You can find me at: tawei.huang1@gmail.com
My Experience
• Data Scientist Intern, Yoctol
• Data & Strategy Intern, Chocolabs
• Summer Intern Student, Institute of Mathematics, Academia Sinica
My Education Background
• Master in Statistics, NTU
• BSc. in Quantitative Finance, NTHU
• Research Student, PKU
“Big data is a big trend, but it is very difficult to hire a data scientist.
It’s also hard to find a job in TW XD”
1. Who is a Data Scientist?
The skill sets you need to be a data scientist.
In a big data project, we need these people!
Data Backend Engineer: Develops and operates backend systems related to data access, collection, processing, and storage.
Database Architect: Architects and designs database solutions for the enterprise, and leads the effort on database performance and optimization.
Data Analyst / Data Scientist: Uses advanced quantitative analysis, data mining techniques, and strong industry acumen to interpret, connect, and predict data, delivering insight and recommendations for decisions.
Domain Expert: Helps the data team understand the domain problem and the domain knowledge.
Data Analyst / Machine Learning
Lots of people say that they are different, but I think “every data analyst should be a data scientist, and vice versa!”
Explanatory Analytics
Theory-based statistical testing of causal hypotheses (commonly seen in economics)
Strength of relationship in statistical model
Data analyst
Predictive Analytics
Empirical method for predicting new observations (in statistical / math / CS ways)
Ability to accurately predict new individuals
Data scientist
Both fields are important for discovering knowledge.
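The contrast above can be sketched on one small model. This is a minimal illustration on synthetic data, not something taken from the slides: the data-generating line (intercept 1, slope 2), the 150/50 split, and all function names are assumptions made for the example. The explanatory question is "is the relationship real?" (a t statistic for the slope), while the predictive question is "how well do we predict unseen observations?" (error on held-out data).

```python
import random

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def slope_t_statistic(xs, ys, intercept, slope):
    """Explanatory view: t statistic for the hypothesis 'slope = 0'."""
    n = len(xs)
    mx = mean(xs)
    sxx = sum((x - mx) ** 2 for x in xs)
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    standard_error = (rss / (n - 2) / sxx) ** 0.5
    return slope / standard_error

def mean_squared_error(xs, ys, intercept, slope):
    """Predictive view: average squared error on the given observations."""
    return mean((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Synthetic data (assumption): y = 1 + 2x plus Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]

# Fit on the first 150 points, keep the last 50 as unseen observations.
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

intercept, slope = fit_line(train_x, train_y)
t_stat = slope_t_statistic(train_x, train_y, intercept, slope)
test_mse = mean_squared_error(test_x, test_y, intercept, slope)

print(f"slope={slope:.3f}, t statistic={t_stat:.1f}")  # explanatory analytics
print(f"held-out MSE={test_mse:.3f}")                  # predictive analytics
```

The same fitted line answers both questions; what changes is which number you report and which data you compute it on.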
Data Unicorn
A data unicorn has expertise in all fields… Mission impossible?
The Data Scientist Venn Diagram
[Figure: Venn diagram with three circles: Math & Statistics, Hacking Skills, Domain Expertise. Overlap labels: Machine Learning, Research, Program. Center: Unicorn.]
First become a (1) researcher, (2) machine learner, (3) programmer, and then find your own way to be a data scientist.
Skill Sets for Data Scientist – Math & Stat
1. Mathematics & Statistics
Basic Knowledge: Multivariable Calculus, Linear Algebra, Probability Theory, Statistics / Math Statistics, Convex Optimization, Discrete Analysis
Causal Inference: Regression Analysis / GLM, Experimental Design
Data Mining: Multivariate Analysis, Biz Analytics & Data Mining
Machine Learning: Machine Learning, Deep Learning (ANN/CNN)
Forecasting: Time Series Analysis
Skill Sets for Data Scientist – Programming
2. Programming Skills
Programming Skill: Python (Scripting Language), R (Statistical Software), Matlab (Super Fast but Expensive)
Database Querying: SQL & Relational Algebra, NoSQL / Cassandra / etc., HDFS / MapReduce, Hadoop and Hive / Pig, Spark & Scala
Software Engineering: A little bit of Java, Data Structures & Algorithms, Data Munging (Python!), Data Viz (d3.js / Tableau)
1. D3.js visualization: http://goo.gl/cVlTX7
2. Spark MLlib: http://goo.gl/VNMQ97
Skill Sets for Data Scientist – Business Sense
3. Business Professionalism
Logical Thinking: Hypothesis Thinking, Pyramid Principles (BizPro is a good choice!)
To be honest, the hard truth is that this part is very important, but it is the less important skill set!
Communication Skill: Presentation & Presence, Communication Skill, Upward Management
I think this is the niche for business school students. Specific knowledge about marketing, financial analysis, etc. helps a lot.
My Learning Path for you – Math matters!
(1: Math & Stat, 2: Programming, 3: Business Sense)
Freshman - Junior
1. Calculus, Linear Algebra, Probability Theory, Math Statistics
2. Programming: C / Java / R
3. Financial Market, Marketing, Management
Senior
1. Advanced Statistics, Data Mining, Econometrics
2. R Programming, Matlab (Basic)
3. Competitions, Advanced Finance, Macroeconomics
Current
1. Statistical Learning, Compressed Sensing
2. Python & SQL, Hadoop & Spark
3. BizPro Training, Logical Thinking, Marketing Analytics
2. Master in Data Science for free!
How to become the data unicorn without any tuition fee
Data Scientist 101: Johns Hopkins MOOC
The Coursera Specializations offered by Johns Hopkins University give a very good general exposure to the world of data science.
Executive Data Science
I think this specialization is designed for those who don’t want to become a data scientist but may work in a data-driven company.
URL: https://goo.gl/ZNBF7N
Data Science
I think this specialization is designed for those who don’t have a very strong academic background but want to become a data scientist.
URL: https://goo.gl/8OzBhe
Basic Math: Calculus & Linear Algebra
Calculus and linear algebra are fundamental tools for data scientists and statisticians. Having a solid foundation will help a lot.
Calculus I & II, NTHU
This course gives you a solid foundation in Euclidean space and multivariable calculus, which is very important for a data scientist.
URL: http://ocw.nthu.edu.tw/ocw/index.php?page=course&cid=7&
Linear Algebra, NCTU
A data scientist usually thinks about data in a matrix representation. The concepts of vector algebra help a lot in high-dimensional data analysis.
URL: http://goo.gl/KFdJTT
Advanced Math: Convex Optimization
This is a very advanced topic that we use when doing machine learning. However, I don’t think every data scientist needs to understand this field.
Convex Optimization, Stanford
This course should benefit anyone who uses or will use scientific computing or optimization in engineering or related work (e.g., machine learning, finance, operational research).
URL: http://stanford.edu/class/ee364a/
MOOC: https://goo.gl/KBQ473
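To give a feel for why convex optimization shows up in machine learning, here is a minimal sketch of gradient descent on a convex function. The objective f(x, y) = (x - 1)^2 + 2(y + 2)^2, the step size, and the iteration count are all assumptions chosen for illustration; they are not from the course. Because the function is convex, its only stationary point is the global minimum, so plain gradient descent with a small enough step reaches it.

```python
def objective(x, y):
    """A convex quadratic with its minimum at (1, -2)."""
    return (x - 1) ** 2 + 2 * (y + 2) ** 2

def gradient(x, y):
    """Partial derivatives of the objective with respect to x and y."""
    return 2 * (x - 1), 4 * (y + 2)

def gradient_descent(x, y, step=0.1, iterations=200):
    """Repeatedly move against the gradient by a fixed step size."""
    for _ in range(iterations):
        gx, gy = gradient(x, y)
        x, y = x - step * gx, y - step * gy
    return x, y

x_min, y_min = gradient_descent(x=5.0, y=5.0)
print(f"minimum near ({x_min:.4f}, {y_min:.4f}), value {objective(x_min, y_min):.6f}")
```

Many model-fitting problems (least squares, logistic regression) are convex in the same sense, which is why this simple loop is a reasonable mental model for how they are trained.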
Basic Stat: Probability & Math Statistics
If you don’t have a foundation in probability & math statistics, you can’t learn any advanced data analytics method. Please learn it!
Probability, NTHU
URL: http://goo.gl/G4MhIj
Math Statistics, NTHU
URL: http://goo.gl/nQ2cE2
Stat Method: Advanced Methods
These three fields are core data analytics methods. You will find them everywhere, like in econometrics, machine learning, and so on.
Regression Analysis, NTHU
URL: http://goo.gl/YQBAla
Multivariate Analysis, NTHU
URL: http://goo.gl/934GKd
Experimental Design, NTHU
URL: http://goo.gl/ED9HMr
Data Mining: Illinois & Stanford MOOC
Data mining is one of the most powerful tools for business analytics. It can be applied to user behavior data, questionnaire design, and financial markets.
Data Mining, UIUC
The Data Mining Specialization teaches data mining techniques for both structured data which conform to a clearly defined schema, and unstructured data which exist in the form of natural language text.
URL: https://goo.gl/Tyzm6Z
Mining Massive Datasets, Stanford
Introduces participants to modern distributed file systems and MapReduce, including what distinguishes good MapReduce algorithms from good algorithms in general. The rest of the course is devoted to algorithms for extracting models and information from large datasets.
URL: https://goo.gl/NYyxy9
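For readers who have not seen MapReduce before, the idea can be sketched in a few lines of plain Python. This is a single-machine toy word count, not Hadoop code and not material from the course; the function names and the two sample documents are assumptions made for the example. The map phase emits (key, value) pairs, the shuffle groups values by key, and the reduce phase combines each group.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single count."""
    return key, sum(values)

def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

docs = ["big data is big", "data science needs data"]
counts = word_count(docs)
print(counts)
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network, which is why the cost of a MapReduce algorithm is judged differently from that of an ordinary in-memory algorithm.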
Machine Learning: Stanford / NTU MOOC
Machine learning is the science of getting computers to act without being explicitly programmed.
Machine Learning, Stanford
This course provides a broad introduction to machine learning, data mining, and statistical pattern recognition.
URL: https://www.coursera.org/learn/machine-learning
Machine Learning, NTU
The students shall enjoy a story-like flow moving from "When Can Machines Learn" to "Why", "How" and beyond. (Very tough course!)
URL: https://www.coursera.org/course/ntumlone
3. What I’ve done in practice!
How to become the data unicorn without any tuition fee
SOP for Data Analytic Project
Data Task Formulation
Data Collection
Data Cleaning
Data Exploration
Data Modeling
Define Purpose
Model Selection
Performance Evaluation
Model Deployment
Initial Phase: 90% Efforts
Middle Phase: 90% Professions
Final Phase: 90% Domain
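The SOP above can be sketched end to end as a tiny script. Everything here is a toy assumption for illustration: the records, the field names (minutes, views), and the one-parameter model are hypothetical and are not the data or method used in the actual project. The point is only the order of the steps: collect, clean, explore, model, evaluate.

```python
from statistics import mean

def collect():
    """Data collection: hypothetical raw records; None marks a missing value."""
    return [
        {"user": "u1", "minutes": 10.0, "views": 2},
        {"user": "u2", "minutes": None, "views": 5},
        {"user": "u3", "minutes": 30.0, "views": 6},
        {"user": "u4", "minutes": 20.0, "views": 4},
        {"user": "u5", "minutes": 40.0, "views": 8},
    ]

def clean(records):
    """Data cleaning: drop records with missing values."""
    return [r for r in records if r["minutes"] is not None]

def explore(records):
    """Data exploration: simple summary statistics."""
    return {"n": len(records), "mean_views": mean(r["views"] for r in records)}

def fit(records):
    """Data modeling: least squares for views = k * minutes (no intercept)."""
    numerator = sum(r["minutes"] * r["views"] for r in records)
    denominator = sum(r["minutes"] ** 2 for r in records)
    return numerator / denominator

def evaluate(k, records):
    """Performance evaluation: mean absolute error on held-out records."""
    return mean(abs(r["views"] - k * r["minutes"]) for r in records)

records = clean(collect())
summary = explore(records)
train, test = records[:3], records[3:]   # hold out the last record
k = fit(train)
mae = evaluate(k, test)
print(summary, f"k={k:.2f}", f"held-out MAE={mae:.4f}")
```

In practice the collection and cleaning functions are where most of the effort goes, which matches the "Initial Phase" note above; the modeling step is often the shortest part of the code.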
25,054,386 vc: Monthly View Counts
751,631,580 values: Lots of user behavior!
1,785,244 users: Monthly Active Users
My workspace
R, Google Analytics, Spark
Big Data
All about math, statistics, and coding. But how about business knowledge?