Data analytics: first steps
TRANSCRIPT
Data analytics aka Machine Learning
Data analytics is an area where the available digital data is treated as a gold mine: tangible output is extracted from it which, when applied, affects a business and its efficiency.
Machine learning is the tool, in the form y = f(x), that correlates the parameters in the data to obtain a relation, which it learns from these parameters and keeps improving.
Data: a set of values of quantitative and qualitative variables; historical information or knowledge represented in a usable form.
Population (the entire group): the collection of data that represents the whole of the problem domain.
Sample (a portion of the group): a subset of the population taken for inference, which should be a true representation of the overall population.
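The population/sample distinction can be illustrated with a small sketch. The population here is simulated (10,000 made-up values), and the sample size of 500 is an arbitrary assumption; the point is only that a random sample's mean lands close to the population mean.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 simulated values (mean 100, std 15)
population = [random.gauss(100, 15) for _ in range(10_000)]

# Sample: a random subset of the population, used for inference
sample = random.sample(population, 500)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

A sample drawn without randomness (for example, only the first 500 records) may not represent the population, which is why random selection is used here.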
Data analytics – How to start
Data science or data analytics, by whatever name it is known to you, essentially has 3 areas to cover:
Business
Statistics
Programming
Business – Critical thinking
1. Objective analysis and evaluation of an issue in order to form a judgement
2. This is the stage to build the hypothesis for the problem domain in context
3. The model below could be a way to follow
Statistics – Mathematical analysis
Data is treated as variables, organized in the following hierarchy:
Data (variables)
  Numerical (quantitative)
    Discrete: whole numbers, e.g. 5, 10
    Continuous: any value within a permitted range, e.g. 5.3, 5.35, 5.45, 6.0
  Categorical (qualitative)
    Ordinal (logically ordered): e.g. Low, Med, High
    Nominal (unordered): e.g. Male, Female; different types of four-wheelers
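As a rough illustration of the hierarchy, the sketch below labels a column by its variable type. The function name `variable_type`, the column names, and the rule "a categorical column is ordinal only when an explicit level order is supplied" are all assumptions made for this example, not a standard API.

```python
def variable_type(values, ordered_levels=None):
    """Label a column with a variable type (simple heuristic, for illustration)."""
    def is_int(v):
        return isinstance(v, int) and not isinstance(v, bool)

    def is_number(v):
        return isinstance(v, (int, float)) and not isinstance(v, bool)

    if all(is_int(v) for v in values):
        return "numerical/discrete"
    if all(is_number(v) for v in values):
        return "numerical/continuous"
    # Categorical: ordinal only if the caller states the logical order
    if ordered_levels is not None and all(v in ordered_levels for v in values):
        return "categorical/ordinal"
    return "categorical/nominal"


# Hypothetical columns, using the example values from the slide
columns = {
    "count": [5, 10],
    "height": [5.3, 5.35, 5.45, 6.0],
    "risk": ["Low", "High", "Med"],
    "gender": ["Male", "Female"],
}
levels = {"risk": ["Low", "Med", "High"]}

result = {name: variable_type(vals, levels.get(name)) for name, vals in columns.items()}
for name, kind in result.items():
    print(f"{name}: {kind}")
```

The order of an ordinal variable cannot be inferred from the strings themselves, which is why it has to be passed in.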
Programming – Execution
R is widely used because of its history in statistics and its abundant statistical libraries.
Python, an interpreted language, provides a wide variety of packages for application development, along with its statistical libraries.
Data ingestion tools: Spark, Hadoop
Data analytics – Problem perspective
Solution hypothesis
Supervised learning
  Numerical data (target variable): Regression
    Linear regression, Time series
  Categorical data (target variable): Classification
    Decision trees, Random forest, KNN, Logistic regression
  Demand forecasting
Reinforcement learning
Semi-supervised: NLP and AI
Unsupervised
  Clustering
    K-means, Hierarchical clustering
  Dimensionality reduction
  Collaborative filtering
Classifying the problem
Data analytics – Problem Complexity
Solution complexity and data volume increase with the kind of business value being generated.
Credits: odoscope, Overview of analytics methods
Data analytics – The execution
Basic Terminology
• Attribute - Features are quantitative attributes of the samples being observed
• Axis - Features are orthogonal axes of their feature space, if they are linearly independent
• Column - Features are represented as columns in your dataset
• Dimension - A dataset's features, grouped together, can be treated as an n-dimensional coordinate space
• Input - Feature values are the input of data-driven machine learning algorithms
• Predictor - Features used to predict other attributes are called predictors; the attribute being predicted is the dependent variable
• View - Each feature conveys a quantitative trait or perspective about the sample being observed
• Independent variable - Autonomous features used to calculate others are like independent variables in algebraic equations
Structuring the data
The rule of Seven
The steps are iterative; any stage can be revisited
• Data collection (problem context)
• Data wrangling / data munging (data cleaning)
• Data exploration / analysis
• Data transformation
• Modelling
• Model evaluation
• Data visualization (intelligence)
Machine learning models work only on clean, structured data. 5 out of 7 steps are related to pre-processing of the data given to the model.
1. Data collection / selection
1. No bias in the data features
2. Relevant data features
3. Techniques to handle
   a) Data collection:
      1. Data from sources related to the problem, i.e. databases, web logs, emails, etc.
      2. Any audio, video, sensor data, etc.
      3. The 6 Vs of data: Variety, Velocity, Veracity, Volume, Value, Viability
   b) Data selection:
      1. PCA: unsupervised data
      2. LDA (linear discriminant analysis): supervised data
2. Data cleaning (garbage in, garbage out)
1. The data obtained is not clean and has the following issues:
   1. Outliers
   2. Missing data
   3. Malicious data
   4. Erroneous data
   5. Irrelevant data
   6. Inconsistent data
   7. Needs formatting
2. Techniques to handle
   1. Impute values by mean, median or mode
   2. Treat outliers by deleting the row if it is not at all related; otherwise analyze with more data
   3. Binning
   4. Creating new features from given features
   5. Dummy variables
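Three of the techniques above (imputation, binning, dummy variables) can be sketched in plain Python. The records, the field names `age` and `city`, and the bin boundaries of 30 and 60 are invented for illustration; in practice a library such as Pandas would usually do this work.

```python
import statistics

# Hypothetical raw records; None marks a missing age
rows = [
    {"age": 25, "city": "Pune"},
    {"age": None, "city": "Delhi"},
    {"age": 40, "city": "Pune"},
    {"age": 61, "city": "Mumbai"},
]

# 1. Impute missing values with the median of the observed values
observed_ages = [r["age"] for r in rows if r["age"] is not None]
median_age = statistics.median(observed_ages)
for r in rows:
    if r["age"] is None:
        r["age"] = median_age


# 2. Binning: replace a numeric value with a coarse category
def age_bin(age):
    if age < 30:
        return "young"
    if age < 60:
        return "middle"
    return "senior"


for r in rows:
    r["age_bin"] = age_bin(r["age"])

# 3. Dummy variables: one 0/1 column per category of a nominal feature
cities = sorted({r["city"] for r in rows})
for r in rows:
    for city in cities:
        r[f"city_{city}"] = 1 if r["city"] == city else 0

for r in rows:
    print(r)
```

The median is used rather than the mean because it is less sensitive to outliers; the mean or mode would follow the same pattern.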
3. Data analysis (data exploration)
1. Find the relevance of the feature set. Apply all the basic statistical exploration, i.e. moments.
2. Obtain the statistical relation.
3. Perform basic visualizations to arrive at a concrete feature set.
4. Techniques to handle
   1. Univariate analysis (mean, mode, normal distribution, variance, skewness, kurtosis)
   2. Bivariate analysis (scatter plot, box plot, histogram)
   3. Multivariate analysis (probability distribution functions, PDFs)
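The univariate measures listed above can be computed with the standard library alone. This is a minimal sketch on a made-up list of values; skewness and kurtosis are computed here as the third and fourth standardized moments of the population (other definitions, such as excess kurtosis, subtract 3).

```python
import statistics

# Hypothetical observations of a single numeric feature
values = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(values)
mean = statistics.mean(values)
mode = statistics.mode(values)
variance = statistics.pvariance(values)  # population variance
std = variance ** 0.5


def standardized_moment(k):
    """k-th standardized moment: average of ((x - mean) / std) ** k."""
    return sum(((v - mean) / std) ** k for v in values) / n


skewness = standardized_moment(3)  # > 0 means a longer right tail
kurtosis = standardized_moment(4)  # a normal distribution gives 3

print(f"mean={mean} mode={mode} variance={variance}")
print(f"skewness={skewness:.4f} kurtosis={kurtosis:.4f}")
```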
Credits: https://jixta.wordpress.com/
Data analysis – Adopt a few basic visualizations from the list
4. Data transformation (data on the same scale)
1. Ensure that the remaining features are informative. Transformation changes the number of features or the feature values; this is also known as feature engineering.
2. Dimensionality reduction
3. Curse of dimensionality
4. Techniques to handle
   1. PCA: principal component analysis
   2. Kernel trick
   3. Normalization
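Normalization, the last technique above, puts features on the same scale. The sketch below shows two common variants, min-max scaling and z-score standardization; the `income` and `age` columns are invented, and both functions assume the column is not constant (otherwise they would divide by zero).

```python
import statistics


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical features on very different scales
income = [20_000, 35_000, 50_000, 95_000]
age = [22, 30, 41, 58]

income_scaled = min_max_scale(income)
age_scaled = min_max_scale(age)
income_z = z_score(income)

print(income_scaled)
print(age_scaled)
print([round(z, 3) for z in income_z])
```

After scaling, both features lie in the same range, so a distance-based algorithm such as KNN or K-means is not dominated by the feature with the larger raw values.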
6. Machine learning modeling
1. Split the data into train and test sets.
2. Keep some data that is never used during training or testing, or obtain a fresh sample; this is termed “out of sample”.
3. Apply the appropriate ML algorithm to the train data.
4. Check the accuracy with the test data.
5. Observe the bias and variance
   a) Bias is how far the average prediction is from the actual value
   b) Variance is how widely the predictions are spread around their average
   c) Error = Variance + Bias² (plus irreducible noise)
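The split, fit and check steps above can be sketched without any library. The data is simulated from an assumed relation y = 3x + 2 with noise, the 80/20 split ratio is an arbitrary choice, and ordinary least squares on one feature stands in for "the appropriate ML algorithm".

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical data: y = 3x + 2 plus Gaussian noise
xs = [i / 10 for i in range(100)]
data = [(x, 3 * x + 2 + random.gauss(0, 0.5)) for x in xs]

# Step 1: shuffle, then split 80% train / 20% test
random.shuffle(data)
cut = int(0.8 * len(data))
train, test = data[:cut], data[cut:]


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(points, slope, intercept):
    """Mean squared error of the fitted line on the given points."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points) / len(points)


# Step 3: fit on the train data only
slope, intercept = fit_line(train)

# Step 4: compare the error on train and test data
train_mse = mse(train, slope, intercept)
test_mse = mse(test, slope, intercept)

print(f"slope={slope:.3f} intercept={intercept:.3f}")
print(f"train MSE={train_mse:.3f} test MSE={test_mse:.3f}")
```

A test error much larger than the train error suggests high variance (overfitting); both errors being large suggests high bias (underfitting).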
6.1 Machine learning modeling
2. Apply the appropriate algorithm as described by the solution hypothesis
Ref: cheatsheet
6.2 Machine learning model
1. Model performance
   1. Model validation
      1. MSE (mean squared error)
      2. Hypothesis testing
      3. Cross-validation
   2. Algorithm tuning
      1. Tuning the coefficient parameters
      2. Increasing the splits
   3. Feature engineering (iterate again for features)
   4. Cross-validation
      1. K-fold
   5. Ensemble methods (combining ML algorithms)
      1. Voting (selection based on voting on performance)
      2. Bagging (bootstrapping + aggregating)
      3. Boosting (weak learners to a strong learner)
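Of the ensemble methods above, voting is the simplest to sketch. The example below uses plain majority ("hard") voting over the outputs of three classifiers; the model names and their predictions are invented, and the models are assumed to be already trained.

```python
from collections import Counter

# Hypothetical predictions from three already-trained classifiers,
# one label per sample, in the same sample order
predictions = {
    "decision_tree": ["spam", "ham", "spam", "ham"],
    "knn": ["spam", "spam", "spam", "ham"],
    "logistic_regression": ["ham", "ham", "spam", "ham"],
}


def majority_vote(predictions_by_model):
    """Return, for each sample, the label predicted by most models."""
    voted = []
    for votes in zip(*predictions_by_model.values()):
        label, _count = Counter(votes).most_common(1)[0]
        voted.append(label)
    return voted


ensemble = majority_vote(predictions)
print(ensemble)
```

With an odd number of models and two classes there can be no ties. Bagging follows the same aggregation idea, except that each model is trained on a different bootstrap resample of the training data.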
6.3.1 Machine learning model performance
1. Confusion matrix (hypothesis testing)
Measurement terms
1. Precision
2. Recall
3. Accuracy
4. Specificity
5. False positive rate (fall-out)
6. False negative rate (miss rate)
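All six measurement terms follow from the four cells of a binary confusion matrix. The labels below are made up (1 = positive, 0 = negative); the formulas are the standard definitions.

```python
# Hypothetical ground truth and model predictions (1 = positive, 0 = negative)
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(actual, predicted))
tp = sum(a == 1 and p == 1 for a, p in pairs)  # true positives
fn = sum(a == 1 and p == 0 for a, p in pairs)  # false negatives
fp = sum(a == 0 and p == 1 for a, p in pairs)  # false positives
tn = sum(a == 0 and p == 0 for a, p in pairs)  # true negatives

precision = tp / (tp + fp)          # of predicted positives, how many are right
recall = tp / (tp + fn)             # of actual positives, how many were found
accuracy = (tp + tn) / len(pairs)   # share of all predictions that are right
specificity = tn / (tn + fp)        # of actual negatives, how many were found
fall_out = fp / (fp + tn)           # false positive rate = 1 - specificity
miss_rate = fn / (fn + tp)          # false negative rate = 1 - recall

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
print(f"specificity={specificity:.2f} fall-out={fall_out:.2f} miss rate={miss_rate:.2f}")
```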
6.3.2 Machine learning model performance
1. Cross-fold validation
• Random division of the data set into subsets
• The ML algorithm is checked on each subset
• The overall efficiency is taken as the final accuracy of the model
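The three bullets above can be sketched as a small K-fold loop. The data set is simulated (two well-separated classes on one feature), k = 5 is an arbitrary choice, and a 1-nearest-neighbour rule stands in for whichever ML algorithm is being validated.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical 1-D data set: class 0 clusters near 0, class 1 near 5
class_0 = [(random.gauss(0, 1), 0) for _ in range(50)]
class_1 = [(random.gauss(5, 1), 1) for _ in range(50)]
data = class_0 + class_1


def k_fold_splits(items, k):
    """Shuffle a copy of items and yield (train, test) pairs for k folds."""
    shuffled = items[:]
    random.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, test


def predict_1nn(train, x):
    """Predict the label of the training point closest to x."""
    return min(train, key=lambda row: abs(row[0] - x))[1]


fold_accuracies = []
for train, test in k_fold_splits(data, k=5):
    correct = sum(predict_1nn(train, x) == label for x, label in test)
    fold_accuracies.append(correct / len(test))

# The overall figure is the average of the per-fold accuracies
overall_accuracy = sum(fold_accuracies) / len(fold_accuracies)

print([round(a, 2) for a in fold_accuracies])
print(f"overall accuracy: {overall_accuracy:.2f}")
```

Every record is used for testing exactly once and for training k - 1 times, which gives a steadier estimate than a single train/test split.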
7. Data visualization
1. Turning the data analysis into a story: descriptive, prescriptive or predictive
2. Effective use of visuals and graphs
3. Tools like Tableau, D3.js, Matplotlib, Chart.js
Tools in practice
Core – Python libraries
NumPy (mathematical computing functions / N-dimensional arrays)
Pandas (data analysis and data munging through in-memory data representation)
Matplotlib (2D visualization library)
Scikit-learn (machine learning algorithms)
For a high-level language user, Python is the best tool available to use
Tools sources
1. Anaconda
   1. Use IPython universal editor
   2. Python 2.7+ or 3.5
   3. Be careful about the version, because the supported functions differ
   4. A good starting tool
   5. Spyder: interactive editor tool for basic Python learning
2. Enthought Canopy
   1. Interactive environment
3. PyCharm by JetBrains: interactive IDE and debugger tool
Tools cheat sheets
Must visit sites
KDnuggets
Kaggle
Data Science Central
DataCamp
https://www.class-central.com/
http://analyticsvidhya.com/
https://www.odsc.com/
http://www.pythonlearn.com/
http://datascienceplus.com/
Practice data sets
http://ipython-books.github.io/minibook/
http://learnds.com/
https://vincentarelbundock.github.io/Rdatasets/