chapter 1: introduction, exploring data
TRANSCRIPT
Introduction
Terminologies Identification
Summary Statistics
Basic Visualization
Visualization for High-Dimensional Data
OLAP and Multidimensional Data Analysis
Supplementary

Chapter 1: Introduction, Exploring Data
Richard Liu
School of Mathematics, XMU
February 26, 2020
Richard Liu Chapter 1: Introduction, Exploring Data 1 / 61
Content
1 Introduction
2 Terminologies Identification
3 Summary Statistics
4 Basic Visualization
5 Visualization for High-Dimensional Data
6 OLAP and Multidimensional Data Analysis
7 Supplementary
Section 1
Introduction
Introduction
Four topics are discussed in this chapter:
Terminologies Identification
Summary Statistics
Visualization
On-Line Analytical Processing (OLAP): used for exploring multidimensional arrays of values.
This chapter is closely related to the area known as Exploratory Data Analysis (EDA). Other parts of EDA are Cluster Analysis and Anomaly Detection, which are covered in Chapters 8 to 10.
Terminologies Identification
Data Science
Machine Learning
Data Mining
Business Analytics
Terminologies Identification
Classification
Regression
Clustering
Terminologies Identification
Numerical Data
Categorical Data
Binary Data
Terminologies Identification
Low-dimensional Data
High-dimensional Data
Machine Learning Structures
Training Set
Test Set
Validation Set
So, what is data science?
Section 3
Summary Statistics
Summary Statistics
A summary statistic is a quantity that describes an overall characteristic of a set of values.

Definition 1: Frequency
Given a random variable x that can take the values {v_1, ..., v_n} (v_i ≠ v_j for i ≠ j), and a specific data set S containing m observed values of x,

$$\mathrm{Frequency}(v_i) = \frac{\#\{v_i\}}{m}$$

where #{v_i} is the number of occurrences of the value v_i in S.
Summary Statistics

Definition 2: Mode
Given a data set S,

$$\mathrm{Mode}(S) = \arg\max_{v_i} \mathrm{Frequency}(v_i)$$

In real data, the mode usually occurs more than once.
It is frequently used as an indicator of missing values. (Why?)
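Definitions 1 and 2 can be tried out with a short sketch. It assumes that counts are divided by the number of observations in S, and the sample (heights with 0 standing in for "not recorded") is hypothetical. One possible answer to the "Why?" above is that a placeholder for missing entries can easily become the most frequent value.

```python
from collections import Counter


def frequencies(values):
    """Return {value: relative frequency} for a list of observations."""
    m = len(values)
    counts = Counter(values)
    return {v: c / m for v, c in counts.items()}


def modes(values):
    """Return every value that attains the maximum frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


# Hypothetical sample: heights in cm, where 0 marks a value that was not recorded.
heights = [170, 0, 165, 0, 180, 0, 172, 165]

freq = frequencies(heights)
mode_values = modes(heights)

print(freq[0])       # relative frequency of the placeholder value 0
print(mode_values)   # the placeholder is the mode of this sample
```

Returning a list from `modes` is a design choice of this sketch: it keeps ties visible instead of silently picking one value.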
Summary Statistics

Definition 3: Percentiles
The pth percentile x_p is defined as a value of x such that p% of the observed values of x are less than x_p.
By convention, x_{0%} = min(x) and x_{100%} = max(x).

Definition 4: Mean
Assume {x_1, ..., x_m} is an ordered set of observed values, denoted by x. Then

$$\mathrm{mean}(x) = \bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i$$
Summary Statistics

Notation 1
Define x↓ = {x_(1), ..., x_(m)} as a permutation of x with x_(1) ≥ x_(2) ≥ ... ≥ x_(m).

Definition 5: Median

$$\mathrm{median}(x) = \mathrm{median}(x^{\downarrow}) = \begin{cases} x_{(r+1)} & m = 2r+1 \\ \frac{1}{2}\left(x_{(r)} + x_{(r+1)}\right) & m = 2r \end{cases}$$

where m = |x|, the cardinality of the set x.

Mean and median are measures of the location of a set of values.
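A minimal sketch of the mean and the median, following the definitions literally (descending order, 1-based positions translated to 0-based list indexes). The function names and sample values are illustrative assumptions; the standard-library `statistics.median` is used only as a cross-check.

```python
import statistics


def mean(x):
    """Arithmetic mean: the sum of the values divided by their count."""
    return sum(x) / len(x)


def median(x):
    """Median computed on the descending permutation of x."""
    xs = sorted(x, reverse=True)       # x_(1) >= x_(2) >= ... >= x_(m)
    m = len(xs)
    r = m // 2
    if m % 2 == 1:                     # m = 2r + 1: take x_(r+1), index r in 0-based terms
        return xs[r]
    return 0.5 * (xs[r - 1] + xs[r])   # m = 2r: average of x_(r) and x_(r+1)


odd = [3, 1, 4, 1, 5]
even = [3, 1, 4, 1, 5, 9]

print(mean(odd), median(odd))
print(mean(even), median(even))
print(median(even) == statistics.median(even))
```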
Summary Statistics

Definition 6: Range

$$\mathrm{range}(x) = \max(x) - \min(x)$$

Definition 7: Variance

$$\mathrm{variance}(x) = s_x^2 = \frac{1}{m-1}\sum_{i=1}^{m}(x_i - \bar{x})^2$$

Definition 8: Standard Deviation

$$\mathrm{sd}(x) = s_x = \sqrt{s_x^2}$$

These three are measures of the spread of a set of values.
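The three spread measures can be sketched directly from their definitions. The sample data are made up for illustration; `statistics.variance` and `statistics.stdev` from the standard library also use the m - 1 denominator, so they serve as a cross-check.

```python
import math
import statistics


def value_range(x):
    """range(x) = max(x) - min(x)."""
    return max(x) - min(x)


def variance(x):
    """Sample variance with the m - 1 denominator."""
    m = len(x)
    x_bar = sum(x) / m
    return sum((xi - x_bar) ** 2 for xi in x) / (m - 1)


def sd(x):
    """Standard deviation: the square root of the variance."""
    return math.sqrt(variance(x))


data = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative values only

print(value_range(data))
print(variance(data), statistics.variance(data))
print(sd(data), statistics.stdev(data))
```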
Other Quantities
The previous statistics are sensitive to outliers and therefore not robust, so some alternatives are considered.

Definition 9: Absolute Average Deviation (AAD)

$$\mathrm{AAD}(x) = \frac{1}{m}\sum_{i=1}^{m}|x_i - \bar{x}|$$

Definition 10: Median Absolute Deviation (MAD)

$$\mathrm{MAD}(x) = \mathrm{median}\left(\{|x_i - \bar{x}|\}_{i=1}^{m}\right)$$
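A small sketch of AAD and MAD on a sample with and without an outlier. By default `mad` takes deviations from the mean, as in the definition above. The optional `center` argument is an assumption of this sketch: it allows the commonly used variant that takes deviations from the median instead. The data are invented for illustration.

```python
import statistics


def aad(x):
    """Absolute average deviation: mean of |x_i - mean(x)|."""
    x_bar = sum(x) / len(x)
    return sum(abs(xi - x_bar) for xi in x) / len(x)


def mad(x, center=None):
    """Median of |x_i - center|; center defaults to the mean of x."""
    if center is None:
        center = sum(x) / len(x)
    return statistics.median([abs(xi - center) for xi in x])


clean = [1, 2, 3, 4, 5]
dirty = [1, 2, 3, 4, 100]   # the same sample with one outlier

for name, x in (("clean", clean), ("dirty", dirty)):
    print(name,
          "sd =", round(statistics.stdev(x), 3),
          "AAD =", aad(x),
          "MAD (mean-centered) =", mad(x),
          "MAD (median-centered) =", mad(x, center=statistics.median(x)))
```

On this sample the median-centered variant is unchanged by the outlier, while the mean-centered one moves with the mean. This is only an observation about the example, not a general proof.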
Other Quantities

Definition 11: Interquartile Range (IQR)

$$\mathrm{IQR}(x) = x_{75\%} - x_{25\%}$$
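The IQR can be sketched with the standard library. Percentile conventions differ between tools; using `statistics.quantiles` with `method="inclusive"` is an assumption of this example, not something the slides prescribe. The data are illustrative.

```python
import statistics


def iqr(x):
    """IQR(x) = x_75% - x_25%, with quartiles from statistics.quantiles."""
    q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
    return q3 - q1


clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
dirty = [1, 2, 3, 4, 5, 6, 7, 8, 900]   # the largest value replaced by an outlier

print("IQR:", iqr(clean), iqr(dirty))
print("range:", max(clean) - min(clean), max(dirty) - min(dirty))
```

In this example the range reacts strongly to the single outlier while the IQR does not change.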
Multivariate Case
Sometimes a data set consists of several sets of values, each belonging to a different attribute. In other words,

$$x = (x_1, \cdots, x_n)$$

where n is the number of attributes and x_i is the set of values of the ith attribute over all observed data.
In this case, every observed data point is a vector, not a number. So we define the mean as

$$\bar{x} = (\bar{x}_1, \cdots, \bar{x}_n)$$
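Multivariate Case

Definition 12: Covariance

$$\mathrm{covariance}(x_i, x_j) = \frac{1}{m-1}\sum_{k=1}^{m}(x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)$$

where x_{ki} is the value of the ith attribute for the kth observed data point.

Definition 13: Correlation

$$r_{ij} = \mathrm{correlation}(x_i, x_j) = \frac{\mathrm{covariance}(x_i, x_j)}{s_i s_j}$$

where s_i and s_j are the standard deviations of x_i and x_j.

Evidently covariance(x_i, x_i) = variance(x_i).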
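A minimal sketch of covariance and correlation for two attributes stored as columns of a small table. The table (rows are data points, columns are attributes) is hypothetical. The sketch also checks that the covariance of an attribute with itself equals its variance.

```python
import math
import statistics


def covariance(xi, xj):
    """Sample covariance of two equally long attribute columns (m - 1 denominator)."""
    m = len(xi)
    mean_i, mean_j = sum(xi) / m, sum(xj) / m
    return sum((a - mean_i) * (b - mean_j) for a, b in zip(xi, xj)) / (m - 1)


def correlation(xi, xj):
    """Covariance divided by the product of the two standard deviations."""
    s_i = math.sqrt(covariance(xi, xi))
    s_j = math.sqrt(covariance(xj, xj))
    return covariance(xi, xj) / (s_i * s_j)


# Hypothetical data: each row is one data point with three attributes.
rows = [(1.0, 2.0, 8.0), (2.0, 4.0, 6.0), (3.0, 6.0, 4.0), (4.0, 8.0, 2.0)]
x1 = [row[0] for row in rows]
x2 = [row[1] for row in rows]   # x2 = 2 * x1
x3 = [row[2] for row in rows]   # x3 decreases as x1 increases

print(covariance(x1, x1), statistics.variance(x1))
print(correlation(x1, x2))
print(correlation(x1, x3))
```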
Section 4
Basic Visualization
Motivation
People can absorb large amounts of information from one graph quickly.
Visualization makes use of the domain knowledge that is 'locked up in people's heads.' (This is hard for data mining.)
One example
General Concepts
Data objects, attributes, and the relationships among data objects are translated into graphical elements such as points, lines, shapes, and colors.
Note that the representation depends on the type of attribute (nominal, ordinal, continuous). When the values themselves are ordered, it is fine to represent them in a coordinate system (with x, y, z axes).
You need to preserve some important information about relative attributes (such as physical location).
People tend to believe that data points that are visually close to each other have similar values for their attributes.
Challenge
How can a graph be made easy to observe? This is what visualization pays the most attention to.
Data Arrangement
Selection
Sometimes it is extremely hard to show all data objects and attributes on one graph, so we use selection.
Example: many attributes -> a series of two-dimensional plots.
Visualization Techniques
Stem and Leaf Plots
Splitting values into groups, each containing those values that are the same except for the last digit. Suitable for small values.
Histograms
Displaying the distribution of values for an attribute by dividing the possible values into bins and showing the number of objects that fall into each bin.
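Both techniques reduce to simple grouping, which the following sketch shows in text form. It assumes non-negative integer values for the stem and leaf plot and equal-width bins for the histogram. The scores, bin width, and function names are illustrative assumptions.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group non-negative integers by all digits except the last one."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # the last digit is the leaf
        groups[stem].append(leaf)
    return dict(groups)


def histogram(values, low, width, n_bins):
    """Count values per equal-width bin [low + k*width, low + (k+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        k = int((v - low) // width)
        if 0 <= k < n_bins:          # values outside all bins are ignored
            counts[k] += 1
    return counts


scores = [35, 36, 42, 47, 47, 51, 58, 59, 63]   # made-up sample

plot = stem_and_leaf(scores)
for stem in sorted(plot):
    print(f"{stem} | {' '.join(str(leaf) for leaf in plot[stem])}")

counts = histogram(scores, low=30, width=10, n_bins=4)
for k, count in enumerate(counts):
    print(f"[{30 + 10 * k}, {40 + 10 * k}): {'#' * count}")
```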
Examples
Visualization Techniques
Box Plot
Shows the distribution of the values of a single numerical attribute.
Pie Chart
Similar to a histogram, but typically used for categorical attributes.
The relative sizes of the slices are hard to distinguish, so pie charts are not preferred in technical work.
Examples
Visualization Techniques
Scatter Plot
Widely used for judging the relationship between two attributes given a series of data objects.
Examples
Visualization Techniques
Percentile Plots
Empirical Cumulative Distribution Functions (ECDF)
A function graph. For any given x, it shows the fraction (or probability) of the points that are less than or equal to x.

Notation 2
$F(x) = P(X \le x)$, where X is a random variable.
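An empirical CDF can be sketched as a step function built from a sorted sample. The sketch counts points that are less than or equal to x, matching F(x) = P(X ≤ x). The helper name `ecdf` and the sample are assumptions made for illustration.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return a function F with F(x) = fraction of sample points <= x."""
    xs = sorted(sample)
    m = len(xs)

    def F(x):
        # bisect_right gives the number of sorted values that are <= x
        return bisect_right(xs, x) / m

    return F


sample = [2, 3, 3, 5, 8]
F = ecdf(sample)

for x in (1, 2, 3, 4, 8):
    print(x, F(x))
```

Plotting F over a grid of x values gives the staircase-shaped graph usually shown for an ECDF.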
Example
One special case: demonstrating several attributes in one graph
Visualization Techniques (Details Removed)
Contour Plots
Surface Plots
Vector Field Plots
Lower-Dimensional Slices
Animation
Section 5
Visualization for High-Dimensional Data
Visualization Techniques
Matrix
Correlation Matrix: one important step in data analysis, used to capture the features of the whole data set. A simple kind of clustering.
Parallel Coordinates
One coordinate axis for each attribute, but the different axes are parallel to one another instead of perpendicular. Sometimes this can also be used to find features.
Examples
Section 6
OLAP and Multidimensional Data Analysis
Discretizing
Discretizing converts a continuous attribute into a categorical attribute.
In this way we can arrange the data within a table or a multidimensional data representation.
Cross tabulation can also be performed after doing so.
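A minimal sketch of discretizing followed by cross tabulation, using only the standard library. The records, cut points, and labels are hypothetical and chosen only to make the resulting table easy to read.

```python
from collections import Counter


def discretize(value, edges, labels):
    """Return the label of the first bin whose upper edge exceeds the value."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]   # values at or above the last edge fall in the last bin


# Hypothetical records: (continuous measurement, class label).
records = [(1.4, "A"), (1.3, "A"), (4.7, "B"), (4.5, "B"),
           (6.0, "C"), (5.1, "C"), (4.9, "B")]
edges = [2.5, 5.0]                     # assumed cut points
labels = ["low", "medium", "high"]

# Cross tabulation: count the objects for every (bin, class) combination.
table = Counter((discretize(v, edges, labels), cls) for v, cls in records)

classes = sorted({cls for _, cls in records})
print("bin".ljust(8) + "".join(c.rjust(4) for c in classes))
for label in labels:
    print(label.ljust(8) + "".join(str(table[(label, c)]).rjust(4) for c in classes))
```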
Examples
Cross Tabulation
Aggregation
Aggregation is one of the most general methods for analyzing multidimensional data objects.
One example is summation.
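Summation over a dimension can be sketched as grouping by the dimensions that are kept and adding up the measure. The fact table below (product, location, year, sales) is invented for illustration, and `aggregate` is a hypothetical helper, not part of any OLAP tool.

```python
from collections import defaultdict

# Hypothetical fact table: (product, location, year, sales).
facts = [
    ("pen", "north", 2019, 10),
    ("pen", "south", 2019, 5),
    ("ink", "north", 2019, 7),
    ("pen", "north", 2020, 12),
    ("ink", "south", 2020, 3),
]


def aggregate(rows, keep):
    """Sum the last column, grouped by the dimension indexes listed in keep."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in keep)
        totals[key] += row[-1]
    return dict(totals)


by_product = aggregate(facts, keep=[0])            # sum over location and year
by_product_year = aggregate(facts, keep=[0, 2])    # sum over location only
grand_total = aggregate(facts, keep=[])            # sum over every dimension

print(by_product)
print(by_product_year)
print(grand_total)
```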
Examples
Dimensionality Reduction
In descriptive data analysis, this means reducing the number of attributes shown in a table or graph by aggregation.
In regression, this means selecting fewer explanatory variables in order to diminish the correlation among them.
Some examples: PCA, SVD.
These subjects are ignored
Pivoting
Slicing and Dicing
Roll-Up
Drill-Down
Section 7
Supplementary
Preliminary
Sample Space
Probability
Distribution Function
Expectation
Variance
Central Limit Theorem
Propositions in Mathematical Statistics
Assume $x_1, \cdots, x_n \sim N(\mu, \sigma^2)$ are independent. Then:

Proposition 1
Let $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. Then

$$E(\bar{x}) = \mu$$

Proposition 2
Let $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$. Then

$$E(s^2) = \sigma^2$$
Propositions in Mathematical Statistics

Proposition 3
Let $f(c) = \sum_{i=1}^{n}(x_i - c)^2$. Then

$$\bar{x} = \arg\min_{c} f(c)$$

Proposition 4
Let $f(c) = \sum_{i=1}^{n}|x_i - c|$. Then

$$\mathrm{median}(x) = \arg\min_{c} f(c)$$
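Propositions 3 and 4 can be checked numerically on a small sample. The sketch below is a grid search rather than a proof. The sample (odd length, so that the median is the unique minimizer) and the grid of candidate values c are assumptions chosen for illustration.

```python
import statistics


def sum_squared(x, c):
    """f(c) from Proposition 3: sum of squared deviations from c."""
    return sum((xi - c) ** 2 for xi in x)


def sum_absolute(x, c):
    """f(c) from Proposition 4: sum of absolute deviations from c."""
    return sum(abs(xi - c) for xi in x)


x = [1, 2, 3, 4, 100]
grid = [k / 10 for k in range(0, 1001)]   # candidates c = 0.0, 0.1, ..., 100.0

best_sq = min(grid, key=lambda c: sum_squared(x, c))
best_abs = min(grid, key=lambda c: sum_absolute(x, c))

print(best_sq, statistics.mean(x))      # minimizer of the squared loss vs. the mean
print(best_abs, statistics.median(x))   # minimizer of the absolute loss vs. the median
```

The outlier 100 pulls the squared-loss minimizer far from most of the data, while the absolute-loss minimizer stays at the middle value. This links these propositions to the robustness remark in the Other Quantities slides.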
Python in Numerical Computation
For more information, see: https://weakcha.github.io/WISERCLUB-Final/Experiments.html
Thank you!