Data Basics

Post on 15-Jan-2016

Page 1: Data Basics. Data Matrix Many datasets can be represented as a data matrix. Rows corresponding to entities Columns represents attributes. N: size of the

Data Basics

Data Matrix
• Many datasets can be represented as a data matrix.
• Rows correspond to entities.
• Columns represent attributes.

• n: size of the data (number of rows)
• d: dimensionality of the data (number of columns)

• Univariate analysis: the analysis of a single attribute.
• Bivariate analysis: simultaneous analysis of two attributes.
• Multivariate analysis: simultaneous analysis of multiple attributes.
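
As a rough illustration of the data matrix idea, the sketch below stores a small table as a list of rows and reads off its size (number of rows) and dimensionality (number of columns). The values, the attribute names, and the helper `column` are hypothetical and only serve the example.

```python
# Hypothetical toy data: each row is one entity, each column one attribute.
attribute_names = ["sepal_length", "sepal_width", "petal_length"]
D = [
    [5.9, 3.0, 4.2],
    [6.9, 3.1, 4.9],
    [6.6, 2.9, 4.6],
    [4.6, 3.2, 1.4],
]

n = len(D)       # size of the data: number of rows (entities)
d = len(D[0])    # dimensionality: number of columns (attributes)


def column(matrix, j):
    """Return attribute j as a list of n values."""
    return [row[j] for row in matrix]


univariate = column(D, 0)                           # a single attribute
bivariate = list(zip(column(D, 0), column(D, 1)))   # two attributes together

print(n, d)
```

Univariate analysis would work on `univariate` alone, bivariate analysis on the pairs in `bivariate`, and multivariate analysis on all of `D`.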

Example for Data Matrix

Attributes
• Categorical Attributes
  • composed of a set of symbols
  • has a set-valued domain
  • E.g., Sex with domain(Sex) = {M, F}, Education with domain(Education) = {High School, BS, MS, PhD}.

• Two types of categorical attributes
  – Nominal
    • values in the domain are unordered
    • only equality comparisons are allowed
    • E.g., Sex
  – Ordinal
    • values are ordered
    • both equality and inequality comparisons are allowed
    • E.g., Education
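
The difference between nominal and ordinal attributes can be sketched in a few lines. The domains come from the slide; the helper names `same_value` and `education_less_than` are hypothetical, and representing the ordinal domain as an ordered list is just one possible choice.

```python
# Nominal domain: an unordered set, so only equality tests are meaningful.
SEX_DOMAIN = {"M", "F"}

# Ordinal domain: an ordered list, so values can also be ranked.
EDUCATION_DOMAIN = ["High School", "BS", "MS", "PhD"]


def same_value(a, b):
    """Equality comparison, valid for both nominal and ordinal attributes."""
    return a == b


def education_less_than(a, b):
    """Order comparison, valid only because Education is ordinal."""
    return EDUCATION_DOMAIN.index(a) < EDUCATION_DOMAIN.index(b)


print(same_value("M", "F"))
print(education_less_than("BS", "PhD"))
```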

Attributes Cont.
• Numeric Attributes
  – Has a real-valued or integer-valued domain
  – E.g., Age with domain(Age) = N, where N denotes the set of natural numbers (non-negative integers).

• Two types of numeric attributes
  – Discrete: values come from a finite or countably infinite set.
  – Continuous: values can take on any real value.

• Another classification
  – Interval-scaled
    • only differences between values make sense
    • E.g., temperature measured in °C or °F
  – Ratio-scaled
    • both differences and ratios are meaningful
    • E.g., Age

Algebraic View of Data
• If the d attributes in the data matrix D are all numeric,
  • each row can be considered as a d-dimensional point,
  • or equivalently, each row may be considered a d-dimensional column vector.

• Each point is a linear combination of the standard basis vectors: xi = xi1 e1 + xi2 e2 + · · · + xid ed

Example of Algebraic View of Data

Geometric View of Data

Distance and Angle

Example of Distance and Angle
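
The slide's own figures are not in the transcript, so the following is only a minimal sketch using the standard definitions: the Euclidean distance ‖a − b‖ and the angle θ with cos θ = aᵀb / (‖a‖ ‖b‖). The two vectors and all function names are assumptions chosen for illustration.

```python
import math


def dot(a, b):
    """Dot product a^T b."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean norm (length) of a."""
    return math.sqrt(dot(a, a))


def distance(a, b):
    """Euclidean distance between points a and b."""
    return norm([x - y for x, y in zip(a, b)])


def angle_degrees(a, b):
    """Angle between a and b, from the cosine formula."""
    cos_theta = dot(a, b) / (norm(a) * norm(b))
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding error
    return math.degrees(math.acos(cos_theta))


a = [5.0, 3.0]   # hypothetical points
b = [1.0, 4.0]
print(distance(a, b), angle_degrees(a, b))
```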

Mean and Total Variance

Centered Data Matrix

• The centered data matrix is obtained by subtracting the mean from all the points
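
A small sketch of the mean, the centered data matrix, and the total variance. It assumes the total variance is the average squared distance of the points from the mean (a 1/n normalization, which the transcript does not confirm); the data and function names are hypothetical.

```python
def mean_vector(D):
    """Column-wise mean of the data matrix."""
    n = len(D)
    return [sum(col) / n for col in zip(*D)]


def center(D):
    """Subtract the mean vector from every point (row)."""
    mu = mean_vector(D)
    return [[x - m for x, m in zip(row, mu)] for row in D]


def total_variance(D):
    """Average squared distance of the points from the mean (1/n assumed)."""
    Z = center(D)
    return sum(sum(z * z for z in row) for row in Z) / len(D)


D = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # hypothetical data
Z = center(D)
print(mean_vector(D), Z, total_variance(D))
```

After centering, the mean of `Z` is the zero vector, which is what makes the centered matrix convenient for covariance computations.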

Orthogonality

• Two vectors a and b are said to be orthogonal if and only if a^T b = 0.

• It implies that the angle between them is 90° or π/2 radians.

Orthogonal Projection

p: orthogonal projection of b on the vector a, p = (a^T b / a^T a) a; r = b − p: error vector between points b and p
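
The projection can be sketched with the standard formula p = (aᵀb / aᵀa) a and the error vector r = b − p. The vectors and the helper name `project` are hypothetical.

```python
def dot(a, b):
    """Dot product a^T b."""
    return sum(x * y for x, y in zip(a, b))


def project(b, a):
    """Return (p, r): the projection of b on a and the error vector r = b - p."""
    scale = dot(a, b) / dot(a, a)
    p = [scale * x for x in a]
    r = [bi - pi for bi, pi in zip(b, p)]
    return p, r


a = [2.0, 0.0]   # hypothetical vectors
b = [3.0, 4.0]
p, r = project(b, a)
print(p, r, dot(r, a))
```

The error vector is orthogonal to a, so `dot(r, a)` comes out as zero.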

Example of Projection

Linear Independence and Dimensionality

• span(v1, · · · , vk): the set of all possible linear combinations of the vectors v1, · · · , vk.

• If span(v1, · · · , vk) = R^m, then we say that v1, · · · , vk is a spanning set for R^m.

Row and Column Space

• The column space of D, denoted col(D), is the set of all linear combinations of the d column vectors or attributes.

• The row space of D, denoted row(D), is the set of all linear combinations of the n row vectors or points.

• Note also that the row space of D is the column space of D^T.

Linear Independence

Dimension and Rank
• Let S be a subspace of R^m.
• A basis for S: a set of linearly independent vectors v1, · · · , vk such that span(v1, · · · , vk) = S.
• Orthogonal basis for S: a basis whose vectors are pairwise orthogonal.
• If in addition they are normalized to be unit vectors, then they make up an orthonormal basis for S.
• For instance, the standard basis for R^m is an orthonormal basis consisting of the vectors e1, · · · , em, where ei has a 1 in position i and 0 elsewhere.

• Any two bases for S must have the same number of vectors.
• Dimension: the number of vectors in a basis for S, denoted as dim(S).
• For any matrix, the dimensions of its row space and column space are the same, and this dimension is also called the rank of the matrix.
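
One way to check that row rank and column rank agree is to count pivots after Gaussian elimination, on the matrix and on its transpose. The sketch below uses exact fractions to avoid rounding issues; the matrix and the helper names `rank` and `transpose` are hypothetical.

```python
from fractions import Fraction


def rank(matrix):
    """Rank via Gaussian elimination with exact rational arithmetic."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_cols = len(rows[0]) if rows else 0
    pivots = 0
    for col in range(n_cols):
        # Find a row at or below the current pivot row with a non-zero entry.
        pivot = next(
            (r for r in range(pivots, len(rows)) if rows[r][col] != 0), None
        )
        if pivot is None:
            continue
        rows[pivots], rows[pivot] = rows[pivot], rows[pivots]
        for r in range(pivots + 1, len(rows)):
            factor = rows[r][col] / rows[pivots][col]
            rows[r] = [x - factor * y for x, y in zip(rows[r], rows[pivots])]
        pivots += 1
    return pivots


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


D = [[1, 2, 3], [2, 4, 6], [1, 0, 1]]   # second row is twice the first
print(rank(D), rank(transpose(D)))
```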

Data: Probabilistic View

• Assumes that each numeric attribute Xj is a random variable, defined as a function that assigns a real number to each outcome of an experiment.

• Given as Xj : O → R, where O, the domain of Xj, is called the sample space,

• and R, the range of Xj, is the set of real numbers.

• If the outcomes are numeric and represent the observed values of the random variable, then Xj : O → O is simply the identity function: Xj(v) = v for all v ∈ O.

Data: Probabilistic View

• A random variable X is called a discrete random variable if it takes on only a finite or countably infinite number of values in its range.

• X is called a continuous random variable if it can take on any value in its range.

Example

• By default, consider the attribute X1 (sepal length) to be a continuous random variable, given as the identity function X1(v) = v, since the outcomes are all numeric.

• On the other hand, if we want to distinguish between iris flowers with short and long sepal lengths, we define a discrete random variable A as follows: A(v) = 0 if v < 7, and A(v) = 1 if v ≥ 7.

• In this case the domain of A is [4.3, 7.9]. The range of A is {0, 1}, and thus A assumes non-zero probability only at the discrete values 0 and 1.

Example: Bernoulli and Binomial Distribution

• Only 13 of the 150 irises have a sepal length of at least 7 cm, so P(A = 1) = 13/150 = 0.087 = p and P(A = 0) = 137/150 = 0.913 = 1 − p.

• In this case we say that A has a Bernoulli distribution with parameter p ∈ [0, 1]; p denotes the probability of a success, whereas 1 − p represents the probability of a failure.
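
A minimal sketch of the indicator variable and its Bernoulli probability mass function. The 7 cm threshold and the count of 13 long-sepal irises come from the slides; the total of 150 flowers (the usual size of the Iris data set) and the function names are assumptions.

```python
def A(sepal_length, threshold=7.0):
    """Indicator random variable: 1 for a long sepal, 0 for a short one."""
    return 1 if sepal_length >= threshold else 0


def bernoulli_pmf(x, p):
    """P(A = x) for a Bernoulli variable with success probability p."""
    if x == 1:
        return p
    if x == 0:
        return 1.0 - p
    return 0.0   # no probability mass outside {0, 1}


p = 13 / 150   # 13 long-sepal irises out of an assumed 150
print(A(7.2), A(5.1), round(p, 3), round(bernoulli_pmf(0, p), 3))
```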

Example: Bernoulli and Binomial Distribution

• Let us consider another discrete random variable B, denoting the number of irises with long sepal lengths in m independent Bernoulli trials with probability of success p.

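• B takes on the discrete values {0, 1, · · · , m}, and its probability mass function is given by the Binomial distribution: f(k) = P(B = k) = C(m, k) p^k (1 − p)^(m − k), where C(m, k) = m! / (k! (m − k)!).

• For example, taking p = 0.087 from above, the probability of observing exactly k = 2 long sepal length irises in m = 10 trials is given as f(2) = C(10, 2) (0.087)^2 (0.913)^8 ≈ 0.164.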
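
The binomial probability for this example can be checked numerically. The sketch uses the standard formula C(m, k) pᵏ (1 − p)ᵐ⁻ᵏ with the slide's values p = 0.087, m = 10, k = 2; the function name `binomial_pmf` is an assumption.

```python
import math


def binomial_pmf(k, m, p):
    """Probability of exactly k successes in m independent Bernoulli(p) trials."""
    if k < 0 or k > m:
        return 0.0
    return math.comb(m, k) * p**k * (1.0 - p) ** (m - k)


p, m = 0.087, 10
print(round(binomial_pmf(2, m, p), 3))

# The full probability mass function over k = 0, ..., m sums to 1.
full_pmf = [binomial_pmf(k, m, p) for k in range(m + 1)]
```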

full probability mass function for different values of k

Probability Density Function

• If X is continuous, its range is the entire set of real numbers R.
• Probability density function f: specifies the probability that the variable X takes on values in any interval [a, b] ⊂ R, namely P(X ∈ [a, b]) = ∫ f(x) dx taken from a to b.

Cumulative Distribution Function

• For any random variable X, whether discrete or continuous, we can define the cumulative distribution function (CDF) F : R → [0, 1], which gives the probability of observing a value of at most some given value x: F(x) = P(X ≤ x).

The following examples are from Andrew Moore

Probability Density Function f(x)

• What is P(X = x) when x is on a real domain?

f(x) ≥ 0 and ∫ f(x) dx = 1, with the integral taken over the whole real line

Normal Distribution
• Let us assume that these values follow a Gaussian or normal density function, given as f(x) = (1 / √(2πσ²)) exp(−(x − μ)² / (2σ²))
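
As a small sketch, the normal density with mean μ and standard deviation σ, and the matching cumulative distribution function, can be written with the standard library alone. The function names are assumptions; the CDF uses the error function `math.erf`.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std. deviation sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mu, sigma):
    """P(X <= x) for the same distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


print(normal_pdf(0.0, 0.0, 1.0), normal_cdf(0.0, 0.0, 1.0))
```

Note that the density at a point is not a probability; probabilities come from differences of the CDF, for example `normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)` for an interval [a, b].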

Bivariate Random Variables

• Consider a pair of attributes, X1 and X2, together as a bivariate random variable X = (X1, X2)^T.

In 2-Dimensions

Multivariate Random Variable

Multivariate Random Variable

Numeric Attribute Analysis

• Sample and Statistics

• Univariate Analysis

• Bivariate Analysis

• Multivariate Analysis

• Normal Distribution

Random Sample and Statistics

• Population: the set or universe of all entities under study.

• However, looking at the entire population may not be feasible, or may be too expensive.

• Instead, we draw a random sample from the population and compute appropriate statistics from the sample, which give estimates of the corresponding population parameters of interest.

Univariate Sample

• Let X be a random variable, and let xi (1 ≤ i ≤ n) denote the observed values of attribute X in the given data, where n is the data size.

• Given a random variable X, a random sample of size n from X is defined as a set of n independent and identically distributed (IID) random variables S1, S2, · · · , Sn.

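• Since the variables Si are all independent and distributed as X, their joint probability function is the product f(x1, · · · , xn) = fX(x1) fX(x2) · · · fX(xn), where fX is the probability mass or density function of X.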

Multivariate Sample

• xi: the value of a d-dimensional vector random variable Si = (X1,X2, · · · ,Xd ).

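• The Si are independent and identically distributed, and thus their joint distribution is the product f(x1, · · · , xn) = fX(x1) fX(x2) · · · fX(xn), where fX is the joint probability function of X = (X1, X2, · · · , Xd).

• If the d attributes X1, X2, · · · , Xd are also assumed to be independent, (1.43) can be rewritten as a product over all points i and all attributes j of the marginals fXj(xij).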

Statistic

• Let Si denote the random variable corresponding to data point xi; then a statistic θ̂ is a function θ̂ : (S1, S2, · · · , Sn) → R.

• If we use the value of a statistic to estimate a population parameter, this value is called a point estimate of the parameter, and the statistic is called an estimator of the parameter.
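
To make the terms concrete, the sketch below draws a random sample from a made-up population and uses the sample mean as the statistic. The population, the sample size of 30, and the fixed seed are all assumptions for illustration; draws are made with replacement so that they are independent and identically distributed.

```python
import random
import statistics

rng = random.Random(0)                    # fixed seed so the sketch is repeatable
population = list(range(1, 101))          # hypothetical population of 100 values
sample = rng.choices(population, k=30)    # IID draws (with replacement), n = 30


def sample_mean(values):
    """A statistic: a function mapping the sample to a single real number."""
    return sum(values) / len(values)


point_estimate = sample_mean(sample)              # estimate of the population mean
population_mean = statistics.mean(population)     # the parameter being estimated
print(point_estimate, population_mean)
```

Here `sample_mean` plays the role of the estimator and `point_estimate` is its value on one particular sample; a different sample would give a different estimate.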

Numeric Attribute Analysis

• Sample and Statistics

• Univariate Analysis

• Bivariate Analysis

• Multivariate Analysis

• Normal Distribution

Univariate Analysis

Univariate analysis focuses on a single attribute at a time; thus the data matrix D can be thought of as an n × 1 matrix, or simply a column vector.

Univariate Analysis

X is assumed to be a random variable, and each point xi (1 ≤ i ≤ n) is assumed to be the value of a random variable Si , where the variables Si are all independent and identically distributed as X, i.e., they constitute a random sample drawn from X.

In the vector view, we treat the sample as an n-dimensional vector, and write X ∈ R^n.

What can sample analysis do?

• Unknown f(X) and F(X)
• Parameters (μ, σ)

Empirical Cumulative Distribution Function

F̂(x) = (1/n) ∑ I(xi ≤ x), summed over i = 1, · · · , n

where I(xi ≤ x) = 1 if xi ≤ x, and 0 otherwise.
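
A minimal sketch of the empirical CDF as the fraction of sample points that are at most x. The sample values and the function name `ecdf` are hypothetical.

```python
def ecdf(sample, x):
    """Empirical CDF: fraction of observed values that are <= x."""
    return sum(1 for xi in sample if xi <= x) / len(sample)


sample = [5.0, 4.9, 6.3, 7.1, 5.0]   # hypothetical observations
print(ecdf(sample, 5.0))
```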

Inverse Cumulative Distribution Function
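
The slide's formula is not in the transcript. A common definition, assumed here, takes the inverse CDF (quantile function) at q to be the smallest x whose cumulative probability reaches q. Applied to the empirical CDF of a sample, that is the k-th smallest observation with k = ⌈q·n⌉; the function name and data are hypothetical.

```python
import math


def empirical_quantile(sample, q):
    """Smallest observed x such that the empirical CDF at x is at least q."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    ordered = sorted(sample)
    k = max(math.ceil(q * len(ordered)), 1)   # number of points needed to reach q
    return ordered[k - 1]


sample = [5.0, 4.9, 6.3, 7.1, 5.0]   # hypothetical observations
print(empirical_quantile(sample, 0.5))
```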

Empirical Probability Mass Function

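f̂(x) = P(X = x) = (1/n) ∑ I(xi = x), summed over i = 1, · · · , n

where I(xi = x) = 1 if xi = x, and 0 otherwise.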

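Measures of Central Tendency (Mean)

Population Mean: μ = E[X], that is ∑ x f(x) over all x if X is discrete, or ∫ x f(x) dx if X is continuous

Sample Mean (unbiased, not robust): μ̂ = (1/n) ∑ xi, summed over i = 1, · · · , n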

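Measures of Central Tendency (Median)

Population Median: the value m such that P(X ≤ m) ≥ 1/2 and P(X ≥ m) ≥ 1/2

or

F(m) = 0.5, i.e. m = F⁻¹(0.5)

Sample Median: F̂(m) = 0.5, i.e. m = F̂⁻¹(0.5)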

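Measures of Central Tendency (Mode)

Sample Mode: the value with the highest empirical probability mass, mode(X) = arg max f̂(x) over x

• It may not be a very useful measure of central tendency, but it is not affected much by outliers.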
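
The three measures of central tendency can be compared on a small sample with one extreme value, using Python's `statistics` module. The sample is hypothetical and chosen to show how the outlier pulls the mean but not the median or the mode.

```python
import statistics

sample = [1, 2, 2, 3, 100]   # hypothetical data with one outlier

mean = statistics.mean(sample)       # sensitive to the outlier
median = statistics.median(sample)   # middle value of the sorted sample
mode = statistics.mode(sample)       # most frequent value

print(mean, median, mode)
```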

Example

Measures of Dispersion (Range)

Range: r = max{X} − min{X}

Not robust, sensitive to extreme values

Sample Range: r̂ = max{xi} − min{xi}, taken over i = 1, · · · , n

Measures of Dispersion (Inter-Quartile Range)

Inter-Quartile Range (IQR): IQR = Q3 − Q1 = F⁻¹(0.75) − F⁻¹(0.25)

More robust than the range

Sample IQR: F̂⁻¹(0.75) − F̂⁻¹(0.25)
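
The sketch below contrasts the sample range with the sample IQR on data with and without an extreme value. It assumes the quartiles are read from the empirical inverse CDF (the smallest observation whose cumulative fraction reaches q); other quartile conventions give slightly different numbers. All names and data are hypothetical.

```python
import math


def empirical_quantile(sample, q):
    """Smallest observed x whose empirical cumulative fraction is at least q."""
    ordered = sorted(sample)
    k = max(math.ceil(q * len(ordered)), 1)
    return ordered[k - 1]


def sample_range(sample):
    return max(sample) - min(sample)


def sample_iqr(sample):
    return empirical_quantile(sample, 0.75) - empirical_quantile(sample, 0.25)


clean = [1, 2, 3, 4, 5, 6, 7, 8]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 100]   # largest value replaced by an outlier

print(sample_range(clean), sample_range(with_outlier))
print(sample_iqr(clean), sample_iqr(with_outlier))
```

The range jumps when the outlier is introduced, while the IQR stays the same.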

Measures of Dispersion (Variance and Standard Deviation)

Standard Deviation: σ = √var(X), the positive square root of the variance

Variance: σ² = var(X) = E[(X − μ)²]

Measures of Dispersion (Variance and Standard Deviation)

Standard Deviation: σ = √var(X), the positive square root of the variance

Variance: σ² = var(X) = E[(X − μ)²]

Sample Variance & Standard Deviation:
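
The transcript does not show which normalization the slide uses for the sample variance, so this sketch computes both common choices with the `statistics` module: dividing by n (`pvariance`, `pstdev`) and dividing by n − 1 (`variance`, `stdev`). The sample is hypothetical.

```python
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data, mean 5

var_n = statistics.pvariance(sample)    # average squared deviation (divide by n)
std_n = statistics.pstdev(sample)       # its square root
var_n1 = statistics.variance(sample)    # divide by n - 1 (unbiased estimator)
std_n1 = statistics.stdev(sample)

print(var_n, std_n, var_n1, std_n1)
```

The two versions differ by the factor n / (n − 1), which matters for small samples and becomes negligible as n grows.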

Normalization

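Z-Score: xi' = (xi − μ̂) / σ̂

Linear Normalization: xi' = (xi − min) / (max − min), where min and max are the smallest and largest observed values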
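
Both normalizations can be sketched in a few lines. The z-score version below assumes the standard deviation with 1/n normalization (`statistics.pstdev`); the linear version rescales values to [0, 1] using the observed minimum and maximum. Function names and data are hypothetical.

```python
import statistics


def z_scores(values):
    """Standardize: subtract the mean, divide by the standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)   # 1/n normalization assumed
    if sigma == 0:
        raise ValueError("all values are identical; z-score is undefined")
    return [(x - mu) / sigma for x in values]


def linear_normalize(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; range is zero")
    return [(x - lo) / (hi - lo) for x in values]


sample = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data
print(z_scores(sample))
print(linear_normalize(sample))
```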

Normalization Example

Topics

• Sample and Statistics

• Univariate Analysis

• Bivariate Analysis

• Multivariate Analysis

• Normal Distribution

Bivariate Analysis

Bivariate analysis focuses on two attributes at a time; thus the data matrix D can be thought of as an n × 2 matrix, or two column vectors.

Empirical Joint Probability Mass Function

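f̂(x) = P(X = x) = (1/n) ∑ I(xi = x), summed over i = 1, · · · , n

or

f̂(x1, x2) = P(X1 = x1, X2 = x2) = (1/n) ∑ I(xi1 = x1, xi2 = x2)

where the indicator I equals 1 if point xi matches in both attributes, and 0 otherwise.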

Measures of Central Tendency (Mean)

Population Mean: μ = E[X] = (E[X1], E[X2])^T = (μ1, μ2)^T

Sample Mean: μ̂ = (1/n) ∑ xi, summed over the points i = 1, · · · , n

Measures of Association (Covariance)

Covariance: σ12 = E[(X1 − μ1)(X2 − μ2)] = E[X1 X2] − E[X1] E[X2]

Sample Covariance:

Measures of Association (Correlation)

Correlation: ρ12 = σ12 / (σ1 σ2)

Sample Correlation: ρ̂12 = σ̂12 / (σ̂1 σ̂2)
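
A short sketch of sample covariance and correlation for two attributes. The covariance here divides by n, which is an assumption (dividing by n − 1 is also common); the correlation does not depend on that choice because the factor cancels. Data and function names are hypothetical.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(x, y):
    """Average product of deviations from the means (1/n assumed)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def correlation(x, y):
    """Covariance divided by the product of the standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))


x1 = [1, 2, 3, 4, 5]
x2 = [2, 4, 6, 8, 10]   # increases with x1
x3 = [5, 4, 3, 2, 1]    # decreases with x1

print(covariance(x1, x2), correlation(x1, x2), correlation(x1, x3))
```

Correlation always lies in [−1, 1], which makes it easier to compare across attribute pairs than the raw covariance.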

Measures of Association (Correlation)

Correlation Example

Topics

• Sample and Statistics

• Univariate Analysis

• Bivariate Analysis

• Multivariate Analysis

• Normal Distribution

Multivariate Analysis

Multivariate analysis focuses on multiple attributes at a time; thus the data matrix D can be thought of as an n × d matrix, or d column vectors.

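Measures of Central Tendency (Mean)

Population Mean: μ = E[X] = (E[X1], · · · , E[Xd])^T = (μ1, · · · , μd)^T

Sample Mean: μ̂ = (1/n) ∑ xi, summed over the points i = 1, · · · , n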

Measures of Association (Covariance Matrix)
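
The slide's formula is missing from the transcript. As a sketch under the usual definition, the sample covariance matrix collects the covariance of every pair of attributes and can be computed from the centered data matrix Z as ZᵀZ / n. The 1/n normalization, the data, and the function name are assumptions.

```python
def covariance_matrix(D):
    """d x d matrix of pairwise attribute covariances (1/n assumed)."""
    n, d = len(D), len(D[0])
    mu = [sum(col) / n for col in zip(*D)]
    Z = [[x - m for x, m in zip(row, mu)] for row in D]   # centered data
    return [
        [sum(Z[i][a] * Z[i][b] for i in range(n)) / n for b in range(d)]
        for a in range(d)
    ]


D = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # hypothetical data
S = covariance_matrix(D)
print(S)
```

The diagonal entries are the attribute variances, and the matrix is symmetric because the covariance of attributes a and b equals that of b and a.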

Measures of Association (Correlation)

Correlation: ρij = σij / (σi σj) for each pair of attributes Xi and Xj

Sample Correlation: ρ̂ij = σ̂ij / (σ̂i σ̂j)

Topics

• Sample and Statistics

• Univariate Analysis

• Bivariate Analysis

• Multivariate Analysis

• Normal Distribution

Univariate Normal Distribution

Multivariate Normal Distribution
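
The slide content is not in the transcript, so this is only a sketch of the standard multivariate normal density, f(x) = exp(−(x − μ)ᵀ Σ⁻¹ (x − μ) / 2) / ((2π)^(d/2) √|Σ|), written out for the two-dimensional case so that the 2 × 2 inverse and determinant can be computed by hand. The function name and the test values are assumptions.

```python
import math


def bivariate_normal_pdf(x, mu, sigma):
    """Density of a 2-D normal with mean mu and 2x2 covariance matrix sigma."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance matrix must have a positive determinant")
    # Inverse of a 2x2 matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu).
    maha = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    # For d = 2 the normalizing constant (2*pi)^(d/2) * sqrt(det) is 2*pi*sqrt(det).
    return math.exp(-0.5 * maha) / (2.0 * math.pi * math.sqrt(det))


identity = [[1.0, 0.0], [0.0, 1.0]]
print(bivariate_normal_pdf([0.0, 0.0], [0.0, 0.0], identity))
```

With the identity covariance the density factors into two independent standard normal densities, which gives a simple way to sanity-check the function.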

• Thank You!
