TECHNIQUES FOR BIG DATA:
FEATURE EXTRACTION USING DISTANCE COVARIANCE BASED PCA
Big Data
'Big Data' is a blanket term for any collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
Big data requires exceptional technologies to efficiently process large quantities of data within tolerable elapsed times. A 2011 McKinsey report suggests suitable technologies include crowdsourcing, data fusion and integration, genetic algorithms, machine learning, natural language processing, signal processing, simulation, time series analysis and visualization.
How Big is Big Data?
Very large, distributed aggregations of loosely structured data, often incomplete and inaccessible:
Petabytes/exabytes of data
Millions/billions of people
Billions/trillions of records
Loosely structured and often distributed data
Flat schemas with few complex interrelationships
Often involving time-stamped events
Often made up of incomplete data
Often including connections between data elements that must be probabilistically inferred
Applications that involve Big Data can be transactional (e.g., Facebook, PhotoBox) or analytic (e.g., ClickFox, Merced Applications).
(Reference: Wikibon.org)
Big Data
Big Data Can be of three types:
1. Large number of attributes (>16)
2. Large number of samples
3. Large number of both attributes and samples
I have focused on the first case.
What is Dimensionality Reduction?
Dimensionality reduction (or dimension reduction) is the process of reducing the number of random variables under consideration (also called attributes, features or descriptors), and can be divided into feature selection and feature extraction.
Feature Selection
Filters: rank features by a simple statistic, e.g. Pearson's correlation with the target.
Wrappers: run a classifier repeatedly, each time with a new set of features selected using backward elimination or forward selection.
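As an illustration of the filter approach, the Pearson-correlation ranking above can be sketched in a few lines of NumPy. The function name and toy data are illustrative, not from the slides:

```python
import numpy as np

def pearson_filter(X, y, k):
    """Keep the k features whose |Pearson correlation| with the target is largest."""
    r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return np.argsort(-np.abs(r))[:k]

# Toy data: feature 0 tracks the target closely, feature 1 is pure noise.
rng = np.random.default_rng(0)
y = rng.normal(size=100)
X = np.column_stack([y + 0.1 * rng.normal(size=100), rng.normal(size=100)])
keep = pearson_filter(X, y, k=1)   # selects feature 0
```

A wrapper would instead score each candidate subset by retraining the classifier, which is far more expensive but captures feature interactions.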
Feature Extraction
Feature extraction transforms the data in the high-dimensional space to a space of fewer dimensions. The data transformation may be linear, as in principal component analysis (PCA), but many nonlinear dimensionality reduction techniques also exist. For multidimensional data, tensor representation can be used in dimensionality reduction through multilinear subspace learning.
Feature Extraction
The main linear technique for dimensionality reduction, principal component analysis, performs a linear mapping of the data to a lower-dimensional space in such a way that the variance of the data in the low-dimensional representation is maximized.
What is Principal Component Analysis?
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to (i.e., uncorrelated with) the preceding components. Principal components are guaranteed to be independent if the data set is jointly normally distributed. PCA is sensitive to the relative scaling of the original variables.
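A minimal NumPy sketch of the procedure just described: centre the data, eigendecompose the covariance matrix, and project onto the eigenvectors with the largest eigenvalues. This is a generic illustration, not code from the slides:

```python
import numpy as np

def pca(X, k):
    """Project X onto its k principal components (eigenvectors of the
    covariance matrix with the largest eigenvalues)."""
    Xc = X - X.mean(axis=0)               # centre each variable
    C = np.cov(Xc, rowvar=False)          # covariance matrix of the features
    w, V = np.linalg.eigh(C)              # eigh returns ascending eigenvalues
    top = np.argsort(w)[::-1][:k]         # indices of the k largest eigenvalues
    return Xc @ V[:, top]

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
Z = pca(X, k=2)                           # first column has the largest variance
```

Because the components are ordered by eigenvalue, the first column of the output captures the most variance, the second the most remaining variance, and so on.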
That is fine, but show me the MATH!
Online tutorial: http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
PCA and BIG DATA
Big Data sets containing thousands of attributes require a lot of computation time on an average computer. PCA becomes an important tool when drawing inference from such large data sets.
What is Distance Correlation?
Distance correlation is a measure of statistical dependence between two random variables or two random vectors of arbitrary, not necessarily equal dimension. An important property is that this measure of dependence is zero if and only if the random variables are statistically independent. This measure is derived from a number of other quantities that are used in its specification, specifically: distance variance, distance standard deviation and distance covariance. These take the same roles as the ordinary moments with corresponding names in the specification of the Pearson product-moment correlation coefficient.
Distance Covariance Solved Example
Sample Data
        Column 1   Column 2
        1          1
        2          0
        -1         2
        0          3
Mean    0.5        1.5
Distances

For Column 1: a_ij = sqrt(|a_i^2 - a_j^2|). (In one dimension the Euclidean distance would be |a_i - a_j|; the values below follow the slide's sqrt(|a_i^2 - a_j^2|) formula.)

0     1.73  0     1
1.73  0     1.73  2
0     1.73  0     1
1     2     1     0

Row means: 0.68, 1.37, 0.68, 1.00 (the matrix is symmetric, so the column means are the same).

Grand mean: 0.933
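The matrix above can be reproduced with NumPy, using the slide's sqrt(|a_i^2 - a_j^2|) formula (a sketch; variable names are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, -1.0, 0.0])    # Column 1 of the sample data
s = x ** 2
# Slide's formula: a_ij = sqrt(|a_i^2 - a_j^2|)
d = np.sqrt(np.abs(s[:, None] - s[None, :]))
row_means = d.mean(axis=1)              # approx. [0.683, 1.366, 0.683, 1.0]
grand_mean = d.mean()                   # approx. 0.933
```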
Similarly

Distances for Column 2 (b_ij):

0     1     1.73  2.83
1     0     2     3
1.73  2     0     2.24
2.83  3     2.24  0

Row means: 1.39, 1.50, 1.49, 2.02 (column means are the same).

Grand mean: 1.60
Centering both the columns

A_ij = a_ij - ā_i - ā_j + ā

where
ā_i = mean of row i
ā_j = mean of column j
ā = grand mean of the matrix
A_ij

-0.433   0.616  -0.433   0.250
 0.616  -1.799   0.616   0.567
-0.433   0.616  -0.433   0.250
 0.250   0.567   0.250  -1.067
Similarly

We can calculate B_ij from the Column 2 distances.

The squared sample distance covariance is then

dCov^2(X, Y) = (1/n^2) * Σ_ij A_ij * B_ij
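Putting the steps of the worked example together (pairwise distances, double centering, then averaging A_ij * B_ij), a NumPy sketch using the slide's distance formula; function names are illustrative:

```python
import numpy as np

def pairwise_dist(v):
    """Slide's distance: sqrt(|v_i^2 - v_j^2|)."""
    s = v.astype(float) ** 2
    return np.sqrt(np.abs(s[:, None] - s[None, :]))

def double_center(d):
    """A_ij = d_ij - row mean_i - column mean_j + grand mean."""
    return d - d.mean(axis=1, keepdims=True) - d.mean(axis=0) + d.mean()

x = np.array([1, 2, -1, 0])            # Column 1
y = np.array([1, 0, 2, 3])             # Column 2
A = double_center(pairwise_dist(x))
B = double_center(pairwise_dist(y))
dcov2 = (A * B).sum() / len(x) ** 2    # squared sample distance covariance
```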
Distance Covariance Principal Component Analysis

After we have obtained the distance covariance matrix, we can find the eigenvectors with the largest eigenvalues and use those eigenvectors to extract new features.

Multiplying the original dataset by these eigenvectors generates the reduced dataset.
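A minimal sketch of the D-PCA pipeline just described: build the feature-by-feature distance covariance matrix, take its leading eigenvectors, and project the data onto them. Note this sketch uses the standard |u_i - u_j| distance inside dCov^2 rather than the slide's variant, and all names are illustrative:

```python
import numpy as np

def dcov2(u, v):
    """Squared sample distance covariance between two 1-D samples."""
    def centred(w):
        d = np.abs(w[:, None] - w[None, :])   # standard pairwise distances
        return d - d.mean(axis=1, keepdims=True) - d.mean(axis=0) + d.mean()
    return (centred(u) * centred(v)).mean()

def dpca(X, k):
    """Project X onto the k leading eigenvectors of the matrix of pairwise
    distance covariances between features."""
    p = X.shape[1]
    S = np.array([[dcov2(X[:, i], X[:, j]) for j in range(p)] for i in range(p)])
    w, V = np.linalg.eigh(S)
    return X @ V[:, np.argsort(w)[::-1][:k]]

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 6)).astype(float)
Z = dpca(X, k=3)
```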
PCA vs D-PCA
The classical measure of dependence, the Pearson correlation coefficient, is mainly sensitive to a linear relationship between two variables. Distance correlation was introduced in 2005 by Gábor J. Székely in several lectures to address this deficiency of Pearson's correlation, namely that it can easily be zero for dependent variables. Correlation = 0 (uncorrelatedness) does not imply independence, while distance correlation = 0 does imply independence. The first results on distance correlation were published in 2007 and 2009.
Confusion Matrix
Modifications of D-PCA
1. sqrt(|a_i^2 - a_j^2|) / (a_i + a_j)
2. sqrt(|a_i^2 - a_j^2|) / a_i
These modifications can be used to scale the data, which can then eliminate the normalization step.
Results
Drawbacks
Cannot handle time series data
Cannot handle noisy data
Assumes data distribution to be normal
Sensitive to scaling of the data
Future work
Rank correlation
Distance based source separation