data mining taylor statistics 202: data miningjtaylo/courses/stats202/restricted/note… · data...
TRANSCRIPT
Statistics 202:Data Mining
c©JonathanTaylor
Statistics 202: Data MiningOther datatypes, preprocessing
Based in part on slides from textbook, slides of Susan Holmes
c©Jonathan Taylor
October 3, 2012
1 / 1
Statistics 202:Data Mining
c©JonathanTaylor
Other datatypes
Document data
You might start with a collection of n documents (i.e. allof Shakespeare’s works in some digital format; all ofWikileaks’ U.S. Embassy Cables).
This is not a data matrix . . .
Given p terms of interest: {Al Qaeda, Iran, Iraq, etc.} onecan form a term-document matrix filled with counts.
2 / 1
Statistics 202:Data Mining
c©JonathanTaylor
Other datatypes
Document data
Document ID Al Qaeda Iran Iraq . . .
1 0 0 3 . . ....
......
......
250000 2 1 1 . . .
3 / 1
Statistics 202:Data Mining
c©JonathanTaylor
Other datatypes
Time series
Imagine recording the minute-by-minute prices of allstocks in the S & P 500 for last 200 days of trading.
The data can be represented by a 78000 × 500 matrix.
BUT, there is definite structure across the rows of thismatrix.
They are not unrelated “cases” like they might be in otherapplications.
4 / 1
Statistics 202:Data Mining
c©JonathanTaylor
Transaction data
5 / 1
Statistics 202:Data Mining
c©JonathanTaylor
Social media data
6 / 1
Statistics 202:Data Mining
c©JonathanTaylor
Graph data
7 / 1
Statistics 202:Data Mining
c©JonathanTaylor
Other data types
Graph data
Nodes on the graph might be facebook users, or publicpages.
Weights on the edges could be number of messages sentin a prespecified period.
If you take weekly intervals, this leads to a sequence ofgraphs
Gi = communication over i-th week.
How this graph changes is of obvious interest . . . .
Even structure of just one graph is of interest – we’ll comeback to this when we talk about spectral methods . . .
8 / 1
Statistics 202:Data Mining
c©JonathanTaylor
Data quality
Some issues to keep in mind
Is the data experimental or observational?
If observational, what do we know about the datagenerating mechanism? For example, although the S&P500 example can be represented as a data matrix, there isclearly structure across rows.
General quality issues:
How much of the data missing? Is missingness informative?Is it very noisy? Are there outliers?Are there a large number of duplicates?
9 / 1
Statistics 202:Data Mining
c©JonathanTaylor
Preprocessing
General procedures
Aggregation Combining features into a new feature. Example:pooling county-level data to state-level data.
Discretization Breaking up a continuous variable (or a set ofcontinuous variables) into an ordinal (or nominal)discrete variable.
Transformation Simple transformation of feature (log or exp)or mapping to a new space (Fourier transform /power spectrum, wavelet transform).
10 / 1
Statistics 202:Data Mining
c©JonathanTaylor
A continuous variable that could be discretized
4 2 0 2 4
0.0
0.2
0.4
0.6
0.8
1.0
11 / 1
Statistics 202:Data Mining
c©JonathanTaylor
Discretization by fixed width
4 2 0 2 4
0.0
0.2
0.4
0.6
0.8
1.0
12 / 1
Statistics 202:Data Mining
c©JonathanTaylor
Discretization by fixed quartile
4 2 0 2 4
0.0
0.2
0.4
0.6
0.8
1.0
13 / 1
Statistics 202:Data Mining
c©JonathanTaylor
Discretization by clustering
4 2 0 2 4
0.0
0.2
0.4
0.6
0.8
1.0
14 / 1
Statistics 202:Data Mining
c©JonathanTaylor
Variable transformation: bacterial decay
15 / 1
Statistics 202:Data Mining
c©JonathanTaylor
Variable transformation: bacterial decay
16 / 1
Statistics 202:Data Mining
c©JonathanTaylor
BMW daily returns (fEcofin package)
17 / 1
Statistics 202:Data Mining
c©JonathanTaylor
ACF of BMW daily returns
18 / 1
Statistics 202:Data Mining
c©JonathanTaylor
Discrete wavelet transform of BMW daily returns
19 / 1
Statistics 202:Data Mining
c©JonathanTaylor
20 / 1