data mining taylor statistics 202: data miningjtaylo/courses/stats202/restricted/note… · data...

20
Statistics 202: Data Mining c Jonathan Taylor Statistics 202: Data Mining Other datatypes, preprocessing Based in part on slides from textbook, slides of Susan Holmes c Jonathan Taylor October 3, 2012 1/1

Upload: others

Post on 10-Aug-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Mining Taylor Statistics 202: Data Miningjtaylo/courses/stats202/restricted/note… · Data Mining c Jonathan Taylor Other data types Graph data Nodes on the graph might be facebook

Statistics 202:Data Mining

c©JonathanTaylor

Statistics 202: Data MiningOther datatypes, preprocessing

Based in part on slides from textbook, slides of Susan Holmes

c©Jonathan Taylor

October 3, 2012

1 / 1

Page 2: Data Mining Taylor Statistics 202: Data Miningjtaylo/courses/stats202/restricted/note… · Data Mining c Jonathan Taylor Other data types Graph data Nodes on the graph might be facebook

Statistics 202:Data Mining

c©JonathanTaylor

Other datatypes

Document data

You might start with a collection of n documents (i.e. allof Shakespeare’s works in some digital format; all ofWikileaks’ U.S. Embassy Cables).

This is not a data matrix . . .

Given p terms of interest: {Al Qaeda, Iran, Iraq, etc.} onecan form a term-document matrix filled with counts.

2 / 1

Page 3: Data Mining Taylor Statistics 202: Data Miningjtaylo/courses/stats202/restricted/note… · Data Mining c Jonathan Taylor Other data types Graph data Nodes on the graph might be facebook

Statistics 202:Data Mining

c©JonathanTaylor

Other datatypes

Document data

Document ID Al Qaeda Iran Iraq . . .

1 0 0 3 . . ....

......

......

250000 2 1 1 . . .

3 / 1

Page 4: Data Mining Taylor Statistics 202: Data Miningjtaylo/courses/stats202/restricted/note… · Data Mining c Jonathan Taylor Other data types Graph data Nodes on the graph might be facebook

Statistics 202:Data Mining

c©JonathanTaylor

Other datatypes

Time series

Imagine recording the minute-by-minute prices of allstocks in the S & P 500 for last 200 days of trading.

The data can be represented by a 78000 × 500 matrix.

BUT, there is definite structure across the rows of thismatrix.

They are not unrelated “cases” like they might be in otherapplications.

4 / 1

Page 5: Data Mining Taylor Statistics 202: Data Miningjtaylo/courses/stats202/restricted/note… · Data Mining c Jonathan Taylor Other data types Graph data Nodes on the graph might be facebook

Statistics 202:Data Mining

c©JonathanTaylor

Transaction data

5 / 1

Page 6: Data Mining Taylor Statistics 202: Data Miningjtaylo/courses/stats202/restricted/note… · Data Mining c Jonathan Taylor Other data types Graph data Nodes on the graph might be facebook

Statistics 202:Data Mining

c©JonathanTaylor

Social media data

6 / 1

Page 7: Data Mining Taylor Statistics 202: Data Miningjtaylo/courses/stats202/restricted/note… · Data Mining c Jonathan Taylor Other data types Graph data Nodes on the graph might be facebook

Statistics 202:Data Mining

c©JonathanTaylor

Graph data

7 / 1

Page 8: Data Mining Taylor Statistics 202: Data Miningjtaylo/courses/stats202/restricted/note… · Data Mining c Jonathan Taylor Other data types Graph data Nodes on the graph might be facebook

Statistics 202:Data Mining

c©JonathanTaylor

Other data types

Graph data

Nodes on the graph might be facebook users, or publicpages.

Weights on the edges could be number of messages sentin a prespecified period.

If you take weekly intervals, this leads to a sequence ofgraphs

Gi = communication over i-th week.

How this graph changes is of obvious interest . . . .

Even structure of just one graph is of interest – we’ll comeback to this when we talk about spectral methods . . .

8 / 1

Page 9: Data Mining Taylor Statistics 202: Data Miningjtaylo/courses/stats202/restricted/note… · Data Mining c Jonathan Taylor Other data types Graph data Nodes on the graph might be facebook

Statistics 202:Data Mining

c©JonathanTaylor

Data quality

Some issues to keep in mind

Is the data experimental or observational?

If observational, what do we know about the datagenerating mechanism? For example, although the S&P500 example can be represented as a data matrix, there isclearly structure across rows.

General quality issues:

How much of the data missing? Is missingness informative?Is it very noisy? Are there outliers?Are there a large number of duplicates?

9 / 1

Page 10: Data Mining Taylor Statistics 202: Data Miningjtaylo/courses/stats202/restricted/note… · Data Mining c Jonathan Taylor Other data types Graph data Nodes on the graph might be facebook

Statistics 202:Data Mining

c©JonathanTaylor

Preprocessing

General procedures

Aggregation Combining features into a new feature. Example:pooling county-level data to state-level data.

Discretization Breaking up a continuous variable (or a set ofcontinuous variables) into an ordinal (or nominal)discrete variable.

Transformation Simple transformation of feature (log or exp)or mapping to a new space (Fourier transform /power spectrum, wavelet transform).

10 / 1

Page 11: Data Mining Taylor Statistics 202: Data Miningjtaylo/courses/stats202/restricted/note… · Data Mining c Jonathan Taylor Other data types Graph data Nodes on the graph might be facebook

Statistics 202:Data Mining

c©JonathanTaylor

A continuous variable that could be discretized

4 2 0 2 4

0.0

0.2

0.4

0.6

0.8

1.0

11 / 1

Page 12: Data Mining Taylor Statistics 202: Data Miningjtaylo/courses/stats202/restricted/note… · Data Mining c Jonathan Taylor Other data types Graph data Nodes on the graph might be facebook

Statistics 202:Data Mining

c©JonathanTaylor

Discretization by fixed width

4 2 0 2 4

0.0

0.2

0.4

0.6

0.8

1.0

12 / 1

Page 13: Data Mining Taylor Statistics 202: Data Miningjtaylo/courses/stats202/restricted/note… · Data Mining c Jonathan Taylor Other data types Graph data Nodes on the graph might be facebook

Statistics 202:Data Mining

c©JonathanTaylor

Discretization by fixed quartile

4 2 0 2 4

0.0

0.2

0.4

0.6

0.8

1.0

13 / 1

Page 14: Data Mining Taylor Statistics 202: Data Miningjtaylo/courses/stats202/restricted/note… · Data Mining c Jonathan Taylor Other data types Graph data Nodes on the graph might be facebook

Statistics 202:Data Mining

c©JonathanTaylor

Discretization by clustering

4 2 0 2 4

0.0

0.2

0.4

0.6

0.8

1.0

14 / 1

Page 15: Data Mining Taylor Statistics 202: Data Miningjtaylo/courses/stats202/restricted/note… · Data Mining c Jonathan Taylor Other data types Graph data Nodes on the graph might be facebook

Statistics 202:Data Mining

c©JonathanTaylor

Variable transformation: bacterial decay

15 / 1

Page 16: Data Mining Taylor Statistics 202: Data Miningjtaylo/courses/stats202/restricted/note… · Data Mining c Jonathan Taylor Other data types Graph data Nodes on the graph might be facebook

Statistics 202:Data Mining

c©JonathanTaylor

Variable transformation: bacterial decay

16 / 1

Page 17: Data Mining Taylor Statistics 202: Data Miningjtaylo/courses/stats202/restricted/note… · Data Mining c Jonathan Taylor Other data types Graph data Nodes on the graph might be facebook

Statistics 202:Data Mining

c©JonathanTaylor

BMW daily returns (fEcofin package)

17 / 1

Page 18: Data Mining Taylor Statistics 202: Data Miningjtaylo/courses/stats202/restricted/note… · Data Mining c Jonathan Taylor Other data types Graph data Nodes on the graph might be facebook

Statistics 202:Data Mining

c©JonathanTaylor

ACF of BMW daily returns

18 / 1

Page 19: Data Mining Taylor Statistics 202: Data Miningjtaylo/courses/stats202/restricted/note… · Data Mining c Jonathan Taylor Other data types Graph data Nodes on the graph might be facebook

Statistics 202:Data Mining

c©JonathanTaylor

Discrete wavelet transform of BMW daily returns

19 / 1

Page 20: Data Mining Taylor Statistics 202: Data Miningjtaylo/courses/stats202/restricted/note… · Data Mining c Jonathan Taylor Other data types Graph data Nodes on the graph might be facebook

Statistics 202:Data Mining

c©JonathanTaylor

20 / 1