TRANSCRIPT
Statistical Data Mining - 1
Edward J. Wegman
A Short Course for Interface ‘01
Outline of Lecture
- Complexity
- Data Mining: What is it?
- Data Preparation
Complexity

The Huber Taxonomy of Data Set Sizes

| Descriptor | Data Set Size in Bytes | Storage Mode |
|---|---|---|
| Tiny | 10^2 | Piece of Paper |
| Small | 10^4 | A Few Pieces of Paper |
| Medium | 10^6 | A Floppy Disk |
| Large | 10^8 | Hard Disk |
| Huge | 10^10 | Multiple Hard Disks, e.g. RAID Storage |
| Massive | 10^12 | Robotic Magnetic Tape Storage Silos |
Complexity

Algorithmic Complexity

| Complexity | Example Operations |
|---|---|
| O(r), O(n^1/2) | Plot a scatterplot |
| O(n) | Calculate means, variances, kernel density estimates |
| O(n log(n)) | Calculate fast Fourier transforms |
| O(nc) | Calculate the singular value decomposition of an r × c matrix; solve a multiple linear regression |
| O(n^2) | Solve most clustering algorithms |
Complexity

Table 2: Number of Operations for Algorithms of Various Computational Complexities and Various Data Set Sizes

| n | n^1/2 | n | n log(n) | n^3/2 | n^2 |
|---|---|---|---|---|---|
| tiny | 10 | 10^2 | 2x10^2 | 10^3 | 10^4 |
| small | 10^2 | 10^4 | 4x10^4 | 10^6 | 10^8 |
| medium | 10^3 | 10^6 | 6x10^6 | 10^9 | 10^12 |
| large | 10^4 | 10^8 | 8x10^8 | 10^12 | 10^16 |
| huge | 10^5 | 10^10 | 10^11 | 10^15 | 10^20 |
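The entries of Table 2 can be reproduced directly. A minimal Python sketch, assuming the table takes logarithms base 10 (which matches entries such as n log(n) = 4x10^4 for n = 10^4):

```python
import math

# Huber's taxonomy (Table 1): data set sizes in bytes.
sizes = {"tiny": 1e2, "small": 1e4, "medium": 1e6, "large": 1e8, "huge": 1e10}

# Complexity classes from Table 2.  Logs are base 10, which is what
# reproduces entries such as n log(n) = 4x10^4 for n = 10^4.
classes = {
    "n^1/2":    lambda n: n ** 0.5,
    "n":        lambda n: n,
    "n log(n)": lambda n: n * math.log10(n),
    "n^3/2":    lambda n: n ** 1.5,
    "n^2":      lambda n: n ** 2,
}

for label, n in sizes.items():
    row = "  ".join(f"{name}={f(n):.3g}" for name, f in classes.items())
    print(f"{label:6s} {row}")
```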
Complexity

Table 4: Computational Feasibility on a Pentium PC (10 megaflop performance assumed)

| n | n^1/2 | n | n log(n) | n^3/2 | n^2 |
|---|---|---|---|---|---|
| tiny | 10^-6 seconds | 10^-5 seconds | 2x10^-5 seconds | .0001 seconds | .001 seconds |
| small | 10^-5 seconds | .001 seconds | .004 seconds | .1 seconds | 10 seconds |
| medium | .0001 seconds | .1 seconds | .6 seconds | 1.67 minutes | 1.16 days |
| large | .001 seconds | 10 seconds | 1.3 minutes | 1.16 days | 31.7 years |
| huge | .01 seconds | 16.7 minutes | 2.78 hours | 3.17 years | 317,000 years |
Complexity

Table 5: Computational Feasibility on a Silicon Graphics Onyx Workstation (300 megaflop performance assumed)

| n | n^1/2 | n | n log(n) | n^3/2 | n^2 |
|---|---|---|---|---|---|
| tiny | 3.3x10^-8 seconds | 3.3x10^-7 seconds | 6.7x10^-7 seconds | 3.3x10^-6 seconds | 3.3x10^-5 seconds |
| small | 3.3x10^-7 seconds | 3.3x10^-5 seconds | 1.3x10^-4 seconds | 3.3x10^-3 seconds | .33 seconds |
| medium | 3.3x10^-6 seconds | 3.3x10^-3 seconds | .02 seconds | 3.3 seconds | 55 minutes |
| large | 3.3x10^-5 seconds | .33 seconds | 2.7 seconds | 55 minutes | 1.04 years |
| huge | 3.3x10^-4 seconds | 33 seconds | 5.5 minutes | 38.2 days | 10,464 years |
Complexity

Table 6: Computational Feasibility on an Intel Paragon XP/S A4 (4.2 gigaflop performance assumed)

| n | n^1/2 | n | n log(n) | n^3/2 | n^2 |
|---|---|---|---|---|---|
| tiny | 2.4x10^-9 seconds | 2.4x10^-8 seconds | 4.8x10^-8 seconds | 2.4x10^-7 seconds | 2.4x10^-6 seconds |
| small | 2.4x10^-8 seconds | 2.4x10^-6 seconds | 9.5x10^-6 seconds | 2.4x10^-4 seconds | .024 seconds |
| medium | 2.4x10^-7 seconds | 2.4x10^-4 seconds | .0014 seconds | .24 seconds | 4.0 minutes |
| large | 2.4x10^-6 seconds | .024 seconds | .19 seconds | 4.0 minutes | 27.8 days |
| huge | 2.4x10^-5 seconds | 2.4 seconds | 24 seconds | 66.7 hours | 761 years |
Complexity

Table 7: Computational Feasibility on a Teraflop Grand Challenge Computer (1000 gigaflop performance assumed)

| n | n^1/2 | n | n log(n) | n^3/2 | n^2 |
|---|---|---|---|---|---|
| tiny | 10^-11 seconds | 10^-10 seconds | 2x10^-10 seconds | 10^-9 seconds | 10^-8 seconds |
| small | 10^-10 seconds | 10^-8 seconds | 4x10^-8 seconds | 10^-6 seconds | 10^-4 seconds |
| medium | 10^-9 seconds | 10^-6 seconds | 6x10^-6 seconds | .001 seconds | 1 second |
| large | 10^-8 seconds | 10^-4 seconds | 8x10^-4 seconds | 1 second | 2.8 hours |
| huge | 10^-7 seconds | .01 seconds | .1 seconds | 16.7 minutes | 3.2 years |
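Tables 4 through 7 above all follow from the same arithmetic: time = operation count divided by machine speed in floating-point operations per second. A minimal sketch that reproduces them, using the speeds stated in the table captions (the unit breakpoints are chosen to roughly match the tables' own units):

```python
import math

sizes = {"tiny": 1e2, "small": 1e4, "medium": 1e6, "large": 1e8, "huge": 1e10}
classes = {"n^1/2": lambda n: n ** 0.5, "n": lambda n: n,
           "n log(n)": lambda n: n * math.log10(n),
           "n^3/2": lambda n: n ** 1.5, "n^2": lambda n: n ** 2}

# Machine speeds in floating-point operations per second, as assumed
# in the captions of Tables 4 through 7.
machines = {"Pentium PC": 1e7, "SGI Onyx": 3e8,
            "Intel Paragon XP/S A4": 4.2e9, "teraflop computer": 1e12}

def pretty(s):
    """Render a time in seconds using units comparable to the tables'."""
    if s < 60:      return f"{s:.2g} seconds"
    if s < 3600:    return f"{s / 60:.3g} minutes"
    if s < 86400:   return f"{s / 3600:.3g} hours"
    if s < 3.156e7: return f"{s / 86400:.3g} days"
    return f"{s / 3.156e7:.3g} years"

for machine, flops in machines.items():
    print(machine)
    for label, n in sizes.items():
        print(f"  {label:6s} " +
              "  ".join(pretty(f(n) / flops) for f in classes.values()))
```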
Complexity

Table 8: Types of Computers for Interactive Feasibility (Response Time < 1 Second)

| n | n^1/2 | n | n log(n) | n^3/2 | n^2 |
|---|---|---|---|---|---|
| tiny | Personal Computer | Personal Computer | Personal Computer | Personal Computer | Personal Computer |
| small | Personal Computer | Personal Computer | Personal Computer | Personal Computer | Supercomputer |
| medium | Personal Computer | Personal Computer | Personal Computer | Supercomputer | Teraflop Computer |
| large | Personal Computer | Workstation | Supercomputer | Teraflop Computer | --- |
| huge | Personal Computer | Supercomputer | Teraflop Computer | --- | --- |
Complexity

Table 9: Types of Computers for Feasibility (Response Time < 1 Week)

| n | n^1/2 | n | n log(n) | n^3/2 | n^2 |
|---|---|---|---|---|---|
| tiny | Personal Computer | Personal Computer | Personal Computer | Personal Computer | Personal Computer |
| small | Personal Computer | Personal Computer | Personal Computer | Personal Computer | Personal Computer |
| medium | Personal Computer | Personal Computer | Personal Computer | Personal Computer | Personal Computer |
| large | Personal Computer | Personal Computer | Personal Computer | Personal Computer | Teraflop Computer |
| huge | Personal Computer | Personal Computer | Personal Computer | Supercomputer | --- |
Complexity

Table 10: Transfer Times for a Variety of Data Transfer Regimes

| n | standard ethernet, 10 megabits/sec (1.25x10^6 bytes/sec) | fast ethernet, 100 megabits/sec (1.25x10^7 bytes/sec) | hard disk transfer, 2027 kilobytes/sec (2.027x10^6 bytes/sec) | cache transfer @ 200 megahertz (2x10^8 bytes/sec) |
|---|---|---|---|---|
| tiny | 8x10^-5 seconds | 8x10^-6 seconds | 4.9x10^-5 seconds | 5x10^-7 seconds |
| small | 8x10^-3 seconds | 8x10^-4 seconds | 4.9x10^-3 seconds | 5x10^-5 seconds |
| medium | .8 seconds | .08 seconds | .49 seconds | 5x10^-3 seconds |
| large | 1.3 minutes | 8 seconds | 49 seconds | .5 seconds |
| huge | 2.2 hours | 13.3 minutes | 1.36 hours | 50 seconds |
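The entries of Table 10 are simply the data set size in bytes divided by the channel rate in bytes per second. A minimal sketch (times printed in seconds):

```python
# Table 10's entries are just data set size in bytes divided by the
# channel rate in bytes per second.
rates = {
    "standard ethernet (10 megabits/sec)": 1.25e6,
    "fast ethernet (100 megabits/sec)":    1.25e7,
    "hard disk transfer (2027 KB/sec)":    2.027e6,
    "cache transfer @ 200 MHz":            2e8,
}
sizes = {"tiny": 1e2, "small": 1e4, "medium": 1e6, "large": 1e8, "huge": 1e10}

for label, nbytes in sizes.items():
    row = "  ".join(f"{nbytes / r:.3g} s" for r in rates.values())
    print(f"{label:6s} {row}")
```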
Complexity

Table 11: Resolvable Number of Pixels Across Screen for Several Viewing Scenarios

| | 19 inch monitor @ 24 inches | 25 inch TV @ 12 feet | 15 foot screen @ 20 feet | immersion |
|---|---|---|---|---|
| Angle | 39.005° | 9.922° | 41.112° | 140° |
| 5 seconds of arc resolution (Valyus) | 28,084 | 7,144 | 29,601 | 100,800 |
| 1 minute of arc resolution | 2,340 | 595 | 2,467 | 8,400 |
| 3.6 minutes of arc resolution (Wegman) | 650 | 165 | 685 | 2,333 |
| 4.38 minutes of arc resolution (Maar 1) | 534 | 136 | 563 | 1,918 |
| .486 minutes of arc/foveal cone (Maar 2) | 4,815 | 1,225 | 5,076 | 17,284 |
Complexity

Scenarios:
- Typical high resolution workstations: 1280x1024 = 1.31x10^6 pixels
- Realistic, using Wegman, immersion, 4:5 aspect ratio: 2333x1866 = 4.35x10^6 pixels
- Very optimistic, using 1 minute of arc, immersion, 4:5 aspect ratio: 8400x6720 = 5.65x10^7 pixels
- Wildly optimistic, using Maar (2), immersion, 4:5 aspect ratio: 17,284x13,828 = 2.39x10^8 pixels
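Each entry of Table 11 is the viewing angle divided by the angular resolution, once both are in the same units, and the scenario pixel counts multiply the width by a 4:5 aspect-ratio height. A short Python check:

```python
# Pixels resolvable across a display = viewing angle / angular resolution.
# Angles are in degrees (Table 11); resolutions are in minutes of arc.
angles = {"19in monitor @ 24in": 39.005, "25in TV @ 12ft": 9.922,
          "15ft screen @ 20ft": 41.112, "immersion": 140.0}
resolutions = {"Valyus, 5 arcsec": 5 / 60, "1 arcmin": 1.0,
               "Wegman, 3.6 arcmin": 3.6, "Maar 1, 4.38 arcmin": 4.38,
               "Maar 2, .486 arcmin/cone": 0.486}

for rname, res in resolutions.items():
    row = ", ".join(f"{60 * a / res:,.0f}" for a in angles.values())
    print(f"{rname}: {row}")

# Total pixel counts for the scenarios (4:5 aspect ratio):
for w, h in [(1280, 1024), (2333, 1866), (8400, 6720), (17284, 13828)]:
    print(f"{w} x {h} = {w * h:.3g} pixels")
```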
Massive Data Sets

One Terabyte Data Set
vs.
One Million Megabyte Data Sets

Both are difficult to analyze, but for different reasons.
Massive Data Sets: Commonly Used Language

- Data Mining = DM
- Knowledge Discovery in Databases = KDD
- Massive Data Sets = MD
- Data Analysis = DA
Massive Data Sets

Data Mining of Massive Data Sets

Data Mining is Exploratory Data Analysis with little or no human interaction using computationally feasible techniques, i.e., the attempt to find interesting structure unknown a priori.
Statistical Data Mining

Techniques:
- Classification
- Clustering
- Neural Networks & Genetic Algorithms
- CART
- Nonparametric Regression
- Time Series: Trend & Spectral Estimation
- Density Estimation, Bumps and Ridges
Massive Data Sets

Major Issues:
- Complexity
- Non-homogeneity

Examples:
- Huber's Air Traffic Control
- Highway Maintenance
- Ultrasonic NDE
Massive Data Sets

Air Traffic Control:
- 6 to 12 radar stations, several hundred aircraft, a 64-byte record per radar per aircraft per antenna turn
- Roughly a megabyte of data per minute
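A back-of-the-envelope check on the megabyte-per-minute figure. The antenna turn rate and exact counts below are assumptions chosen for illustration; the slide gives only ranges:

```python
# Back-of-the-envelope data rate for Huber's air-traffic-control example.
radars = 10            # "6 to 12 radar stations" (assumed value)
aircraft = 300         # "several hundred aircraft" (assumed value)
record_bytes = 64      # one record per radar per aircraft per antenna turn
turns_per_minute = 6   # assumed ~10-second antenna rotation

bytes_per_minute = radars * aircraft * record_bytes * turns_per_minute
print(f"{bytes_per_minute / 1e6:.1f} MB per minute")  # roughly a megabyte
```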
Massive Data Sets

Highway Maintenance:
- Maintenance records and measurements of road quality for several decades
- Records of uneven quality
- Records missing
Massive Data Sets

NDE Using Ultrasound:
- Inspection of cast iron projectiles
- A time series of length 256 at each of 360 degrees and 550 levels = 50,688,000 observations per projectile
- Several thousand projectiles per day
Massive Data Sets: A Distinction

Human analysis of the structure of the data and its pitfalls
vs.
human analysis of the data itself.

The limits of the human visual system (HVS) and computational complexity constrain the latter; the former is the basis for the design of the analysis engine.
Massive Data Sets

Data Types:
- Experimental
- Observational
- Opportunistic

Data Types:
- Numerical
- Categorical
- Image
Data Preparation
Data Preparation
[Bar chart: Effort (%) devoted to each stage of the process: Objectives Determination, Data Preparation, Data Mining, Analysis & Assimilation. Vertical axis runs 0 to 60 percent.]
Data Preparation
- Data Cleaning and Quality
- Types of Data
- Categorical versus Continuous Data
- Problem of Missing Data
  - Imputation
  - Missing Data Plots
- Problem of Outliers
- Dimension Reduction, Quantization, Sampling
Data Preparation
Quality:
- Data may not have any statistically significant patterns or relationships
- Results may be inconsistent with other data sets
- Data are often of uneven quality, e.g. made up by the respondent
- Opportunistically collected data may have biases or errors
- Discovered patterns may be too specific or too general to be useful
Data Preparation
Noise - Incorrect Values:
- Faulty data collection instruments, e.g. sensors
- Transmission errors, e.g. intermittent errors from satellite or Internet transmissions
- Data entry problems
- Technology limitations
- Naming conventions misused
Data Preparation
Noise - Incorrect Classification:
- Human judgment
- Time-varying data
- Uncertainty/probabilistic nature of data
Data Preparation
Redundant/Stale Data:
- Variables have different names in different databases
- A raw variable in one database is a derived variable in another
- Irrelevant variables destroy speed (dimension reduction needed)
- Changes in a variable over time are not reflected in the database
Data Preparation
- Data cleaning
- Selecting an appropriate data set and/or sampling strategy
- Transformations
Data Preparation
Data Cleaning:
- Duplicate removal (tool based)
- Missing value imputation (manual, statistical)
- Identify and remove data inconsistencies
- Identify and refresh stale data
- Create a unique record (case) ID
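A minimal pandas sketch of a few of these steps on a small hypothetical table; the column names and the median imputation rule are illustrative, not part of the course material:

```python
import pandas as pd

# Hypothetical customer records; names and the median imputation rule
# are illustrative only.
df = pd.DataFrame({
    "name":   ["Ann", "Ann", "Bob", "Cal"],
    "income": [52000, 52000, None, 61000],
})

df = df.drop_duplicates()                                   # duplicate removal
df["income"] = df["income"].fillna(df["income"].median())   # value imputation
df = df.reset_index(drop=True)
df["case_id"] = df.index                                    # unique record ID
print(df)
```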
Data Preparation
Categorical versus Continuous Data:
- Most statistical theory and many graphics tools were developed for continuous data
- Much, if not most, of the data in databases is categorical
- The computer science view often turns continuous data into categorical data, e.g. salaries categorized as low, medium, high, because categories are more suited to Boolean operations
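A small illustration of the discretization the slide describes, using pandas; the salary cut points are hypothetical:

```python
import pandas as pd

salaries = pd.Series([23000, 41000, 58000, 87000, 150000])

# The cut points are hypothetical.  Binning like this suits Boolean
# queries but discards the information a continuous method could use.
bands = pd.cut(salaries, bins=[0, 40000, 80000, float("inf")],
               labels=["low", "medium", "high"])
print(bands.tolist())  # ['low', 'medium', 'medium', 'high', 'high']
```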
Data Preparation
Problem of Missing Values:
- Missing values in massive data sets may or may not be a problem
- Missing data may be irrelevant to the desired result, e.g. cases with missing demographic data may not help if I am trying to create a selection mechanism for good customers based on demographics
- Massive data sets acquired by instrumentation may have few missing values anyway
- Imputation carries model assumptions
- Suggest making a missing value plot
Data Preparation
Missing Value Plot:
- A plot of variables by cases, with missing values colored red
- A special case of the "color histogram" with binary data
- The "color histogram" is also known as the "data image"
- This example is 67 dimensions by 1000 cases
- This example is also fake
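A minimal matplotlib sketch of such a missing value plot. The data here are simulated (as the slide notes, its own example is fake), at the slide's scale of 67 variables by 1000 cases:

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated data matching the slide's example: 67 variables by 1000
# cases, with about 5% of the values knocked out at random.
rng = np.random.default_rng(0)
data = rng.normal(size=(67, 1000))
data[rng.random(data.shape) < 0.05] = np.nan

# One pixel per (variable, case) cell, colored red where missing:
# a binary special case of the color histogram / data image.
plt.imshow(np.isnan(data), aspect="auto", cmap="Reds", interpolation="nearest")
plt.xlabel("case")
plt.ylabel("variable")
plt.title("Missing value plot")
plt.show()
```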
Data Preparation
Problem of Outliers:
- Outliers are easy to detect in low dimensions
- A high dimensional outlier may not show up in low dimensional projections
- The MVE and MCD algorithms are exponentially computationally complex
- Fisher information matrix and convex hull peeling methods are more feasible, but still too complex for massive data sets
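MVE and MCD are not sketched here, but the projection point is easy to demonstrate: a small numpy example, on simulated data, in which an outlier looks unremarkable in every one-dimensional projection yet is extreme once the correlation structure is taken into account via the Mahalanobis distance:

```python
import numpy as np

rng = np.random.default_rng(1)

# A tightly correlated 2-D cloud, plus one point that lies inside both
# marginal ranges but far from the joint structure.
n = 1000
x = rng.normal(size=n)
cloud = np.column_stack([x, x + 0.1 * rng.normal(size=n)])
outlier = np.array([2.0, -2.0])        # unremarkable in either coordinate
X = np.vstack([cloud, outlier])

# Coordinatewise z-scores miss it; the Mahalanobis distance, which
# accounts for the correlation, ranks it first.
mu = X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
d = np.sqrt(np.einsum("ij,jk,ik->i", X - mu, inv_cov, X - mu))
print("outlier z-scores:", np.abs((outlier - mu) / X.std(axis=0)))
print("outlier Mahalanobis rank:", list(np.argsort(-d)).index(n))  # 0 = largest
```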
Data Preparation

Database Sampling:
- Exhaustive search may not be practically feasible because of the database's size
- KDD systems must be able to assist in the selection of appropriate parts of the databases to be examined
- For sampling to work, the data must satisfy certain conditions (not ordered, no systematic biases)
- Sampling can be a very expensive operation, especially when the sample is taken from data stored in a DBMS; sampling 5% of the database can be more expensive than a sequential full scan of the data
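One way to keep sampling down to a single sequential scan, offered as an illustration rather than as part of the course material, is reservoir sampling, which draws a uniform sample of k records in one pass:

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Draw a uniform random sample of k records in one sequential pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(stream):
        if i < k:
            reservoir.append(record)           # fill the reservoir first
        else:
            j = rng.randint(0, i)              # keep record with prob k/(i+1)
            if j < k:
                reservoir[j] = record
    return reservoir

print(reservoir_sample(range(10**6), 5, seed=42))
```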
Data Compression

Data preparation often involves data compression:
- Sampling
- Quantization

This is the subject of my talk later in the conference; see that talk for more details.